CN115862615A - Display device, voice search method and storage medium - Google Patents

Display device, voice search method and storage medium

Info

Publication number
CN115862615A
CN115862615A (application number CN202211428652.6A)
Authority
CN
China
Prior art keywords
media
target
media asset
asset
assets
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211428652.6A
Other languages
Chinese (zh)
Inventor
刘蔚 (Liu Wei)
王娜 (Wang Na)
马宏 (Ma Hong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd
Priority to CN202211428652.6A
Publication of CN115862615A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a display device, a voice search method, and a storage medium, in the technical field of voice interaction. The display device includes: a user input interface configured to acquire a user voice; and a controller configured to: recognize the user voice and obtain keywords in it, the keywords comprising a media asset name, a media asset type, and/or a verb; determine candidate media assets from a target knowledge graph library according to the media asset name, the candidate media assets comprising a first media asset indicated by the media asset name and media assets to be fed back that are associated with the first media asset; and determine, from the candidate media assets, target media assets matching the media asset type and/or verb, and control a display to show the search results corresponding to the target media assets. Embodiments of the disclosure address the difficulty existing voice search methods have in accurately identifying user intent.

Description

Display device, voice search method and storage medium
Technical Field
The present disclosure relates to the field of voice interaction technologies, and in particular, to a display device, a voice search method, and a storage medium.
Background
At present, users increasingly search for content such as popular TV series and songs by voice, and as electronic products become more intelligent, users' expectations for artificial-intelligence language understanding keep rising. In real life, it may be difficult for a user to accurately describe the media asset they want to search for. For example, a user may confuse the title A of a TV series with the title B of that series' theme song and say by voice that they want to watch TV series B; the television then feeds back the Music Video (MV) of the theme song according to the name B included in the voice instruction. Such voice search methods struggle to accurately identify the user's real intention, deviate from the user's actual need, and degrade the user experience.
Disclosure of Invention
In order to solve the above technical problem or at least partially solve the above technical problem, the present disclosure provides a display device, a voice search method, and a storage medium, which can accurately recognize a user's intention and improve a user experience.
In order to achieve the above object, the embodiments of the present disclosure provide the following technical solutions:
in a first aspect, the present disclosure provides a display device comprising:
a user input interface configured to: acquiring user voice;
a controller configured to: recognizing the voice of a user, and acquiring keywords in the voice of the user, wherein the keywords comprise a media asset name, a media asset type and/or a verb;
determining candidate media assets from the target knowledge graph library according to the media asset names, wherein the candidate media assets comprise first media assets indicated by the media asset names and media assets to be fed back and having an association relation with the first media assets;
and determining target assets matched with the asset types and/or verbs from the candidate assets, and controlling a display to display the search results corresponding to the target assets.
In a second aspect, the present disclosure provides a voice search method, including:
acquiring user voice;
recognizing the user voice, and acquiring keywords in the user voice, wherein the keywords comprise a media asset name, a media asset type and/or a verb;
determining candidate media assets from the target knowledge graph library according to the media asset names, wherein the candidate media assets comprise first media assets indicated by the media asset names and media assets to be fed back and having an association relation with the first media assets;
and determining target assets matched with the asset types and/or verbs from the candidate assets, and controlling a display to display the search results corresponding to the target assets.
In a third aspect, the present disclosure provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the voice search method according to the second aspect.
In a fourth aspect, the present disclosure provides a computer program product comprising a computer program which, when run on a computer, causes the computer to implement the voice search method according to the second aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
the embodiment of the disclosure provides a display device, a voice search method and a storage medium, wherein a controller of the display device identifies a user voice acquired by a user input interface to obtain keywords including a media asset name, a media asset type and/or a verb, candidate media assets, a first media asset indicated by the media asset name in the candidate media assets and a to-be-fed media asset in an association relation with the first media asset are determined from a target knowledge spectrum library according to the media asset name, then a target media asset matched with the media asset type and/or the verb is determined from the candidate media assets, and a display is controlled to display a search result corresponding to the target media asset. Under the condition that the name of the media asset searched by the voice of the user is not matched with the type and/or verb of the media asset, the matched result of the media asset searching can be fed back to the user accurately, so that the voice of the user can be analyzed and understood accurately, the real intention of the user can be identified, and the use experience of the user can be improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
To more clearly illustrate the technical solutions in the embodiments of the present disclosure or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic view of a scenario in some embodiments provided by embodiments of the present disclosure;
fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 according to an exemplary embodiment;
fig. 3 illustrates a hardware configuration block diagram of the display apparatus 200 according to an exemplary embodiment;
fig. 4 is a schematic diagram of a software configuration in the display device 200 according to one or more embodiments of the present disclosure;
fig. 5 is a schematic system architecture diagram of a display device according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of a voice interaction network architecture according to an embodiment of the present disclosure;
FIG. 7 is a flowchart illustrating a voice search method provided in an embodiment of the present disclosure;
FIG. 8 is a first schematic diagram of a user interface for voice searching provided by an embodiment of the present disclosure;
fig. 9 is a schematic diagram of an audio-video knowledge graph library provided by an embodiment of the present disclosure;
FIG. 10 is a second schematic diagram of a user interface for voice searching provided by an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a display device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments of the present disclosure may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
The terms "first," "second," "third," and the like in the description and claims of the present disclosure and in the drawings are used for distinguishing between different objects, not for describing a particular order. Furthermore, the terms "include" and "have," and any variations thereof, are intended to cover non-exclusive inclusions: a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to those listed, but may include other steps or elements not listed or inherent to it. The specific meaning of the above terms in this disclosure can be understood by those of ordinary skill in the art on a case-by-case basis. Further, in the description of the present disclosure, "a plurality" means two or more unless otherwise specified. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects.
At present, users increasingly search for media content such as TV series and songs on televisions by voice, and with the wide adoption of smart TVs, users expect ever better language understanding from artificial intelligence. In real life, users often confuse the name of a TV series with the name of its theme song: they may use the theme song's name when searching for the series, or the series' name when searching for the theme song.
For example, the user speaks the instruction "I want to watch the TV series Borrow Another Five Hundred Years from Heaven". However, "Borrow Another Five Hundred Years from Heaven" is not the name of a TV series; it is the theme song of the TV series "Kangxi Dynasty", so the user's true intention is "I want to watch the TV series Kangxi Dynasty". The television, though, has difficulty understanding this intention accurately: after recognizing the instruction, it feeds back the MV of the song "Borrow Another Five Hundred Years from Heaven", which deviates from the user's actual wish to watch the TV series "Kangxi Dynasty" and harms the user experience.
Real-world media assets come in many types; assets with identical or similar names may have different contents and different types, which increases the difficulty of voice search and can prevent users from accurately finding the assets they expect by voice.
To solve some or all of the above technical problems, embodiments of the present disclosure provide a display device, a voice search method, and a storage medium. The display device includes a user input interface and a controller; the user input interface is configured to acquire the user voice, and the controller is configured to: first, recognize the acquired user voice to obtain the keywords it contains, such as a media asset name, a media asset type, and/or a verb; then, determine candidate media assets from a target knowledge graph library according to the media asset name, the candidates including a first media asset indicated by the media asset name and media assets to be fed back that are associated with the first media asset; and finally, determine from the candidates a target media asset matching the media asset type and/or verb, and control a display to show the corresponding search results. In this way, even when the media asset name spoken by the user does not match the spoken media asset type and/or verb, a matched search result is fed back accurately, the user's voice is analyzed and understood correctly, the user's real intention is identified, and the user experience is improved.
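As a minimal sketch of the flow just described (not the patent's actual implementation; the knowledge-graph data, function names, and keyword rules are all hypothetical), the recognize → candidate-lookup → type/verb-matching pipeline might look like:

```python
# Hypothetical sketch of the voice-search flow described above.
# All names and data are illustrative, not the patent's actual code.

KNOWLEDGE_GRAPH = {
    # media asset name -> (asset record, associated assets to feed back)
    "Borrow Another Five Hundred Years from Heaven": {
        "asset": {"name": "Borrow Another Five Hundred Years from Heaven", "type": "song"},
        "associated": [{"name": "Kangxi Dynasty", "type": "tv_series"}],
    },
}

def extract_keywords(utterance):
    """Toy keyword extraction; in practice ASR + NLU would do this."""
    verb = "watch" if "watch" in utterance else ("listen" if "listen" in utterance else None)
    asset_type = "tv_series" if "TV series" in utterance else ("song" if "song" in utterance else None)
    name = next((n for n in KNOWLEDGE_GRAPH if n in utterance), None)
    return {"name": name, "type": asset_type, "verb": verb}

def search(utterance):
    kw = extract_keywords(utterance)
    entry = KNOWLEDGE_GRAPH.get(kw["name"])
    if entry is None:
        return []
    # Candidate assets: the first asset plus the assets to be fed back.
    candidates = [entry["asset"]] + entry["associated"]
    # Target assets: candidates whose type matches the spoken type/verb.
    wanted = kw["type"] or ("tv_series" if kw["verb"] == "watch" else "song")
    return [c for c in candidates if c["type"] == wanted]

results = search("I want to watch the TV series Borrow Another Five Hundred Years from Heaven")
```

With the example utterance from the background section, the mismatch between the spoken name (a song) and the spoken type/verb ("watch the TV series") causes the associated TV series, rather than the song's MV, to be returned.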
Fig. 1 is a schematic view of a scenario in some embodiments provided by embodiments of the present disclosure. As shown in fig. 1, fig. 1 includes a control apparatus 100, a display device 200, a smart device 300, and a server 400. The user can operate the display device 200 through the smart device 300 or the control apparatus 100 to play the audio and video resources on the display device 200.
Taking the case where the user operates the display device 200 through the control apparatus 100 as an example: the user operates the display device 200 through the control apparatus 100 to open a user input interface, such as a microphone, so that the display device 200 can acquire the user voice. The user intends the voice to control the display device 200 to play a media asset. The user input interface of the display device 200 receives the user voice, and the controller of the display device 200 then recognizes it to obtain the keywords it contains: a media asset name, a media asset type, and/or a verb. The controller determines candidate media assets from a target knowledge graph library according to the media asset name, the candidates including a first media asset indicated by the media asset name and media assets to be fed back that are associated with the first media asset; determines from the candidates a target media asset matching the media asset type and/or verb; and controls the display to show the search results corresponding to the target media asset.
Compared with the prior art, which searches only by media asset name, the embodiments of the present disclosure determine candidate media assets from the target knowledge graph library based on the media asset name, correcting deviations in the user's voice instruction and narrowing the asset range. Target service parameters are then calculated from the media asset type and/or verb, so that the service type of the asset the user actually wants is determined, the user's real intention is accurately identified, and a search result is obtained for the target asset matching both the expected name and the asset type and/or verb. Voice search thus becomes more accurate, better meets the user's actual needs, and improves the user experience.
In some embodiments, the control apparatus 100 may be a remote controller, which communicates with the display device via infrared protocol, Bluetooth protocol, or other wireless or wired methods to control the display device 200. The user may input user commands through keys on the remote controller, voice input, control panel input, and the like to control the display device 200. In some embodiments, mobile terminals, tablets, computers, laptops, and other smart devices may also be used to control the display device 200.
In some embodiments, the smart device 300 may install a software application paired with the display device 200 and establish connection communication through a network communication protocol, enabling one-to-one control operation and data communication. Audio and video content displayed on the smart device 300 can also be transmitted to the display device 200 for synchronous display. The display device 200 can perform data communication with the server 400 through multiple communication modes, and may connect over a Local Area Network (LAN), a Wireless Local Area Network (WLAN), or other networks. The server 400 may provide various content and interactions to the display device 200. The display device 200 may be a liquid crystal display, an OLED display, or a projection display device, and may additionally provide a smart network TV function that offers computer-support functions in addition to the broadcast-receiving TV function.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 according to an exemplary embodiment. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an input operation instruction from a user and convert the operation instruction into an instruction recognizable and responsive by the display device 200, serving as an interaction intermediary between the user and the display device 200. The communication interface 130 is used for communicating with the outside, and includes at least one of a WIFI chip, a bluetooth module, NFC, or an alternative module. The user input/output interface 140 includes at least one of a microphone, a touch pad, a sensor, a key, or an alternative module.
Fig. 3 shows a hardware configuration block diagram of the display apparatus 200 according to an exemplary embodiment. The display device 200 shown in fig. 3 includes: a tuner demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a user input interface 280, memory, a power supply, and the like. The controller 250 includes a central processing unit, a video processor, an audio processor, a graphic processor, a RAM, a ROM, a first interface to an nth interface for input/output, among others. The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device and a projection screen. The tuner demodulator 210 receives a broadcast television signal through a wired or wireless reception manner, and demodulates an audio/video signal, such as an EPG data signal, from a plurality of wireless or wired broadcast television signals. The detector 230 is used to collect signals of the external environment or interaction with the outside. The controller 250 and the tuner-demodulator 210 may be located in different separate devices, that is, the tuner-demodulator 210 may also be located in an external device of the main device where the controller 250 is located, such as an external set-top box.
In some embodiments, the display device is a terminal device with a display function, such as a television, a mobile phone, a computer, a learning machine, and the like.
In some embodiments, the controller 250 controls the operation of the display device and responds to user operations through various software control programs stored in memory. The controller 250 controls the overall operation of the display apparatus 200. A user may input a user command on a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.
An output interface (display 260, and/or audio output interface 270) configured to output user interaction information;
the communicator 220 is a component for communicating with an external device or a server according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, and other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The display apparatus 200 may establish transmission and reception of control signals and data signals by the server 400 through the communicator 220.
The user input interface 280 may be used to receive external control signals.
The detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for collecting ambient light intensity; alternatively, the detector 230 includes an image collector, such as a camera, which may be used to collect external environment scenes, attributes of the user, or user interaction gestures, or the detector 230 includes a sound collector, such as a microphone, which is used to receive external sounds.
The sound collector may be a microphone, used to receive the user's voice and convert the sound signal into an electrical signal. The display device 200 may be provided with at least one microphone. In other embodiments, the display device 200 may be provided with two microphones to achieve noise reduction in addition to collecting sound signals, or with three, four, or more microphones to additionally identify sound sources and support directional recording.
In addition, the microphone may be built into the display device 200, or connected to it by wire or wirelessly; the position of the microphone on the display device 200 is not limited in the embodiments of the present application. Alternatively, the display device 200 may not include a built-in microphone and may instead be connected to an external microphone via an interface (e.g., the USB interface 130). The external microphone may be fixed to the display device 200 by an external fixing member (e.g., a camera holder with a clip).
The disclosed embodiment provides a display apparatus 200, and the display apparatus 200 includes:
a user input interface 280 configured to: acquiring user voice;
a controller 250 configured to: recognizing the voice of a user, and acquiring keywords in the voice of the user, wherein the keywords comprise a media asset name, a media asset type and/or a verb;
determining candidate media assets from the target knowledge graph library according to the media asset names, wherein the candidate media assets comprise first media assets indicated by the media asset names and media assets to be fed back and having an association relation with the first media assets;
and determining target assets matched with the asset types and/or verbs from the candidate assets, and controlling the display 260 to display the search results corresponding to the target assets.
By identifying the media asset name together with the media asset type and/or verb in the user's voice, the display device 200 corrects possible deviations in the voice instruction, accurately identifies the user's intention, avoids inaccurate search results caused by confused asset names, and improves user friendliness and the user experience.
In some embodiments, the target knowledge graph library is at least one of several preset knowledge graph libraries. The preset knowledge graph libraries include: a first error-correction knowledge graph library, a second error-correction knowledge graph library, and an audio-video knowledge graph library. The first error-correction library contains media assets whose name pronunciation similarity is greater than a first similarity threshold but whose asset types differ. The second error-correction library contains media assets whose name pronunciation similarity is greater than a second similarity threshold but whose asset contents differ, the second similarity threshold being greater than the first. The audio-video library contains video media assets and music media assets that have corresponding relations with each other.
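The structure of the error-correction libraries can be sketched as follows. This is an illustrative assumption: the patent describes pronunciation similarity between asset names, for which plain string similarity (`difflib`) stands in here, and the thresholds and sample assets are invented for the example:

```python
# Illustrative sketch of the preset error-correction knowledge graph libraries.
# difflib string similarity stands in for the pronunciation similarity the
# patent describes; thresholds and data are hypothetical.
from difflib import SequenceMatcher

FIRST_SIM_THRESHOLD = 0.6   # first library: similar names, different types
SECOND_SIM_THRESHOLD = 0.8  # second library: stricter threshold, different contents

def name_similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def build_error_correction_pairs(assets, threshold, differ_on):
    """Pair up assets whose names are similar above `threshold` but which
    differ in the field `differ_on` ('type' or 'content')."""
    pairs = []
    for i, a in enumerate(assets):
        for b in assets[i + 1:]:
            if (name_similarity(a["name"], b["name"]) > threshold
                    and a[differ_on] != b[differ_on]):
                pairs.append((a["name"], b["name"]))
    return pairs

assets = [
    {"name": "Kangxi Dynasty", "type": "tv_series", "content": "drama"},
    {"name": "Kangxi Dynasty OST", "type": "song", "content": "music"},
]
first_library = build_error_correction_pairs(assets, FIRST_SIM_THRESHOLD, "type")
```

Because the second threshold is stricter than the first, the second library pairs only near-identical names, matching the relation between the two thresholds stated above.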
In some embodiments, the number of target assets is multiple;
a controller 250, for controlling the display 260 to display the search results corresponding to the target media assets, configured to: acquire historical search records and determine first ranking weights of the multiple target media assets according to the historical search records; acquire resource popularity parameters of the multiple target media assets and determine second ranking weights according to them; calculate target ranking weights from the first and second ranking weights; and control the display 260 to display the search results corresponding to the target media assets according to the target ranking weights.
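A minimal sketch of the two-weight ranking, assuming relative frequency for the history weight, normalized popularity for the heat weight, and a weighted sum for the combination (the patent does not specify these formulas; all names and numbers are hypothetical):

```python
# Hypothetical sketch of combining a history-based first ranking weight and a
# popularity-based second ranking weight into a target ranking weight.

def first_weight(asset_name, history):
    """Weight from the user's historical search records (relative frequency)."""
    return history.count(asset_name) / len(history) if history else 0.0

def second_weight(heat, max_heat):
    """Weight from the asset's popularity ('resource heat') parameter."""
    return heat / max_heat if max_heat else 0.0

def target_weight(w1, w2, alpha=0.5):
    # Weighted sum; the mixing factor alpha is an assumption.
    return alpha * w1 + (1 - alpha) * w2

history = ["Kangxi Dynasty", "Kangxi Dynasty", "Ode to Joy", "Kangxi Dynasty"]
heats = {"Kangxi Dynasty": 90, "Ode to Joy": 60}
max_heat = max(heats.values())

# Display order: highest target ranking weight first.
ranked = sorted(
    heats,
    key=lambda n: target_weight(first_weight(n, history), second_weight(heats[n], max_heat)),
    reverse=True,
)
```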
In some embodiments, the controller 250, in determining a target media asset matching the media asset type and/or verb from the candidate media assets, is configured to: determine the service parameter corresponding to the media asset type and/or verb and the service parameter corresponding to the actual asset type of the first media asset, so as to calculate a target service parameter, the target service parameter being the service parameter corresponding to the user voice; and determine the target media asset from the candidate media assets according to the target service parameter.
In some embodiments, the controller 250, in determining a target media asset from the candidate media assets according to the target service parameter, is configured to: determine a second media asset from the media assets to be fed back according to the target service parameter; if the type of the second media asset is the same as that of the first media asset, control the display 260 to show the search results corresponding to both the first and second media assets; and if the types differ, control the display 260 to show the search result corresponding to the second media asset only.
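The display decision above can be sketched as a small branch (hypothetical names; a sketch, not the patented implementation):

```python
# Sketch of the display decision: once a second media asset is chosen from the
# assets to be fed back, which results are shown depends on whether its type
# matches the first asset's type. Names and data are illustrative.

def results_to_display(first_asset, second_asset):
    if second_asset["type"] == first_asset["type"]:
        # Same type: show both the first and the second asset.
        return [first_asset, second_asset]
    # Different type: the second asset better matches the user's intent.
    return [second_asset]

song = {"name": "Borrow Another Five Hundred Years from Heaven", "type": "song"}
series = {"name": "Kangxi Dynasty", "type": "tv_series"}
```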
In some embodiments, the target knowledge graph library is the second error-correction knowledge graph library, which contains media assets whose name pronunciation similarity is greater than the second similarity threshold but whose asset contents differ.
Before determining the service parameter corresponding to the media asset type and/or verb and the service parameter corresponding to the actual asset type of the first media asset to calculate the target service parameter, the controller 250 is further configured to: judge whether the first media asset and the media assets to be fed back are of the same type.
The controller 250, in determining these service parameters to calculate the target service parameter, is configured to: in the case where the first media asset and the media assets to be fed back are of different types, determine the service parameter corresponding to the media asset type and/or verb and the service parameter corresponding to the actual asset type of the first media asset, so as to calculate the target service parameter.
In some embodiments, the controller 250, after recognizing the user voice and obtaining the keywords in it, and before determining the candidate media assets from the target knowledge graph library according to the media asset name, is further configured to: judge whether the media asset name corresponds to the media asset type and/or the verb; and if the media asset name does not correspond to the media asset type and/or verb, determine the target knowledge graph library according to the media asset name.
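The consistency check described above can be sketched as follows, assuming a lookup table of actual asset types and a verb-to-type mapping, both invented for illustration:

```python
# Sketch of the check described above: the knowledge graph library is only
# consulted when the spoken asset name does not correspond to the spoken
# asset type or verb. The tables and names are illustrative assumptions.

ACTUAL_TYPES = {
    "Borrow Another Five Hundred Years from Heaven": "song",
    "Kangxi Dynasty": "tv_series",
}
VERB_TO_TYPE = {"watch": "tv_series", "listen": "song"}

def needs_graph_lookup(name, spoken_type=None, verb=None):
    actual = ACTUAL_TYPES.get(name)
    if actual is None:
        return False  # unknown name: nothing to correct against
    implied = spoken_type or VERB_TO_TYPE.get(verb)
    # Mismatch between the implied type and the asset's actual type
    # triggers the knowledge-graph candidate lookup.
    return implied is not None and implied != actual
```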
Fig. 4 is a schematic diagram illustrating a software configuration in a display device 200 according to one or more embodiments of the present disclosure. As shown in fig. 4, the system is divided into four layers, from top to bottom: an application (Applications) layer (the "application layer"), an application framework (Application Framework) layer (the "framework layer"), the Android runtime and system library layer (the "system runtime library layer"), and the kernel layer. The kernel layer comprises at least one of the following drivers: audio driver, display driver, Bluetooth driver, camera driver, WIFI driver, USB driver, HDMI driver, sensor driver (such as a fingerprint sensor, temperature sensor, or pressure sensor), power driver, and so on.
In some examples, taking the Android system as an example of the operating system of the smart device, as shown in fig. 5, fig. 5 is a schematic system architecture diagram of a display device according to an embodiment of the present disclosure. The display device 200 may be logically divided into an application (Applications) layer (referred to as the "application layer") 21, a kernel layer 22, and a hardware layer 23.
As shown in fig. 5, the hardware layer may include the controller 250, the communicator 220, the detector 230, and the like shown in fig. 3. The application layer 21 includes one or more applications. The application may be a system application or a third party application. For example, the application layer 21 includes a voice recognition application, which may provide voice interaction interfaces and services for connection of the display device 200 with the server 400.
The kernel layer 22 acts as software middleware between the hardware layer and the application layer 21 for managing and controlling hardware and software resources.
In some examples, the kernel layer 22 includes a detector driver to send voice data collected by the detector 230 to a voice recognition application. Illustratively, when the voice recognition application in the display device 200 is started and the display device 200 establishes a communication connection with the server 400, the detector driver is configured to transmit the voice data input by the user, collected by the detector 230, to the voice recognition application. The speech recognition application then sends query information containing the speech data to the intent recognition module 202 in the server. The intention recognition module 202 is used to input the voice data transmitted by the display device 200 to the intention recognition model.
For clarity of explanation of the embodiments of the present disclosure, a speech recognition network architecture provided by the embodiments of the present disclosure is described below with reference to fig. 6.
Referring to fig. 6, fig. 6 is a schematic diagram of a voice interaction network architecture according to an embodiment of the present disclosure. In fig. 6, the display device receives input information and outputs a processing result of the information. An Automatic Speech Recognition (ASR) module is deployed with a speech recognition service for recognizing audio as text; a Natural Language Understanding (NLU) module is deployed with a semantic understanding service for performing semantic analysis on the text; a Dialogue Management (DM) module is deployed with a business instruction management service for providing business instructions; a Natural Language Generation (NLG) module is deployed with a language generation service for converting instructions that instruct the display device into text; and a speech synthesis module is deployed with a Text-to-Speech (TTS) service for processing the text corresponding to an instruction and sending it to a loudspeaker for broadcasting. In one embodiment, multiple entity service devices deployed with different business services may exist in the architecture shown in fig. 6, and one or more function services may also be aggregated in one or more entity service devices.
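For illustration only, the module chain described above can be sketched as a pipeline of stubbed services. All function bodies here are stand-ins (the real ASR, NLU, DM, NLG, and TTS services would call trained models and business back ends); the function names and return shapes are assumptions, not the patented implementation.

```python
# Illustrative sketch of the ASR -> NLU -> DM -> NLG -> TTS chain of fig. 6.
# Every body is a stub; only the data flow between the modules is shown.

def asr(audio: bytes) -> str:
    """Speech recognition service: recognizes audio as text (stubbed)."""
    return audio.decode("utf-8")  # stand-in: treat the audio as its transcript

def nlu(text: str) -> dict:
    """Semantic understanding service: parses text into structured intent."""
    intent = "search_media" if ("watch" in text or "play" in text) else "unknown"
    return {"domain": "media", "intent": intent, "slots": {"query": text}}

def dm(parsed: dict) -> dict:
    """Dialogue management service: maps an intent to a business instruction."""
    return {"action": parsed["intent"], "args": parsed["slots"]}

def nlg(instruction: dict) -> str:
    """Language generation service: converts an instruction into display text."""
    return f"Executing {instruction['action']} for '{instruction['args']['query']}'"

def tts(text: str) -> bytes:
    """Speech synthesis service: renders text for the loudspeaker (stubbed)."""
    return text.encode("utf-8")

def handle_voice_command(audio: bytes) -> bytes:
    return tts(nlg(dm(nlu(asr(audio)))))

print(handle_voice_command(b"play Kangxi Dynasty").decode("utf-8"))
```

In a deployment, each stage could live on a separate entity service device, or several stages could be aggregated on one device, as the paragraph above notes.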
In some embodiments, the following describes an example of a process for processing information input to a display device based on the architecture shown in fig. 6, where the information input to the display device is a voice instruction input by voice:
the voice recognition display device may perform noise reduction processing and feature extraction on the audio of a voice command after receiving the voice command input through voice, where the noise reduction processing may include steps of removing echo and ambient noise.
Semantic understanding: natural language understanding is performed on the recognized candidate texts and the associated context information by using an acoustic model and a language model, and the text is parsed into structured, machine-readable information such as business domain, intent, and word slots, so as to express the semantics. An actionable intent and an intent confidence score are derived, and the semantic understanding module selects one or more candidate actionable intents based on the determined intent confidence scores.
the semantic understanding module issues an execution instruction to a corresponding service management module according to a semantic analysis result of a text of the voice instruction so as to execute an operation corresponding to the voice instruction, complete the operation requested by a user, and feed back the execution result of the operation corresponding to the voice instruction.
For a more detailed explanation of the present solution, an exemplary description is provided below with reference to fig. 7. It is understood that, in actual implementation, the steps involved in fig. 7 may include more steps or fewer steps, and the order between the steps may also differ, as long as the voice search method provided in the embodiments of the present disclosure can be implemented.
As shown in fig. 7, fig. 7 is a schematic flow chart of a voice search method provided in the embodiment of the present disclosure, where the method includes the following steps S701 to S704:
and S701, acquiring the voice of the user.
In some embodiments, the display device obtains the user speech through the user input interface, or obtains the user speech input by the user through a voice device externally connected to the user input interface.
In some embodiments, after the user speech is acquired through the user input interface of the display device, the user speech is preprocessed. The preprocessing includes, but is not limited to, at least one of denoising and human voice extraction, which the present disclosure does not limit.
In some embodiments, a user opens the user input interface of the display device through a control device or a smart device, and the display device displays a voice search user interface to remind the user to start speaking. As shown in fig. 8, fig. 8 is a schematic view of the voice search user interface provided by the embodiment of the present disclosure, in which a microphone icon reminds the user to start speaking to the display device. Fig. 8 is only an exemplary illustration, and the present disclosure does not specifically limit the voice search user interface.
S702, recognizing the voice of the user and acquiring the keywords in the voice of the user.
The keywords include a media asset name, a media asset type and/or a verb; it can be understood that the keywords may include a media asset name and a media asset type, a media asset name and a verb, or a media asset name, a media asset type and a verb.
Illustratively, the media asset name may be "Borrow Another Five Hundred Years", "Ashes of Love", etc.; the media asset type may be "TV drama", "song", "movie", "variety show", etc.; and the verb may be "play", "watch", "listen", or "introduce".
In some embodiments, in the process of recognizing the user speech, the user speech is first converted into text characters, and the text characters are subjected to word segmentation processing to obtain keywords.
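As a minimal illustration of this step, the sketch below segments the recognized text against small stand-in dictionaries. The titles, types, and verbs listed are assumptions for illustration only; a real implementation would use the main dictionary and a proper word-segmentation model.

```python
# Sketch of step S702: pick the media asset name, media asset type, and verb
# out of the recognized text. The tiny dictionaries are illustrative stand-ins.

ASSET_NAMES = {"Borrow Another Five Hundred Years", "Kangxi Dynasty"}
ASSET_TYPES = {"tv drama", "song", "movie"}
VERBS = {"watch", "play", "listen"}

def extract_keywords(text: str) -> dict:
    keywords = {"name": None, "type": None, "verb": None}
    lowered = text.lower()
    for name in ASSET_NAMES:          # longest-entry dictionary matching
        if name.lower() in lowered:
            keywords["name"] = name
    for asset_type in ASSET_TYPES:
        if asset_type in lowered:
            keywords["type"] = asset_type
    for verb in VERBS:
        if verb in lowered:
            keywords["verb"] = verb
    return keywords

print(extract_keywords("I want to watch the tv drama Borrow Another Five Hundred Years"))
```

Note that all three keyword slots are optional, matching the statement above that the keywords may be any combination of name, type, and verb.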
Taking table 1 as an example, the user voices and the keywords included therein are shown in table 1.
TABLE 1
[Table 1 is presented as an image in the original document and is not reproduced here.]
In some embodiments, the display device may be connected to the server through the communication module, and send the acquired user speech to the server, so that the server recognizes the user speech, and the display device receives a recognition result of the server on the user speech. Of course, the recognition of the user voice may be performed by the display device, as described in step S702 and any embodiment included therein, or only part of the voice information that needs to be processed by the server may be sent to the server, which is not limited in this disclosure.
In the embodiment, the media asset name and the media asset type and/or verb included in the user voice are obtained by performing voice recognition on the user voice, so that the real intention included in the user voice is accurately understood according to the keyword.
And S703, determining candidate media assets from the target knowledge graph library according to the media asset name.
The target knowledge graph library is at least one knowledge graph library among preset knowledge graph libraries. The preset knowledge graph libraries include: a first error-correction knowledge graph library, a second error-correction knowledge graph library, and a film-and-music knowledge graph library.
The first error-correction knowledge graph library includes media assets whose media asset name pronunciation similarity is greater than a first similarity threshold but whose media asset types differ. The first similarity threshold is a preset threshold for distinguishing whether the pronunciations of media asset names are similar, and may typically be set to 60%. Pronunciation similarity covers cases such as similar front/back nasal sounds and similar tones; for example, the TV series "Deep Emotion Coming" and the song "Shallow Emotion Coming" belong to the same first error-correction knowledge graph library.
The second error-correction knowledge graph library includes media assets whose media asset name pronunciation similarity is greater than a second similarity threshold but whose media asset contents differ. The second similarity threshold is a preset threshold for distinguishing whether the pronunciations of media asset names are identical; it is greater than the first similarity threshold and is usually set to 100%. For example, for "Sound Life Staying" and "Life Staying", the former is a variety show and the latter is music.
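The two thresholds can be sketched as follows. A real implementation would compare pinyin (including tone and front/back nasal variants); here `difflib` on romanized strings is an assumed stand-in for the pronunciation-similarity measure, and the threshold values simply reproduce the 60% and 100% figures above.

```python
# Hedged sketch of the two error-correction thresholds. difflib stands in
# for a proper pinyin-based pronunciation-similarity function.
from difflib import SequenceMatcher

FIRST_THRESHOLD = 0.60   # similar pronunciation -> first error-correction library
SECOND_THRESHOLD = 1.00  # identical pronunciation -> second error-correction library

def pron_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def classify(a: str, b: str) -> str:
    similarity = pron_similarity(a, b)
    if similarity >= SECOND_THRESHOLD:
        return "second error-correction library (identical pronunciation)"
    if similarity > FIRST_THRESHOLD:
        return "first error-correction library (similar pronunciation)"
    return "unrelated"

# "gu yu": identical romanization, different media asset content.
print(classify("gu yu", "gu yu"))
```

Under these assumptions, two names with identical romanization fall into the second library, while merely similar-sounding names fall into the first.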
The film-and-music knowledge graph library includes video media assets and music media assets, with correspondences between them: for example, the video media asset "Kangxi Dynasty" corresponds to the music media asset "Borrow Another Five Hundred Years", and the video media asset "Ashes of Love" corresponds to the music media assets "Left Hand Pointing at the Moon" and "Unstained". It should be emphasized that video media assets include, but are not limited to, TV series, movies, variety shows, documentaries, and operas, and the present disclosure is not limited thereto.
In each knowledge graph library of the preset knowledge graph libraries, each media asset is a node, and the association relations between media assets are edges.
Exemplarily, as shown in fig. 9, fig. 9 is a schematic diagram of the film-and-music knowledge graph library provided in an embodiment of the present disclosure. The nodes in the figure include the video media assets "Kangxi Dynasty" and "Ashes of Love", and the music media assets "Borrow Another Five Hundred Years", "Left Hand Pointing at the Moon" and "Unstained". A correspondence exists between "Kangxi Dynasty" and "Borrow Another Five Hundred Years": the two are nodes, the correspondence between them is an edge, and the length of the edge between nodes is related to the strength of the correspondence, for which reference may be made to the prior art; it is not described in detail here. Correspondences also exist between "Ashes of Love" and each of "Left Hand Pointing at the Moon" and "Unstained": the three are nodes, and the correspondences among them are edges.
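A graph of this shape can be sketched as a weighted adjacency map. The titles and edge weights below are illustrative assumptions (a larger weight standing in for a stronger correspondence, i.e. a shorter edge):

```python
# Sketch of the film-and-music knowledge graph of fig. 9: media assets are
# nodes, correspondence relations are weighted edges. Values are illustrative.

graph = {
    "Kangxi Dynasty": [("Borrow Another Five Hundred Years", 0.9)],
    "Borrow Another Five Hundred Years": [("Kangxi Dynasty", 0.9)],
    "Ashes of Love": [("Left Hand Pointing at the Moon", 0.8), ("Unstained", 0.8)],
    "Left Hand Pointing at the Moon": [("Ashes of Love", 0.8)],
    "Unstained": [("Ashes of Love", 0.8)],
}

def related_assets(name: str) -> list[str]:
    """Return the media assets to be fed back that are associated with `name`."""
    return [node for node, _weight in graph.get(name, [])]

print(related_assets("Ashes of Love"))
```

Looking up a video asset returns its corresponding music assets and vice versa, which is exactly the candidate-asset expansion step S703 performs.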
The candidate media assets include a first media asset indicated by the media asset name and media assets to be fed back that have an association relation with the first media asset.
In some embodiments, after the user speech is recognized to obtain the media asset name included in the speech, a quantized flag value corresponding to the media asset name is determined. If the flag value is 0, it indicates that the first media asset indicated by the media asset name is unique in media asset type and content, and that no other media asset can be confused with it. For this case, the embodiment of the present disclosure provides an implementation in which no preset knowledge graph library is invoked: the first media asset indicated by the media asset name is obtained directly using the word library labels in the main dictionary and text logic reasoning, and the display device controls the display to display the search result of the first media asset. The search result of the first media asset includes, but is not limited to: the label of the first media asset, the detail information of the first media asset, and the media asset content of the first media asset.
Illustratively, the flag value corresponding to the asset name in the voice of the user is determined to be 0, the search result of the first asset indicated by the asset name is directly fed back to the user, and the display displays a poster and a summary of the first asset, wherein the poster is a picture link of the detail page of the first asset. It will be appreciated that the user jumps to the first asset details page by clicking on the first asset poster to view the details of the first asset.
If the flag value is 1, it indicates that other media assets whose pronunciation is similar to the media asset name exist, or that the media asset name has other corresponding media assets. For this case, the embodiment of the present disclosure provides an implementation in which the target knowledge graph library is determined to be the first error-correction knowledge graph library and/or the film-and-music knowledge graph library.
Further, the first media asset indicated by the media asset name in the target knowledge graph library and the media assets to be fed back that have an association relation with the first media asset are taken as the candidate media assets. Optionally, the candidate media assets include the first media asset indicated by the media asset name in the first error-correction knowledge graph library, together with the media assets to be fed back whose pronunciation similarity to it is greater than the first similarity threshold but whose media asset types differ. Or, the candidate media assets include the first media asset indicated by the media asset name in the film-and-music knowledge graph library, together with its corresponding media assets to be fed back: if the first media asset is a video media asset, the media asset to be fed back is the music media asset corresponding to it; if the first media asset is a music media asset, the media asset to be fed back is the video media asset corresponding to it. Or, the candidate media assets include both of the above: the first media asset indicated by the media asset name in the first error-correction knowledge graph library with its similar-sounding media assets to be fed back, and the first media asset indicated by the media asset name in the film-and-music knowledge graph library with its corresponding media assets to be fed back.
If the flag value is 2, it indicates that other media assets whose pronunciation is identical to the media asset name exist. The embodiment of the present disclosure provides an implementation in which, when the flag value is 2, the target knowledge graph library is determined to be the second error-correction knowledge graph library, and the first media asset indicated by the media asset name in the second error-correction knowledge graph library, together with the media assets to be fed back whose pronunciation similarity to the media asset name is greater than the second similarity threshold but whose media asset contents differ, are taken as the candidate media assets.
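The three-way dispatch on the flag value can be sketched as follows. The flag table and the example names are assumptions for illustration; how the flag value itself is quantized is left to the main-dictionary logic described above.

```python
# Sketch of the flag-value dispatch: 0 -> answer from the main dictionary
# without any knowledge graph library; 1 -> first error-correction and/or
# film-and-music library; 2 -> second error-correction library.

FLAGS = {
    "Unique Title": 0,                       # unique type/content, nothing confusable
    "Borrow Another Five Hundred Years": 1,  # similar-sounding or corresponding assets
    "gu yu": 2,                              # identically pronounced assets exist
}

def select_target_libraries(asset_name: str) -> list[str]:
    flag = FLAGS.get(asset_name, 0)
    if flag == 0:
        return []  # feed back the first media asset directly, no graph lookup
    if flag == 1:
        return ["first error-correction library", "film-and-music library"]
    return ["second error-correction library"]

print(select_target_libraries("Borrow Another Five Hundred Years"))
```

An empty list models the flag-0 fast path in which the search result of the first media asset is displayed without consulting any preset knowledge graph library.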
In some embodiments, after the user speech is recognized to obtain the keywords included in it: if the keywords include a media asset name and a media asset type, it is determined whether the media asset name corresponds to the media asset type; if the keywords include a media asset name and a verb, it is determined whether the media asset name corresponds to the verb; and if the keywords include a media asset name, a media asset type and a verb, it is determined whether the media asset name corresponds to the media asset type and the verb.
If the media asset name does not correspond to the media asset type, it indicates a deviation between the content described by the user speech and the media asset the user actually expects to search for. If the media asset name does not correspond to the verb, or does not correspond to the media asset type and the verb, the implementation is the same as in the case where the media asset name does not correspond to the media asset type, and is not described again here.
Illustratively, the user speech is "I want to watch the TV drama 'Borrow Another Five Hundred Years'". The obtained media asset name is "Borrow Another Five Hundred Years", the media asset type is "TV drama", and the verb is "watch". However, "Borrow Another Five Hundred Years" is the name of a song, and its media asset type is "song", which means that the media asset name in the user speech does not correspond to the media asset type and the content described by the user speech does not match the media asset it indicates. The film-and-music knowledge graph library is then queried from the preset knowledge graph libraries according to the media asset name, so as to obtain the candidate media assets from it.
S704, determining target media assets matched with the media asset types and/or verbs from the candidate media assets, and controlling a display to display the search results corresponding to the target media assets.
In some embodiments, the service parameters of the asset type and the verb are preset. The service parameters include, but are not limited to: video service parameters, music service parameters, encyclopedia service parameters, which are not specifically limited by this disclosure.
As shown in table 2, table 2 shows service parameters corresponding to part of the preset media asset types and service parameters corresponding to verbs.
TABLE 2
[Table 2 is presented as an image in the original document and is not reproduced here.]
It should be noted that the values of the service parameters corresponding to media asset names are not shown in table 2; the actual media asset type of the first media asset indicated by a media asset name can be obtained by searching according to the media asset name. For example, the actual media asset type of "Borrow Another Five Hundred Years" is "song". Determining the actual media asset type of the first media asset according to the media asset name may refer to the prior art and is not described in detail here.
In some embodiments, according to the preset service parameters of media asset types and verbs, the service parameter corresponding to the media asset type included in the user speech (hereinafter the "first media asset type", to distinguish it from the actual media asset type of the first media asset) and/or the service parameter corresponding to the verb are determined, as well as the service parameter corresponding to the actual media asset type of the first media asset. The target service parameter is then calculated from the service parameter corresponding to the first media asset type and/or the service parameter corresponding to the verb, together with the service parameter corresponding to the actual media asset type. The target service parameter includes at least one type of service parameter.
Illustratively, the user speech is "I want to watch the TV drama 'Borrow Another Five Hundred Years'". The video service parameter corresponding to the first media asset type "TV drama" is 0.5, and the video service parameter corresponding to the verb "watch" is 0.5; for the actual media asset type "song" of "Borrow Another Five Hundred Years", the video service parameter is 0 and the music service parameter is 0.5. The resulting target service parameter therefore has a video service parameter of 1 and a music service parameter of 0.5: the user expects to watch a video, while the media asset name included in the speech points to a song. After comparing the video service parameter and the music service parameter included in the target service parameter, it is determined that the user's real intention is to watch a video, which indicates that the media asset name included in the user speech does not match the media asset the user actually expects.
Further, after the target service parameter is obtained by calculation, the target media asset is determined from the candidate media assets according to the target service parameter. Optionally, when the target service parameter includes more than one service parameter, the target media asset may be determined from the candidate media assets according to the larger service parameter. The target media assets include a second media asset determined from the media assets to be fed back, and may further include the first media asset.
The embodiment of the present disclosure provides an implementation in which the second media asset is determined from the media assets to be fed back according to the target service parameter. It can be understood that the target service parameter includes at least one service parameter, for example a video service parameter and a music service parameter. The media assets to be fed back are screened according to the at least one target service parameter, and the media asset to be fed back that matches the target service type is taken as the second media asset, where the target service type is the type corresponding to the target service parameter, such as the video service or the music service. Further, whether the type of the second media asset is the same as that of the first media asset is compared: if they are the same, both the first media asset and the second media asset are target media assets, and the display is controlled to display the search results corresponding to both. If they are different, the search result corresponding to the second media asset is displayed, the second media asset being the target media asset that the user really expects to search for.
Following the above example, after the target service parameter is obtained by calculation, it is determined that the media asset name included in the user speech does not match the media asset the user actually expects to search for, i.e., there is a deviation. The media asset to be fed back that has an association relation with "Borrow Another Five Hundred Years" is determined from the candidate media assets obtained in step S703: "Kangxi Dynasty". Its media asset type is "TV drama", while the media asset type of the first media asset "Borrow Another Five Hundred Years" is "song"; since the two types differ, "Kangxi Dynasty" is taken as the second media asset and fed back to the user as the target media asset, and the search result corresponding to "Kangxi Dynasty" is displayed.
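The service-parameter arithmetic of this example can be worked through in a few lines. The per-service tables below are illustrative stand-ins for table 2 (the 0.5 values reproduce the example; the real table is an image in the original); parameters for the spoken type, the verb, and the first asset's actual type are summed per service, and the larger total decides the real intent.

```python
# Worked sketch of the target-service-parameter calculation. Values mirror
# the "I want to watch the TV drama ..." example; the tables are illustrative.

TYPE_PARAMS = {"tv drama": {"video": 0.5}, "song": {"music": 0.5}}
VERB_PARAMS = {"watch": {"video": 0.5}, "listen": {"music": 0.5}}

def target_service_params(spoken_type: str, verb: str, actual_type: str) -> dict:
    totals = {"video": 0.0, "music": 0.0}
    for table, key in ((TYPE_PARAMS, spoken_type),
                       (VERB_PARAMS, verb),
                       (TYPE_PARAMS, actual_type)):
        for service, value in table.get(key, {}).items():
            totals[service] += value
    return totals

totals = target_service_params("tv drama", "watch", "song")
print(totals)                       # video 1.0 vs music 0.5
print(max(totals, key=totals.get))  # the larger total reveals the real intent
```

Since the video total (1.0) exceeds the music total (0.5), the second media asset is selected from the video media assets to be fed back, exactly as in the example.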
In some embodiments, in step S703, the target knowledge graph library is determined to be the second error-correction knowledge graph library according to the media asset name. The second error-correction knowledge graph library includes media assets whose media asset name pronunciation similarity is greater than the second similarity threshold but whose media asset contents differ; optionally, it includes media assets whose pronunciations are identical but whose contents differ, such as the TV drama "Bone Language" and the encyclopedia entry "Valley Rain", both pronounced "guyu" but with different media asset contents. The embodiment of the present disclosure provides an implementation in which, before determining the service parameter corresponding to the media asset type and/or the verb, it is determined whether the media asset type of the first media asset is the same as that of the media asset to be fed back. If they are the same, the media asset to be fed back and the first media asset are both fed back to the user as target media assets. If they are different, the media asset type included in the user speech may deviate; the service parameter corresponding to the media asset type and/or the verb included in the user speech and the service parameter corresponding to the actual media asset type of the first media asset are further determined, the target service parameter is calculated from them, and the media asset with the same media asset type but different media asset content is determined from the media assets to be fed back as the second media asset, so as to meet the user's actual needs.
In some embodiments, after the target media assets matching the media asset type and/or the verb are determined from the candidate media assets, a historical search record is obtained, and first ranking weights of the multiple target media assets are determined according to the historical search record. The historical search record may be associated with user information: after the user speech is recognized in step S702, the user information corresponding to the user speech is determined. It can be understood that the voiceprint information included in the user speech makes the user information unique, so that unique user information can be determined from the user speech; binding the user information with the historical search record facilitates analyzing user preferences.
In the process of displaying the search results corresponding to the target media assets, the historical search record corresponding to the user information is obtained, and the first ranking weights of the target media assets are determined according to the historical search record. Resource popularity parameters of the multiple target media assets are obtained, and second ranking weights of the multiple target media assets are determined according to them. The resource popularity parameter represents how popular a target media asset is, and is obtained by the server through calculating and processing public opinion data, which includes, but is not limited to, play counts, like counts, comment counts, forwarding counts, and sharing counts; the present disclosure does not limit this. It can be understood that the larger the play count of a target media asset, the larger its resource popularity parameter and, correspondingly, the larger its second ranking weight.
Further, the target ranking weight is calculated from the first ranking weight and the second ranking weight. Optionally, the target ranking weight is the average of the first ranking weight and the second ranking weight. The display is controlled to display the search results corresponding to the target media assets according to the target ranking weights, so that the search results of the target media assets are ranked with both the user preference and the media asset popularity taken into account.
For example, as shown in fig. 10, fig. 10 is a schematic diagram of a voice search user interface provided in the embodiment of the present disclosure. After the target media assets are determined to include "Unstained" and "Left Hand Pointing at the Moon", the historical search record is obtained; the play count of "Unstained" in the historical search record exceeds that of "Left Hand Pointing at the Moon", so the first ranking weight of "Unstained" is greater. The second ranking weight of "Unstained", determined from the resource popularity parameter, is also greater than that of "Left Hand Pointing at the Moon". The calculated target ranking weight of "Unstained" is therefore greater, and, in descending order of target ranking weight, the search result 11 corresponding to "Unstained" is displayed above the search result 12 corresponding to "Left Hand Pointing at the Moon", as shown in fig. 10.
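The ranking rule above (target weight = average of the history-based first weight and the popularity-based second weight, results shown in descending order) can be sketched as follows. The specific weight values are illustrative assumptions echoing the example:

```python
# Sketch of the ranking step: average the two per-asset weights and sort the
# target media assets in descending order of the resulting target weight.

def target_order(assets: dict[str, tuple[float, float]]) -> list[str]:
    """`assets` maps asset name -> (first_weight, second_weight)."""
    weight = {name: (w1 + w2) / 2 for name, (w1, w2) in assets.items()}
    return sorted(weight, key=weight.get, reverse=True)

# Illustrative weights: "Unstained" outranks "Left Hand Pointing at the Moon"
# on both search history and popularity, so it is displayed first.
print(target_order({
    "Unstained": (0.8, 0.7),
    "Left Hand Pointing at the Moon": (0.5, 0.4),
}))
```

Other combinations of the two weights (e.g. a weighted sum) would fit the same framework; the average is simply the option the paragraph above names.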
In summary, the embodiment of the present disclosure provides a voice search method: the acquired user speech is recognized to obtain the keywords in it, the keywords including a media asset name, a media asset type and/or a verb; candidate media assets are then determined from the target knowledge graph library according to the media asset name, the candidate media assets including the first media asset indicated by the media asset name and the media assets to be fed back that have an association relation with it; target media assets matching the media asset type and/or the verb are then determined from the candidate media assets, and the display is controlled to display the search results corresponding to the target media assets. The method uses the relations among the context information in the user speech to accurately understand the user's real intention, so that a matched media asset search result can still be accurately fed back to the user even when the media asset name in the voice search does not match the media asset type and/or the verb, improving the user experience.
As shown in fig. 11, fig. 11 is a schematic structural diagram of a display device provided in an embodiment of the present disclosure, where the display device includes a processor 1101, a memory 1102, and a computer program stored in the memory 1102 and operable on the processor 1101, and when the computer program is executed by the processor 1101, the computer program implements each process of the voice search method in the foregoing method embodiments. And the same technical effect can be achieved, and in order to avoid repetition, the description is omitted.
The embodiment of the present disclosure provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements each process executed by the voice search method, and can achieve the same technical effect, and in order to avoid repetition, the computer program is not described herein again.
The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The present disclosure provides a computer program product including a computer program which, when run on a computer, causes the computer to implement the above-described voice search method.
The foregoing description has, for purposes of explanation, been presented in conjunction with specific embodiments. It is not intended to be exhaustive or to limit the implementations to the precise forms disclosed above; many modifications and variations are possible in light of the above teachings. The embodiments were chosen and described to best explain the underlying principles and their practical application, thereby enabling others skilled in the art to best utilize the various embodiments, with modifications suited to the particular use contemplated.

Claims (10)

1. A display device, comprising:
a user input interface configured to: acquire user speech;
a controller configured to: recognize the user speech, and acquire keywords in the user speech, wherein the keywords comprise a media asset name, a media asset type, and/or a verb;
determine candidate media assets from a target knowledge graph library according to the media asset name, wherein the candidate media assets comprise a first media asset indicated by the media asset name and media assets to be fed back, the media assets to be fed back having an association relationship with the first media asset; and
determine, from the candidate media assets, a target media asset matching the media asset type and/or the verb, and control a display to display a search result corresponding to the target media asset.
2. The display device according to claim 1, wherein the target knowledge graph library is at least one of preset knowledge graph libraries, the preset knowledge graph libraries comprising: a first error correction knowledge graph library, a second error correction knowledge graph library, and an audio-video knowledge graph library;
the first error correction knowledge graph library comprises media assets whose media asset name pronunciation similarity is greater than a first similarity threshold but whose media asset types are different;
the second error correction knowledge graph library comprises media assets whose media asset name pronunciation similarity is greater than a second similarity threshold but whose media asset contents are different, the second similarity threshold being greater than the first similarity threshold; and
the audio-video knowledge graph library comprises video media assets and music media assets, the video media assets and the music media assets having a correspondence relationship.
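A minimal sketch of how a query name might be routed among the three preset libraries by name-pronunciation similarity. The threshold values, the routing rule, and the use of `difflib` as a stand-in for a real phonetic (e.g. pinyin) comparison are assumptions for illustration; the claim specifies only that the second threshold exceeds the first.

```python
from difflib import SequenceMatcher

FIRST_THRESHOLD = 0.6   # looser: similar-sounding names, different asset types
SECOND_THRESHOLD = 0.8  # stricter: near-identical names, different contents

def pronunciation_similarity(a, b):
    # Stand-in for a real phonetic comparison of the two names.
    return SequenceMatcher(None, a, b).ratio()

def route_to_library(query_name, known_name):
    sim = pronunciation_similarity(query_name, known_name)
    if sim > SECOND_THRESHOLD:
        return "second_error_correction"  # names nearly identical, contents differ
    if sim > FIRST_THRESHOLD:
        return "first_error_correction"   # similar names, different types
    return "audio_video"                  # fall back to the AV correspondence graph
```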
3. The display device according to claim 1, wherein there are a plurality of target media assets;
the controller, when controlling the display to display the search results corresponding to the target media assets, is configured to:
acquire a historical search record, and determine first ranking weights of the plurality of target media assets according to the historical search record;
acquire resource popularity parameters of the plurality of target media assets, and determine second ranking weights of the plurality of target media assets according to the resource popularity parameters;
calculate target ranking weights according to the first ranking weights and the second ranking weights; and
control the display to display the search results corresponding to the target media assets according to the target ranking weights.
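The two-weight ordering in claim 3 can be illustrated as below. The equal-weight linear combination is an assumption: the claim states only that a target ranking weight is calculated from the first (search history) and second (popularity) ranking weights, without fixing the formula.

```python
def rank_targets(targets, history, heat):
    """Order target media asset names by a combined ranking weight.

    targets: list of asset names; history: name -> past search count
    (first ranking weight); heat: name -> popularity score (second
    ranking weight). Missing entries default to zero.
    """
    def target_weight(name):
        w1 = history.get(name, 0)      # first ranking weight: search history
        w2 = heat.get(name, 0.0)       # second ranking weight: popularity
        return 0.5 * w1 + 0.5 * w2     # assumed combination into target weight
    return sorted(targets, key=target_weight, reverse=True)
```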
4. The display device according to claim 1, wherein the controller, when determining, from the candidate media assets, the target media asset matching the media asset type and/or the verb, is configured to:
determine a service parameter corresponding to the media asset type and/or the verb and a service parameter corresponding to an actual media asset type of the first media asset, to calculate a target service parameter, wherein the target service parameter is the service parameter corresponding to the user speech; and
determine the target media asset from the candidate media assets according to the target service parameter.
5. The display device according to claim 4, wherein the controller, when determining the target media asset from the candidate media assets according to the target service parameter, is configured to:
determine a second media asset from the media assets to be fed back according to the target service parameter;
if the media asset type of the second media asset is the same as that of the first media asset, control the display to display a search result corresponding to the first media asset and a search result corresponding to the second media asset; and
if the media asset type of the second media asset is different from that of the first media asset, control the display to display the search result corresponding to the second media asset.
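Claims 4 and 5 together can be sketched as follows: derive a target service parameter from the spoken type/verb (falling back to the first asset's actual type), select the second media asset from the assets to be fed back, then decide what to display by comparing types. The verb-to-parameter mapping and the data shapes are invented for illustration.

```python
def decide_display(first, to_feed_back, spoken_type=None, verb=None):
    """first: the asset the spoken name denotes; to_feed_back: associated assets."""
    # Target service parameter: prefer the explicitly spoken type, else infer
    # from the verb, else fall back to the first asset's actual type.
    verb_map = {"play": "music", "watch": "movie"}
    target_param = spoken_type or verb_map.get(verb) or first["type"]
    # Second media asset: the associated asset matching the target parameter.
    second = next((a for a in to_feed_back if a["type"] == target_param), None)
    if second is None:
        return [first]
    if second["type"] == first["type"]:
        return [first, second]  # same type: show both search results
    return [second]             # different type: show only the second asset
```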
6. The display device according to claim 4, wherein the target knowledge graph library is a second error correction knowledge graph library, the second error correction knowledge graph library comprising media assets whose media asset name pronunciation similarity is greater than a second similarity threshold but whose media asset contents are different;
before determining the service parameter corresponding to the media asset type and/or the verb and the service parameter corresponding to the actual media asset type of the first media asset to calculate the target service parameter, the controller is further configured to:
judge whether the media asset types of the first media asset and the media assets to be fed back are the same; and
the controller, when determining the service parameter corresponding to the media asset type and/or the verb and the service parameter corresponding to the actual media asset type of the first media asset to calculate the target service parameter, is configured to:
in a case where the media asset type of the first media asset is different from that of the media assets to be fed back, determine the service parameter corresponding to the media asset type and/or the verb and the service parameter corresponding to the actual media asset type of the first media asset, to calculate the target service parameter.
7. The display device according to claim 1, wherein the controller, after recognizing the user speech and acquiring the keywords in the user speech, and before determining the candidate media assets from the target knowledge graph library according to the media asset name, is further configured to:
judge whether the media asset name corresponds to the media asset type and/or whether the media asset name corresponds to the verb; and
if the media asset name does not correspond to the media asset type and/or the media asset name does not correspond to the verb, determine the target knowledge graph library according to the media asset name.
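The pre-check in claim 7 (does the spoken name correspond to the spoken type and/or verb?) might look like the sketch below, which gates the error-correction lookup on a detected mismatch. The name-to-type table and the verb mapping are hypothetical.

```python
# Hypothetical table of what each known media asset name actually denotes.
NAME_TO_TYPE = {"titanic": "movie", "my heart will go on": "music"}

def needs_error_correction(name, spoken_type=None, verb=None):
    """True when the name and the spoken type/verb disagree, so the target
    knowledge graph library should be determined from the name."""
    actual = NAME_TO_TYPE.get(name)
    verb_map = {"play": "music", "watch": "movie"}
    implied = spoken_type or verb_map.get(verb)
    return actual is not None and implied is not None and actual != implied
```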
8. A voice search method, comprising:
acquiring user speech;
recognizing the user speech, and acquiring keywords in the user speech, wherein the keywords comprise a media asset name, a media asset type, and/or a verb;
determining candidate media assets from a target knowledge graph library according to the media asset name, wherein the candidate media assets comprise a first media asset indicated by the media asset name and media assets to be fed back that have an association relationship with the first media asset; and
determining, from the candidate media assets, a target media asset matching the media asset type and/or the verb, and controlling a display to display a search result corresponding to the target media asset.
9. The method according to claim 8, wherein the target knowledge graph library is at least one of preset knowledge graph libraries, the preset knowledge graph libraries comprising: a first error correction knowledge graph library, a second error correction knowledge graph library, and an audio-video knowledge graph library;
the first error correction knowledge graph library comprises media assets whose media asset name pronunciation similarity is greater than a first similarity threshold but whose media asset types are different;
the second error correction knowledge graph library comprises media assets whose media asset name pronunciation similarity is greater than a second similarity threshold but whose media asset contents are different, the second similarity threshold being greater than the first similarity threshold; and
the audio-video knowledge graph library comprises video media assets and music media assets, the video media assets and the music media assets having a correspondence relationship.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the voice search method according to any one of claims 8 to 9.
CN202211428652.6A 2022-11-15 2022-11-15 Display device, voice search method and storage medium Pending CN115862615A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211428652.6A CN115862615A (en) 2022-11-15 2022-11-15 Display device, voice search method and storage medium


Publications (1)

Publication Number Publication Date
CN115862615A true CN115862615A (en) 2023-03-28

Family

ID=85663553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211428652.6A Pending CN115862615A (en) 2022-11-15 2022-11-15 Display device, voice search method and storage medium

Country Status (1)

Country Link
CN (1) CN115862615A (en)

Similar Documents

Publication Publication Date Title
EP3190512B1 (en) Display device and operating method therefor
US9547716B2 (en) Displaying additional data about outputted media data by a display device for a speech search command
CN108391149B (en) Display apparatus, method of controlling display apparatus, server, and method of controlling server
US20150032453A1 (en) Systems and methods for providing information discovery and retrieval
US20140006022A1 (en) Display apparatus, method for controlling display apparatus, and interactive system
CN106021496A (en) Video search method and video search device
US10250935B2 (en) Electronic apparatus controlled by a user's voice and control method thereof
JP2014132465A (en) Display device and control method of the same
US9230559B2 (en) Server and method of controlling the same
JP2014132464A (en) Interactive type interface device and control method of the same
CN111295708A (en) Speech recognition apparatus and method of operating the same
CN112000820A (en) Media asset recommendation method and display device
US20150189391A1 (en) Display device, server device, voice input system and methods thereof
CN109600646B (en) Voice positioning method and device, smart television and storage medium
CN115273840A (en) Voice interaction device and voice interaction method
CN113468351A (en) Intelligent device and image processing method
CN115602167A (en) Display device and voice recognition method
CN115862615A (en) Display device, voice search method and storage medium
CN113938755A (en) Server, terminal device and resource recommendation method
CN111344664B (en) Electronic apparatus and control method thereof
JPWO2019082606A1 (en) Content management device, content management system, and control method
KR102091006B1 (en) Display apparatus and method for controlling the display apparatus
WO2022193735A1 (en) Display device and voice interaction method
US11366489B2 (en) Electronic apparatus and control method thereof
JP6959205B2 (en) Information processing system and information processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination