WO2023005580A1

WO2023005580A1 - Display device

Info

Publication number: WO2023005580A1
Application number: PCT/CN2022/102456
Authority: WO
Inventors: 张立泽; 戴磊; 马宏; 张大钊; 李霞; 李金凯
Original assignee: 海信视像科技股份有限公司
Priority date: 2021-07-29
Filing date: 2022-06-29
Publication date: 2023-02-02
Also published as: CN117651943A

Abstract

A display device, which may execute dependency syntactic parsing on speech text, so as to obtain at least two syntactic trees. If the probability values of the syntactic trees are equal to each other, then the syntactic tree which has a user intention that matches device state information of the current device is determined to be an optimal syntactic tree (S102). If the probability values of the syntactic trees are not equal to each other, and a user intention of the syntactic tree which has the maximum probability value matches the device state information of the current device, then the syntactic tree which has the maximum probability value is determined to be the optimal syntactic tree, and the user intention of the optimal syntactic tree is determined to be an optimal user intention (S103).

Description

display screen

Cross References to Related Applications

This application claims the priority of the Chinese application with the application date of July 29, 2021 and application number 202110865048.9, and the application date of August 16, 2021 and application number 202110934690.8, the entire contents of which are cited here.

technical field

The present application relates to the technical field of voice interaction, in particular to a content recommendation method and device.

Background technique

With the development of intelligent voice interaction technology, the voice interaction function has gradually become the standard configuration of intelligent terminal products. Users can use the voice interaction function to realize voice control of smart terminal products, and perform a series of operations such as watching videos, listening to music, checking the weather, and controlling TV.

The process of controlling smart terminal products by voice is usually that the voice recognition module recognizes the voice input by the user as text. Afterwards, the semantic analysis module analyzes the lexical syntax and semantics of the text, so as to understand the user's intention. Finally, the control terminal controls the intelligent terminal products to perform corresponding operations according to the understanding results.

Contents of the invention

An embodiment of the present application provides a user intention analysis method, the method includes: acquiring the voice text input by the user, performing semantic analysis on the voice text, and generating at least two syntax trees, wherein the syntax trees have probability values and The user intention, the probability value is the probability that the system outputs the syntax tree; when the probability values of the syntax trees are all equal, the syntax tree matching the user intention and the device state information of the current device is determined as the optimal syntax tree, And determining the user intent of the optimal syntax tree as the optimal user intent; the probability values in the syntax trees are not equal, and the user intent of the syntax tree with the largest probability value matches the device state information of the current device When , the syntax tree with the largest probability value is determined as the optimal syntax tree, and the user intent of the optimal syntax tree is determined as the optimal user intent.

Description of drawings

Fig. 1 is a schematic diagram of the principle of voice interaction according to some embodiments;

Fig. 2 is a schematic flow chart of a method for analyzing user intent according to some embodiments;

Figure 3 is a block diagram of a media asset retrieval system according to some embodiments;

FIG. 4 is a schematic diagram of a user interface in a display device 200 according to some embodiments;

Fig. 5 is a signaling diagram of a content recommendation method according to some embodiments;

Fig. 6 is a signaling diagram of another content recommendation method according to some embodiments.

Detailed ways

In order to make the purpose and implementation of the application clearer, the following will clearly and completely describe the exemplary implementation of the application in conjunction with the accompanying drawings in the exemplary embodiment of the application. Obviously, the described exemplary embodiment is only the present application. Claim some of the examples, not all of them.

It should be noted that the brief description of the terms in this application is only for the convenience of understanding the implementations described below, and is not intended to limit the implementations of this application. These terms are to be understood according to their ordinary and usual meaning unless otherwise stated.

The terms "first", "second", and "third" in the description and claims of this application and the above drawings are used to distinguish similar or similar objects or entities, and do not necessarily mean limiting specific sequential or sequential unless otherwise noted. It is to be understood that the terms so used are interchangeable under appropriate circumstances.

The terms "comprising" and "having", as well as any variations thereof, are intended to be inclusive but not exclusive, for example, a product or device comprising a series of components is not necessarily limited to all components expressly listed, but may include not expressly listed other components listed or inherent to these products or equipment.

The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware or/and software code capable of performing the function associated with that element.

In order to clearly illustrate the embodiment of the present application, a voice recognition network architecture provided by the embodiment of the present application will be described below with reference to FIG. 1 .

Referring to FIG. 1, FIG. 1 is a schematic diagram of a speech recognition network architecture provided by an embodiment of the present application. In FIG. 1 , the smart device is used to receive input information and output processing results of the information. The speech recognition service device is an electronic device deployed with a speech recognition service, the semantic service device is an electronic device deployed with a semantic service, and the business service device is an electronic device deployed with a business service. The electronic device here may include a server, a computer, etc., and the speech recognition service, semantic service (also called a semantic engine) and business service here are web services that can be deployed on the electronic device, wherein the speech recognition service is used for audio Recognized as text, the semantic service is used for semantic analysis of the text, and the business service is used to provide specific services such as the weather query service of Moji Weather, the music query service of QQ Music, etc. In an embodiment, there may be multiple entity service devices deployed with different business services in the architecture shown in FIG. 1 , or one or more functional services may be integrated in one or more entity service devices.

In some embodiments, the following is an example description of the process of processing the information input to the smart device based on the architecture shown in Figure 1. Taking the information input to the smart device as a query sentence input by voice as an example, the above process may include the following three processes :

[Speech Recognition]

After receiving the query sentence input by voice, the smart device can upload the audio of the query sentence to the voice recognition service device, so that the voice recognition service device can recognize the audio as text through the voice recognition service and return it to the smart device. In one embodiment, before uploading the audio of the query sentence to the speech recognition service device, the smart device may perform denoising processing on the audio of the query sentence, where the denoising processing may include steps such as removing echo and environmental noise.

[semantic understanding]

The smart device uploads the text of the query sentence recognized by the speech recognition service to the semantic service device, so that the semantic service device can perform semantic analysis on the text through the semantic service to obtain the business field and intention of the text.

[semantic response]

According to the semantic analysis result of the text of the query statement, the semantic service device sends a query instruction to the corresponding business service device to obtain the query result given by the business service. The smart device can obtain and output the query result from the semantic service device. As an embodiment, the semantic service device can also send the semantic analysis result of the query sentence to the smart device, so that the smart device can output the feedback sentence in the semantic analysis result.

It should be noted that the architecture shown in FIG. 1 is only an example, and does not limit the protection scope of the present application. In the embodiment of the present application, other architectures may also be used to implement similar functions. For example, all or part of the three processes may be completed by a smart terminal, which will not be described in detail here.

In some embodiments, the smart device shown in Figure 1 can be a display device, such as a smart TV, and the function of the voice recognition service device can be realized by the cooperation of the sound collector and the controller set on the display device, and the semantic service device and business service device The functions of can be realized by the controller of the display device, or by the server of the display device.

In order to clearly illustrate the embodiments of the present application, some technical terms are explained below:

[Voiceprint]

Voiceprint is the sound wave spectrum that carries speech information displayed by electro-acoustic instruments. It is a biological feature composed of more than a hundred characteristic dimensions such as wavelength, frequency, and intensity. It has the characteristics of undetermined, measurable, and unique.

The current mainstream speaker clustering algorithm is based on the speaker segmentation, based on the Bayesian information criterion, using the agglomerative hierarchical clustering algorithm to directly judge the speech segments after the speaker segmentation, which will belong to the same speaker Human speech segments are combined into one category. The basic idea is to extract feature parameters from each speech segment, such as Mel cepstrum parameters, calculate the similarity of feature parameters between each two speech segments, and use BIC to judge whether the two speech segments with the highest similarity are merged into same class. The above-mentioned judgments are performed on any two speech segments until all speech segments are no longer merged.

[User portrait]

By collecting data in various dimensions such as user social attributes, consumption habits, and preference characteristics, and then characterizing user or product characteristics, and analyzing and statistically mining potential value information on these characteristics, a complete picture of a user can be abstracted. User portraits are a prerequisite for targeted advertising or personalized recommendations.

The process of controlling smart terminal products by voice is usually that the voice recognition module recognizes the voice input by the user as text. Afterwards, the semantic analysis module analyzes the lexical syntax and semantics of the text, so as to understand the user's intention. Then feed back the recommended media information or data to the smart device according to the retrieval intention.

However, current smart terminal products usually do not consider the current state and scene of the device when understanding user intentions, and only analyze user intentions based on user dimensions or network popularity. For example, if the user enters "Everyday Upward", the user may search for results such as showing the encyclopedia introduction of Everyday Upward, interesting quizzes and answers to learn about Everyday Upward, and playing the variety show of Everyday Upward. If the current scene of the smart terminal is ignored, there may be a deviation between the execution result and the user's actual intention. For example, when the smart terminal is currently in a video playback application, the user's intention may be to watch Tiantianupward variety shows; if the current device status is not considered, the possible execution result is to display Tiantianupward's encyclopedia introduction, which results in There is a deviation between the execution result and the user's actual intention.

In order to solve the above problems, this application provides a user intent analysis method, which can not only be based on the user dimension, but also embed device dimension information during the intent analysis process, so that the intent analysis is more accurate, and finally the terminal device can accurately perform corresponding operations. , to improve user experience.

The flow diagram of the semantic understanding method as shown in Figure 2, the method includes the following steps:

In step S101, the speech text input by the user is obtained, and the speech text is analyzed using dependency syntax to generate at least two syntax trees.

What needs to be explained is that when using dependency syntax to analyze speech text, it is possible to generate only one syntax tree, and then just execute the only corresponding result. The embodiment of the present application uses generating at least two syntax trees as an example to illustrate the solution.

The voice text is obtained by analyzing the voice signal input by the user. Specifically, the user inputs the voice signal within the range of the terminal device receiving the signal. The terminal device may collect a voice signal input by the user through a microphone, and then obtain and recognize the voice text from the voice signal.

In this embodiment of the present application, the voice text can be recognized by the voice recognition server. Semantic analysis is performed on the speech text by the semantic server. Specifically, word segmentation processing is first performed on the speech text. Based on the thesaurus, use the forward maximum matching method to perform word segmentation. For example, "Andy Lau's movie New Shaolin Temple", after word segmentation processing, the word segmentation "Andy Lau, of, movie, new Shaolin Temple" is obtained.

To further perform part-of-speech tagging on word segmentation, for example, LAC (Lexical Analysis of Chinese) lexical analysis tools can be used to perform Chinese word segmentation and part-of-speech tagging on media titles. The LAC lexical analysis tool is a combined lexical analysis model that can complete Chinese word segmentation and part-of-speech tagging as a whole, and can also add a custom dictionary to identify proper names. The input of the LAC lexical analysis task is a string, and the output is the word boundary and part of speech in the media title. After that, the dependency syntax is used to extract the user's intent in the speech text according to the result of part-of-speech tagging. Dependency syntax analysis uses global search to generate multiple syntax trees, each sentence corresponds to one or more syntax trees, and each syntax tree has probability values and user intentions. A common practice in related technologies is that the system outputs the syntax tree with the highest probability. Finally, the user intention of the syntax tree with the highest probability is determined as the user intention in the speech text.

It should be noted that the word segmentation and part-of-speech tagging tools used in this application are not limited to the LAC lexical analysis tool, and other lexical analysis tools can also be used.

Step S102 , determining the syntax tree with the user intent matching the device state information of the current device as the optimal syntax tree, and determining the user intent of the optimal syntax tree as the optimal user intent.

The device status information of the current device in this embodiment of the present application may include information such as device type, device mode, and terminal status. The device type can be TV, refrigerator, speaker, etc., the device mode can be TV mode, speaker mode, children’s mode, etc., and the terminal status can be the application or interface information that the device is currently in. Both the device mode and the terminal state are attached to the device type, so there are dependencies in the three dimensions. When determining the optimal syntax tree, all the syntax trees are matched with the device state information of the current device, and the matched syntax tree is the optimal syntax tree.

It should be noted that different devices support different skills, the same device supports different skills in different modes, and the same device supports different skills in different interfaces in the same mode.

In the embodiment of the present application, after receiving the voice command input by the user, the information corresponding to the voice command and the current device status of the device are sent to the server at the same time, and the server performs voice recognition and semantic analysis, and sends the corresponding The syntactic tree and the device status are combined to obtain the optimal syntactic tree, and finally media resources are recommended according to the final syntactic tree.

An example of the device status information at the device mode level: the current device is a display device, and the device mode of the current device is a children's mode. Receive the voice information of "Play Mulan" and parse out two syntax trees. The user intent of syntax tree A is to play the live-action movie Mulan; the user intent of syntax tree B is to play the cartoon Mulan. And when the device mode of the display device is the children's mode, the display device is not allowed to play live-action movies. Then the user intention of the syntax tree A does not match the device state information of the current device, and the syntax tree A cannot be determined as the optimal syntax tree. When the device mode of the display device is the kids mode, the display device is allowed to play cartoons. Then the user intention of the syntax tree B matches the device status information of the current device, and the syntax tree B can be determined as the optimal syntax tree. Finally, the user intention of syntax tree B to play cartoons is determined as the optimal user intention.

An example of device status information at the device interface level: receive the "Two Liang Beef" input by the user, and analyze the syntax tree determined by the voice text input by the user. The user intention of syntax tree A is to manage beef ingredients, and the user intent of syntax tree B is The intention is to buy two catties of beef. If the current device is a smart refrigerator, if the device interface of the current device is an ingredient management interface. Then the user intention "manage beef ingredients" in the syntax tree A matches the device status information of the current device. The syntax tree A can be determined as the optimal syntax tree, and the user intention "managing beef ingredients" possessed by the syntax tree A can be determined as the optimal user intention; if the inter-device interface of the current device is a shopping interface. Then the user intention "buy two catties of beef" in the syntax tree B matches the device state information of the current device. The syntax tree B can be determined as the optimal syntax tree, and the user intention "buy two catties of beef" possessed by the syntax tree B can be determined as the optimal user intention.

Step S103, if the probability values of all the syntax trees are not equal, and the user intention of the syntax tree with the largest probability value matches the device state information of the current device, then determine the syntax tree with the largest probability value as the optimal syntax tree, And determining the user intention possessed by the optimal syntax tree as the optimal user intention. It should be noted that the probability values of all the syntax trees here are not equal, it can be that the probability values of all the syntax trees are not equal, or the probability values of at least two syntax trees among all the syntax tree probability values are not equal .

For example, a syntax tree A, a syntax tree B, and a syntax tree C are obtained after performing semantic analysis on the speech and text input by the user. Among them, the probability values of the three syntax trees are not equal, and the probability value of syntax tree A is the largest. If the user intention of the syntax tree A matches the device state information of the current device, then determine the syntax tree A as the optimal syntax tree, and determine the user intention of the syntax tree A as the optimal user intention.

In some embodiments, as shown in the flow chart of FIG. 3 , the method of the embodiment of the present application further includes sorting all the syntax trees according to the descending order of the probability values. For example, the syntax tree A, syntax tree B, and syntax tree C of the above embodiment, wherein the probability value of syntax tree A is 0.96, the probability value of syntax tree B is less than 0.96, and the probability value of syntax tree C is smaller than the probability value of syntax tree B. Then, according to the probability value, they are sorted from large to small: syntax tree A, syntax tree B, and syntax tree C.

When determining the optimal syntax tree, it is first judged whether the user intention of the syntax tree A matches the device state information of the current device. If the user intention of the syntax tree A matches the device status information of the current device, then the syntax tree A is determined to be the optimal syntax tree. If the user intention of syntax tree A does not match the device state information of the current device, it is further judged whether the deviation between the probability value of syntax tree B ranked second and the probability value of syntax tree A is less than the deviation threshold.

If the difference between the probability value of the syntax tree B and the probability value of the syntax tree A is less than the deviation threshold, it is further determined whether the syntax tree B has the user intention and matches the device state information of the current device.

If the user intention of the syntax tree B matches the device state information of the current device, then determine the syntax tree B as the optimal syntax tree, and determine the user intention of the syntax tree B as the optimal user intention. If the user intention of the syntax tree B does not match the device status information of the current device, the same judgment operation is further performed on the syntax tree C.

If the difference between the probability value of syntax tree B and the probability value of syntax tree A is greater than the deviation threshold, then syntax tree A is still determined to be the optimal syntax tree. At this time, since the user intention of the syntax tree A does not match the device state information of the current device, a prompt may be displayed to the user, and the prompt is used to remind the user that the current device cannot perform the operation corresponding to the optimal user intention. A manner of displaying the prompt to the user may be displaying the prompt on a display, or displaying the prompt by voice broadcast.

In some embodiments, the traditional smart device media resource retrieval method relies on the user's explicit search intention. In some customized scenarios, if the user's clear search intent cannot be obtained, the smart device can only give the user a simple text reply, or even fail to give the user a reply. Therefore, the traditional smart device media asset retrieval method has poor user experience for users.

The application provides a media resource retrieval system, a frame diagram of the media resource retrieval system shown in FIG. 3 , the system includes a display device 200 and a server 400 . The display device 200 further includes a display, a communicator, a sound collector, and a controller. The display is used to display the user interface. The communicator is used for data communication with the server 400 . The voice collector user collects voice information input by the user. The server 400 is configured to provide various media resource information and media resource data to the display device.

In some embodiments, the process for the user to use the media asset retrieval system of this embodiment to perform media asset retrieval is specifically:

First, the user inputs an instruction for waking up the voice interaction function of the display device, and drives the sound collector to start working according to the instruction. The tool for waking up the semantic interaction function of the display device can be a built-in or installed application, such as a voice assistant.

In some optional embodiments, the way to wake up the voice assistant may be to wake up through the first voice information input by the user in the far field. For example, the first voice information is a preset wake-up word. Degree", or "Hisense Small Gathering" and other preset wake-up words, so as to wake up the voice interaction function of the display device. In some optional embodiments, the wake-up word can be set by the user, such as "I love my home", "TV TV" and so on.

In some other optional implementation manners, the user may also directly touch the voice key on the remote controller, and the display device starts the voice assistant service according to the key instruction.

After waking up the voice interaction function of the display device, the user performs voice interaction with the display device, and the voice collector collects other voice information input by the user. If further search keywords that can be used to search for media asset content are not obtained from the sound collector, that is, no clear user intention can be obtained, an alternative media asset request is directly sent to the server. The server receives the candidate media resource information searched according to the candidate media resource request, and feeds back the candidate media resource information to the display device. After receiving the candidate media resource information, the display device displays the candidate media resource information on the display.

Specifically, the situation that the display device receives the voice instruction can be determined according to the situation that the voice collector collects the voice information.

In the first scenario, the second voice information further input by the user is not received, or the search keyword cannot be recognized from the second voice information. Wherein, the process of identifying user intention from voice information is a related technology, and this application will not elaborate on it.

In the second scenario, second voice information further input by the user is received, and search keywords are identified from the second voice information, but the identified search keywords cannot be used to search for media asset content. For example, the identified search keyword is not a preset keyword, that is, the search keyword is not a keyword indicating the business scope of the display device.

After the media asset retrieval process in the above embodiment, even if no clear user intention can be obtained, or the recognized user intention is not within the business scope of the display device, the server can also feed back corresponding media asset information according to different scenarios in which the display device is located. And display the corresponding media information on the monitor to avoid the occurrence of no reply.

Exemplarily, the first scenario may be a scenario where there is no content input for a period of time after the user wakes up the voice assistant. For example, after the user enters the wake-up word "Hello, Xiaodu" and there is no content input, the search keyword for searching media asset content cannot be identified from the wake-up word. At this point, it may be determined that the current scene of the display device is the first scene, and the display device sends a media asset request to the server, where the media asset request carries information about the first scene. The server searches for corresponding first media asset information according to the first scene information, and feeds back the first media asset information.

The second scenario may be that the user further inputs voice information after waking up the voice assistant, and can identify search keywords from the input voice information. However, this search keyword is not within the scope of the display device business. For example, after the user wakes up the voice assistant, and then enters the voice message "play XX game video". Although the search keyword of "XX game video" can be identified from the voice information, "XX game video" is not a preset keyword, that is, XX game video is beyond the business scope of the display device.

In some embodiments, when the search keyword that can be used to search for media asset content is not obtained from the sound collector, the specific process of receiving from the server the candidate media resource information searched according to the candidate media resource request may be:

It is judged whether the voiceprint information can be determined from the first voice information, and if the voice information can be determined from the first voice information, then the voiceprint information is sent to the server. The server determines the user profile based on the voiceprint information, and then searches for alternative media information based on the user profile. Voiceprint information may include voiceprint ID and voiceprint attributes. If both the voiceprint ID and the voiceprint attribute can be determined from the first voice information, since each user has a unique voiceprint ID, the user profile is determined according to the voiceprint ID.

If the voiceprint ID can only be determined from the first voice information, the voiceprint ID is sent to the server. The server determines the user portrait uniquely corresponding to the voiceprint ID according to the voiceprint ID. The server then searches for candidate media resource information according to the determined user profile.

It should be noted that the display device may be a family TV, and at this time, the display device stores the voiceprint IDs of family members according to the voice access history. For example, the server stores the voiceprint IDs of grandpa, grandma, father, and mother. When the grandfather uses the display device to input voice information, the display device first sends the device ID of the display device to the server. The server searches for the voiceprint ID corresponding to the device according to the device ID.

Since the voiceprint ID of the grandfather is stored in advance, according to the characteristics of the voiceprint, it can be determined that the voiceprint ID of the grandfather can be recognized in the input voice information. Further determine the corresponding user portrait according to the grandfather's voiceprint ID. Then search for alternative media information based on the user portrait. In this way, the media asset information determined through the user portrait is related to the current user. If the guest uses the display device to input voice information, the display device first sends the ID of the display device to the server. Since the guest's voiceprint ID is not stored in advance. Then the server cannot determine the voiceprint ID according to the voice information.

In some embodiments, if the voiceprint ID cannot be determined from the voice information, but the voiceprint attribute can be determined from the voice information, then the voiceprint attribute is sent to the server. The server determines the corresponding user portrait according to the voiceprint attribute, and searches for candidate media resource information according to the user portrait. The voiceprint attribute here may be a user characteristic of a type of user. The user characteristics may include the user's gender, age and other physiological characteristics.

For example, if the voiceprint attribute determined from the voice information is a middle-aged male, then the determined user portrait corresponds to a middle-aged male. The media information searched based on user portraits may be related to finance, automobiles, etc. If the voiceprint attribute determined from the voice information is a child, the determined user portrait corresponds to the child. The media resource information searched based on the user portrait may be media resource information related to cartoons.

In some embodiments, the identification history of the device can also be statistically displayed under the voiceprint feature. That is, statistics display all the voiceprint attributes recognized by the device. If the proportion of the recognition history record of a certain voiceprint attribute exceeds the preset threshold, the voiceprint attribute will be sent to the server. If the proportion of voiceprint attribute recognition history records exceeds the preset threshold, it means that the number of users of this type of display device is the most.

For example, if the voiceprint attribute is that the recognition history records of children account for more than 80%, it means that children users use the display device the most times. Send the child with the voiceprint attribute to the server, so that the server can feed back the media asset information corresponding to the child's user portrait.

In some embodiments, if neither the voiceprint ID nor the voiceprint attribute can be determined from the first voice information. The voiceprint ID or voiceprint attribute is then determined according to the voice information input by the user last time. It should be noted that the time between the moment when the user entered the voice information last time and the moment when the voice assistant is awakened does not exceed the preset time. For example, the time when the voice assistant is currently awakened is within 30 seconds from the time when the voice information was input last time.

In this way, it can be roughly determined that the user who wakes up the voice assistant this time is the same person as the user who woke up the voice assistant last time. Factors such as age, therefore, the content recommended to users can motivate users to make further interactions.

In some embodiments, the user portrait storage structure includes at least two tendency fields, and each tendency field includes at least two query dimensions. The preference domain is set with the weight of the preferred domain, and the query dimension is set with the weight of the query dimension. Different user portrait storage results include different tendencies and query dimensions. For example, the user portrait includes the fields of preference "movie", "music", "recipe", "variety show" and so on. Among them, the inclination field "movies" includes query dimensions "war movies" and "action movies", etc., the inclination field "music" also includes query dimensions "popular" and "popular", etc., and the inclination field "recipes" also includes query dimensions " "Cantonese Cuisine", "Sichuan Cuisine", etc., and the preferred field "Variety Show" also includes query dimensions "reality show", "blind date", etc.

The preferred domains in the above examples all have preferred domain weights, and the preferred domain weights can be set according to user portraits, for example, according to the number of viewing times of users. Query dimensions also have query dimension weights, which can also be set according to user portraits. First, according to the weight of the tendency field, the top rankings can be calculated using a weighted random algorithm. For example, the weights of the top three tendency fields are "movies", "music", and "recipes".

The media resource database in the embodiment of the present application is provided with at least two media resource cards, and the media resource cards correspond to the preferred fields. For example, media resource cards such as "movies", "music" and "recipes" are set in the media resource library. In the media asset library, the media asset card is also set to have a weight. After calculating the top three preferred fields according to the weight of the inclined field, the final card is selected according to the weight of the media capital card. A weighted random algorithm can also be used. For example, the selected final card is "music", that is, the final determined field of inclination is "music".

After determining the final inclination field "music", based on the weight of the query dimension, a weighted random algorithm is used to determine the final query dimension. For example, determine the final query dimension as "popularity". Finally, by inquiring about the music query service in the video query service, the media resource query is performed based on the media resource card "music" and the query dimension "pop". Finally, the media asset information of the media asset card "music" and the query dimension "popular" can be randomly fed back to the user. For example, feed back media information about popular songs sung by Xu Wei.

In some embodiments, different media resource libraries, ie, card pools, are stored in the server for different scenarios of the display device. The first scene is a scene in which no second voice information is input or a search keyword cannot be recognized from the second voice information, for example, it may be a scene in which no content is input for a period of time after the user wakes up the voice assistant. For this scenario, the server stores the card pool shown in Table 1.

Table 1 Card pool for the first scene

the	卡片名称card name	卡片类型card type
11	教育educate	eduedu
22	广播broadcast	fmfm
33	游戏game	gamegame
44	应用application	appapp
55	音乐music	client_musicclient_music
66	帮助信息help information	client_helpinfoclient_helpinfo
77	电视剧TV drama	tvplaytvplay
88	电影Movie	filmfilm

For the first scenario, the card pool stored by the server contains more guessed cards that the user may like.

The second scenario is that the search keyword can be recognized from the voice information input by the user, but the search keyword cannot be used to search for media content, that is, the user's intention is beyond the business scope of the display device. For this scenario, the server stores the card pool shown in Table 2.

Table 2 Card pool for the second scene

the	卡片名称card name	卡片类型card type
11	应用application	appapp
22	新闻news	client_newsclient_news
33	音乐music	client_musicclient_music
44	帮助信息help information	client_helpinfoclient_helpinfo
88	电视剧TV drama	tvplaytvplay

For the second scenario, more cards in the card pool stored by the server are used to guide the user to use the voice assistant.

The embodiment shown in FIG. 4 is a scenario where there is no content input for a period of time after the user wakes up the voice assistant. After the above steps, the display device can obtain three types of media asset cards from the server. All three types of cards are used to guide users to voice input. The user of the first card in Fig. 4 guides the user to input voice information such as "some nice music", "today's hot news", "today's weather" and so on.

In addition to the first scenario and the second scenario in the above embodiment, this application can also set specific card pools for other scenarios, and other scenarios can be system-side custom scenarios.

For example, when the voice message "good morning" is received, it may be determined that the current scene of the display device is a morning greeting scene. After that, identify the voiceprint ID or voiceprint attribute from the voice information, and obtain the media resource card for the morning greeting scene from the server according to the voiceprint ID or voiceprint attribute.

When the voice message "I'm home" is received, it can be determined that the current scene of the display device is the home scene. According to the voiceprint ID or voiceprint attribute, the media resource card for the home scene is obtained from the server.

When the user interface of the display device is in the APP operation interface for a long time without receiving any operation instructions from the user, after detecting this scene, the media resource card used to guide the operation of the APP interface can be obtained from the server.

When the display device fails to call the system service, after detecting the scene, the media resource card used to guide how to eliminate the fault can be obtained from the server.

When the received voice message is a complaint message, for example, input the voice message "I am very tired today", after detecting the scene, the media card related to soothing music and funny movies can be obtained from the server.

In some embodiments, while the media asset card is obtained from the server and displayed, different prompts may also be provided according to specific scenarios. For example, display the greeting "Good morning", "Good evening", etc. on the user interface according to the time. Or in the coming home scenario, display the greeting "Welcome home" on the user interface.

The embodiment of the present application provides a content recommendation method, as shown in the signaling diagram of the content recommendation method in Figure 5, the method includes the following steps:

Step 501: Receive an instruction input by a user for activating the voice interaction function, and drive the sound collector to start according to the instruction, wherein the instruction is input in the form of a first voice message or a button.

Step 502: After starting the sound collector, if no search keywords that can be used to search for media asset content are obtained, send the voiceprint information related to the first voice information to the server.

Step 503: After the display device receives the candidate media resource information fed back by the server, it displays the candidate media resource information on the display, wherein the candidate media resource data is the voiceprint associated with the server according to the first voice information. Information OK.

Based on the above method embodiment, the embodiment of the present application provides another content recommendation method, as shown in FIG. 6, the method includes the following steps:

Step 601: Receive an instruction for activating the voice listening function, and drive the sound collector to start, wherein the instruction for activating the voice listening function may be pass.

Step 602, after starting the voice collector, but when no search keywords that can be used to search for media content are obtained from the voice collector, voiceprint information is extracted from the first voice information.

In step 603, send a candidate media resource request to the server, and the candidate media resource request carries voiceprint information.

The server determines the corresponding user portrait according to the voiceprint information. According to the user portrait, the corresponding candidate media asset information is searched in the server's media asset database. The server feeds back the candidate media resource information to the display device. After the display device receives the fed back candidate media resource information, it displays the candidate media resource information on the display.

In general, the computer instructions for realizing the method of the present invention can be carried by any combination of one or more computer-readable storage media. A non-transitory computer-readable storage medium may include any computer-readable medium except the transitory propagating signal itself.

A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or any combination thereof. More specific examples (non-exhaustive list) of computer-readable storage media include: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), computer Erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.

Computer program code for carrying out the operations of the present invention may be written in one or more programming languages, or combinations thereof, including object-oriented programming languages—such as Java, Smalltalk, C++, and conventional Procedural programming language—such as "C" language or similar programming language, especially Python language suitable for neural network computing and platform frameworks based on TensorFlow, PyTorch, etc. can be used. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user computer via any kind of network, including a local area network (LAN) or wide area network (WAN), or to an external computer (e.g., via an Internet connection using an Internet service provider). ).

Although the embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and cannot be construed as limiting the present invention, and those skilled in the art can make the above-mentioned The embodiments are subject to changes, modifications, substitutions and variations.

Obviously, those skilled in the art can make various changes and modifications to the application without departing from the spirit and scope of the application. In this way, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to include these modifications and variations.

Claims

A user intent analysis method, comprising:

Obtain the voice text input by the user, perform semantic analysis on the voice text, and generate at least two syntax trees, wherein the syntax tree has a probability value and user intention, and the probability value is the probability that the system outputs the syntax tree;

When the probability values of the syntax trees are all equal, determine the syntax tree whose user intention matches the device state information of the current device as the optimal syntax tree, and determine the user intention of the optimal syntax tree as the optimal user intention;

When the probability values of the syntax trees are not equal, and the user intention of the syntax tree with the largest probability value matches the device state information of the current device, determine the syntax tree with the largest probability value as the optimal syntax tree , and determine the user intent possessed by the optimal syntax tree as the optimal user intent.
The user intention analysis method according to claim 1, said method further comprising:

When the probability values of the syntax trees are all equal, and there are multiple syntax trees that the user intends to match with the device status information of the current device, search for the syntax tree with the highest popularity for the media resources corresponding to the user intention The optimal syntax tree is determined, and the user intention of the optimal syntax tree is determined as the optimal user intention.
The user intention analysis method according to claim 1, said method further comprising:

When the probability values of the syntax trees are all equal, and there are multiple syntax trees that the user intends to match with the device state information of the current device, the syntax tree that matches the device state information of the current device has The media resources corresponding to the user intent are displayed to the user.
According to the user intention analysis method according to claim 1, the method further comprises: sorting the syntax tree according to the probability value from large to small, and the probability values of the syntax tree are not equal, and the probability value is the largest The user intention of the syntax tree does not match the device state information of the current device, and the probability value of the syntax tree whose probability value ranks second is less than the deviation threshold from the probability value of the syntax tree with the largest probability value When , the syntax tree whose probability value ranks second is determined as the optimal syntax tree, and the user intent of the optimal syntax tree is determined as the optimal user optimal syntax tree intent.
According to the user intention analysis method according to claim 4, the method further comprises: when the user intention of the syntax tree whose probability value is ranked second is matched with the device status information of the current device, ranking the probability value at the second position The second syntax tree is determined as an optimal syntax tree, and the user intention of the optimal syntax tree is determined as the optimal user intention;

When the user intention of the syntax tree ranked second in probability value does not match the device status information of the current device, the syntax tree ranked second in probability value is not determined as the optimal syntax tree.
According to the user intention analysis method according to claim 1, the method further comprises: sorting the syntax tree according to the probability value from large to small, the probability values in the syntax tree are not equal, and the probability values The user intention corresponding to the largest syntax tree does not match the device status information of the current device, and the probability value of the syntax tree ranked second in the probability value deviation from the probability value of the syntax tree with the largest probability value is greater than When the deviation threshold is used, the syntax tree with the largest probability value is determined as the optimal syntax tree, and the user intention of the optimal syntax tree is determined as the optimal user intention.
The user intention analysis method according to claim 6, further comprising: displaying prompts to the user, the prompts are used to prompt the user that the current device cannot perform the operation corresponding to the optimal user intention.
A content recommendation method, characterized in that the method is applied to a display device, comprising:

receiving an instruction input by the user for waking up the voice interaction function, and driving the sound collector to start according to the instruction, wherein the instruction is input in the form of a first voice message or a button;

When no search keyword that can be used to search for media asset content is obtained from the sound collector, send an alternative media asset request to the server;

receiving, from the server, candidate media resource information searched according to the candidate media resource request, and displaying the candidate media resource information on the display.