WO2023005580A1 - Display device - Google Patents

Display device

Info

Publication number
WO2023005580A1
Authority
WO
WIPO (PCT)
Prior art keywords
syntax tree, user, optimal, probability value, user intention
Prior art date
Application number
PCT/CN2022/102456
Other languages
French (fr)
Chinese (zh)
Inventor
张立泽
戴磊
马宏
张大钊
李霞
李金凯
Original Assignee
海信视像科技股份有限公司 (Hisense Visual Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202110865048.9A (published as CN113593559A)
Priority claimed from CN202110934690.8A (published as CN114281952A)
Application filed by 海信视像科技股份有限公司 (Hisense Visual Technology Co., Ltd.)
Priority to CN202280047134.1A (published as CN117651943A)
Publication of WO2023005580A1

Classifications

    • G06F 16/332: Information retrieval; querying of unstructured textual data; query formulation
    • G06F 16/31: Information retrieval; indexing of unstructured textual data; data structures therefor; storage structures
    • G06F 16/33: Information retrieval; querying of unstructured textual data
    • G10L 15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/08: Speech recognition; speech classification or search

Definitions

  • the present application relates to the technical field of voice interaction, in particular to a content recommendation method and device.
  • the voice interaction function has gradually become a standard feature of smart terminal products. Users can use the voice interaction function to control smart terminal products by voice and perform a series of operations such as watching videos, listening to music, checking the weather, and controlling the TV.
  • in the typical process of controlling a smart terminal product by voice, a speech recognition module first recognizes the voice input by the user as text. A semantic analysis module then analyzes the lexical, syntactic, and semantic structure of the text to understand the user's intention. Finally, the control terminal directs the smart terminal product to perform the corresponding operation according to the understanding result.
  • An embodiment of the present application provides a user intention analysis method. The method includes: acquiring the speech text input by the user and performing semantic analysis on it to generate at least two syntax trees, where each syntax tree has a probability value (the probability that the system outputs that syntax tree) and a user intention; when the probability values of the syntax trees are all equal, determining the syntax tree whose user intention matches the device state information of the current device as the optimal syntax tree, and determining the user intention of the optimal syntax tree as the optimal user intention; and when the probability values of the syntax trees are not all equal, and the user intention of the syntax tree with the largest probability value matches the device state information of the current device, determining the syntax tree with the largest probability value as the optimal syntax tree, and determining the user intention of the optimal syntax tree as the optimal user intention.
  • Fig. 1 is a schematic diagram of the principle of voice interaction according to some embodiments.
  • Fig. 2 is a schematic flow chart of a method for analyzing user intention according to some embodiments.
  • Fig. 3 is a block diagram of a media asset retrieval system according to some embodiments.
  • Fig. 4 is a schematic diagram of a user interface in a display device 200 according to some embodiments.
  • Fig. 5 is a signaling diagram of a content recommendation method according to some embodiments.
  • Fig. 6 is a signaling diagram of another content recommendation method according to some embodiments.
  • module refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code capable of performing the function associated with that element.
  • FIG. 1 is a schematic diagram of a speech recognition network architecture provided by an embodiment of the present application.
  • the smart device is used to receive input information and output processing results of the information.
  • the speech recognition service device is an electronic device deployed with a speech recognition service
  • the semantic service device is an electronic device deployed with a semantic service
  • the business service device is an electronic device deployed with a business service.
  • the electronic device here may include a server, a computer, etc.
  • the speech recognition service, semantic service (also called a semantic engine), and business service here are web services that can be deployed on the electronic devices: the speech recognition service recognizes audio as text, the semantic service performs semantic analysis on the text, and the business service provides specific services such as the weather query service of Moji Weather or the music query service of QQ Music.
  • in the architecture shown in FIG. 1, there may be multiple entity service devices deployed with different business services, or one or more functional services may be integrated into one or more entity service devices.
  • the following describes, as an example, the process of handling information input to the smart device based on the architecture shown in FIG. 1. Taking the information input to the smart device as a query sentence input by voice, the process may include the following three stages:
  • after receiving the query sentence input by voice, the smart device can upload the audio of the query sentence to the speech recognition service device, so that the speech recognition service device recognizes the audio as text through the speech recognition service and returns the text to the smart device.
  • the smart device before uploading the audio of the query sentence to the speech recognition service device, the smart device may perform denoising processing on the audio of the query sentence, where the denoising processing may include steps such as removing echo and environmental noise.
  • the smart device uploads the text of the query sentence recognized by the speech recognition service to the semantic service device, so that the semantic service device can perform semantic analysis on the text through the semantic service to obtain the business field and intention of the text.
  • the semantic service device sends a query instruction to the corresponding business service device to obtain the query result given by the business service.
  • the smart device can obtain and output the query result from the semantic service device.
  • the semantic service device can also send the semantic analysis result of the query sentence to the smart device, so that the smart device can output the feedback sentence in the semantic analysis result.
  • FIG. 1 is only an example, and does not limit the protection scope of the present application. In the embodiment of the present application, other architectures may also be used to implement similar functions. For example, all or part of the three processes may be completed by a smart terminal, which will not be described in detail here.
  • the smart device shown in FIG. 1 can be a display device, such as a smart TV. The function of the speech recognition service device can be realized by the sound collector and the controller of the display device working in cooperation, and the functions of the semantic service device and the business service device can be realized by the controller of the display device or by the server of the display device.
  • A voiceprint is the spectrum of sound waves carrying speech information, as displayed by electro-acoustic instruments. It is a biological feature composed of more than a hundred characteristic dimensions, such as wavelength, frequency, and intensity, and it is specific, measurable, and unique.
  • the current mainstream speaker clustering algorithm builds on speaker segmentation: based on the Bayesian information criterion (BIC), an agglomerative hierarchical clustering algorithm directly judges the speech segments produced by speaker segmentation and merges the segments belonging to the same speaker into one class.
  • the basic idea is to extract feature parameters, such as Mel cepstrum parameters, from each speech segment, calculate the similarity of the feature parameters between every two speech segments, and use BIC to judge whether the two most similar segments should be merged into the same class. This judgment is repeated over the remaining pairs of speech segments until no segments can be merged.
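  • The BIC-based merging decision described above can be sketched as follows. This is a minimal one-dimensional illustration, not the patent's implementation: real systems model multi-dimensional Mel cepstrum features with full covariance matrices, and all function names here are assumptions.

```python
import math

def delta_bic(seg1, seg2, lam=1.0):
    """Delta-BIC for merging two 1-D feature segments modeled as Gaussians.
    Negative values favor merging (likely the same speaker)."""
    n1, n2 = len(seg1), len(seg2)
    n = n1 + n2
    def var(xs):
        m = sum(xs) / len(xs)
        return max(sum((x - m) ** 2 for x in xs) / len(xs), 1e-9)
    d = 1  # feature dimension (scalar features in this sketch)
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * math.log(n)
    return (0.5 * n * math.log(var(seg1 + seg2))
            - 0.5 * n1 * math.log(var(seg1))
            - 0.5 * n2 * math.log(var(seg2))
            - penalty)

def cluster(segments, lam=1.0):
    """Agglomerative hierarchical clustering: repeatedly merge the pair with
    the lowest (negative) delta-BIC until no pair favors merging."""
    clusters = [list(s) for s in segments]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = delta_bic(clusters[i], clusters[j], lam)
                if score < 0 and (best is None or score < best[0]):
                    best = (score, i, j)
        if best is None:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

For example, two segments with similar feature statistics produce a negative delta-BIC and are merged into one class, while a segment with very different statistics stays in its own class.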
  • in the process of controlling smart terminal products by voice, the speech recognition module usually first recognizes the voice input by the user as text. The semantic analysis module then analyzes the lexical, syntactic, and semantic structure of the text to understand the user's intention. The recommended media asset information or data is then fed back to the smart device according to the retrieval intention.
  • current smart terminal products usually do not consider the current state and scene of the device when understanding user intentions, and analyze user intentions based only on the user dimension or network popularity. For example, if the user enters "Everyday Upward", the search results may include displaying the encyclopedia introduction of Everyday Upward, showing quizzes and answers about Everyday Upward, or playing the variety show Everyday Upward. If the current scene of the smart terminal is ignored, the execution result may deviate from the user's actual intention.
  • For example, if the current device is a TV, the user's intention may be to watch the Everyday Upward variety show; if the current device state is not considered, the execution result may be to display the encyclopedia introduction of Everyday Upward, so that the execution result deviates from the user's actual intention.
  • this application provides a user intention analysis method that considers not only the user dimension but also embeds device-dimension information during the intention analysis process, so that the intention analysis is more accurate and the terminal device can finally perform the corresponding operation accurately, improving the user experience.
  • Step S101: the speech text input by the user is obtained, and dependency syntax analysis is performed on the speech text to generate at least two syntax trees.
  • the voice text is obtained by analyzing the voice signal input by the user. Specifically, the user inputs the voice signal within the range of the terminal device receiving the signal.
  • the terminal device may collect a voice signal input by the user through a microphone, and then obtain and recognize the voice text from the voice signal.
  • the voice text can be recognized by the voice recognition server.
  • Semantic analysis is performed on the speech text by the semantic server.
  • word segmentation processing is first performed on the speech text.
  • based on the thesaurus, the forward maximum matching method is used for word segmentation. For example, after word segmentation, "Andy Lau's movie New Shaolin Temple" yields the segments "Andy Lau", "of" (的), "movie", "New Shaolin Temple".
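  • The forward maximum matching step can be sketched as follows. This is an illustrative implementation only; the function name, lexicon, and maximum word length are assumptions, not the patent's actual code.

```python
def forward_max_match(text, lexicon, max_len=5):
    """Forward maximum matching word segmentation: at each position, take the
    longest dictionary word that matches; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + size]
            if size == 1 or cand in lexicon:
                words.append(cand)
                i += size
                break
    return words
```

Using the example from the text, segmenting 刘德华的电影新少林寺 ("Andy Lau's movie New Shaolin Temple") against a lexicon containing 刘德华, 电影, and 新少林寺 yields the four segments described above.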
  • LAC (Lexical Analysis of Chinese)
  • the LAC lexical analysis tool is a combined lexical analysis model that can complete Chinese word segmentation and part-of-speech tagging as a whole, and can also add a custom dictionary to identify proper names.
  • the input of the LAC lexical analysis task is a string, and the output is the word boundary and part of speech in the media title.
  • the dependency syntax is used to extract the user's intent in the speech text according to the result of part-of-speech tagging.
  • Dependency syntax analysis uses global search to generate multiple syntax trees, each sentence corresponds to one or more syntax trees, and each syntax tree has probability values and user intentions.
  • a common practice in related technologies is that the system outputs the syntax tree with the highest probability, and the user intention of that syntax tree is determined as the user intention of the speech text.
  • the word segmentation and part-of-speech tagging tools used in this application are not limited to the LAC lexical analysis tool; other lexical analysis tools can also be used.
  • Step S102: the syntax tree whose user intention matches the device state information of the current device is determined as the optimal syntax tree, and the user intention of the optimal syntax tree is determined as the optimal user intention.
  • the device status information of the current device in this embodiment of the present application may include information such as device type, device mode, and terminal status.
  • the device type can be TV, refrigerator, speaker, etc.
  • the device mode can be TV mode, speaker mode, children’s mode, etc.
  • the terminal state can be the application or interface the device is currently in. Both the device mode and the terminal state are attached to the device type, so the three dimensions are interdependent.
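  • The three device-state dimensions could be represented as a simple structure, with intent matching checking only the dimensions an intention constrains. This is an illustrative sketch; all field and value names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class DeviceState:
    device_type: str      # e.g. "tv", "refrigerator", "speaker"
    device_mode: str      # e.g. "tv_mode", "children_mode"
    terminal_status: str  # current app or interface, e.g. "shopping_ui"

def intent_matches(intent_requirements, state):
    """An intent matches when every state dimension it constrains
    equals the device's current value for that dimension."""
    return all(getattr(state, dim) == val
               for dim, val in intent_requirements.items())
```

For instance, an intention that only requires the ingredient-management interface matches a smart refrigerator currently showing that interface, regardless of its device mode.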
  • all the syntax trees are matched against the device state information of the current device, and the matching syntax tree is the optimal syntax tree.
  • after the device receives the voice command input by the user, the information corresponding to the voice command and the current device state of the device are sent to the server together. The server performs speech recognition and semantic analysis, combines the resulting syntax trees with the device state to obtain the optimal syntax tree, and finally recommends media assets according to that syntax tree.
  • the current device is a display device
  • the device mode of the current device is a children's mode.
  • the user intent of syntax tree A is to play the live-action movie Mulan;
  • the user intent of syntax tree B is to play the cartoon Mulan.
  • the display device is not allowed to play live-action movies.
  • the user intention of the syntax tree A does not match the device state information of the current device, and the syntax tree A cannot be determined as the optimal syntax tree.
  • the display device is allowed to play cartoons.
  • the user intention of the syntax tree B matches the device status information of the current device, and the syntax tree B can be determined as the optimal syntax tree.
  • the user intention of syntax tree B to play cartoons is determined as the optimal user intention.
  • As an example of device state information at the device-interface level, suppose the device receives the user input "two jin of beef" and analyzes the syntax trees determined from the speech text input by the user.
  • the user intention of syntax tree A is to manage beef ingredients
  • the user intention of syntax tree B is to buy two jin of beef. Suppose the current device is a smart refrigerator and the device interface of the current device is an ingredient management interface; then the user intention "manage beef ingredients" of syntax tree A matches the device state information of the current device.
  • In that case, syntax tree A can be determined as the optimal syntax tree, and the user intention "manage beef ingredients" of syntax tree A can be determined as the optimal user intention. If instead the device interface of the current device is a shopping interface,
  • then syntax tree B can be determined as the optimal syntax tree, and the user intention "buy two jin of beef" of syntax tree B can be determined as the optimal user intention.
  • Step S103: if the probability values of the syntax trees are not all equal, and the user intention of the syntax tree with the largest probability value matches the device state information of the current device, the syntax tree with the largest probability value is determined as the optimal syntax tree, and the user intention of the optimal syntax tree is determined as the optimal user intention.
  • Here, "the probability values of all the syntax trees are not equal" can mean either that the probability values of all the syntax trees differ from one another, or that at least two of the syntax trees have unequal probability values.
  • For example, a syntax tree A, a syntax tree B, and a syntax tree C are obtained after performing semantic analysis on the speech text input by the user, where the probability values of the three syntax trees are not equal and the probability value of syntax tree A is the largest. If the user intention of syntax tree A matches the device state information of the current device, syntax tree A is determined as the optimal syntax tree, and its user intention as the optimal user intention.
  • the method of the embodiment of the present application further includes sorting all the syntax trees according to the descending order of the probability values. For example, the syntax tree A, syntax tree B, and syntax tree C of the above embodiment, wherein the probability value of syntax tree A is 0.96, the probability value of syntax tree B is less than 0.96, and the probability value of syntax tree C is smaller than the probability value of syntax tree B. Then, according to the probability value, they are sorted from large to small: syntax tree A, syntax tree B, and syntax tree C.
  • the optimal syntax tree it is first judged whether the user intention of the syntax tree A matches the device state information of the current device. If the user intention of the syntax tree A matches the device status information of the current device, then the syntax tree A is determined to be the optimal syntax tree. If the user intention of syntax tree A does not match the device state information of the current device, it is further judged whether the deviation between the probability value of syntax tree B ranked second and the probability value of syntax tree A is less than the deviation threshold.
  • If the difference between the probability value of syntax tree B and that of syntax tree A is less than the deviation threshold, it is further judged whether the user intention of syntax tree B matches the device state information of the current device.
  • If the user intention of syntax tree B matches the device state information of the current device, syntax tree B is determined as the optimal syntax tree, and its user intention as the optimal user intention. If it does not match, the same judgment is further performed on syntax tree C.
  • If the user intention of no syntax tree matches the device state information, syntax tree A, the tree with the largest probability value, is still determined to be the optimal syntax tree.
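  • The selection procedure of steps S102 and S103, including the deviation-threshold fallback, can be sketched as follows. This is an illustrative reading of the described procedure; the function name, tuple layout, and default deviation threshold are assumptions.

```python
def select_optimal_tree(trees, state_matches, deviation_threshold=0.2):
    """trees: list of (probability, intent) pairs; state_matches(intent)
    returns True when the intent matches the current device state.
    Returns the (probability, intent) chosen as the optimal syntax tree."""
    ranked = sorted(trees, key=lambda t: t[0], reverse=True)
    top_prob = ranked[0][0]
    for prob, intent in ranked:
        if top_prob - prob >= deviation_threshold:
            break  # remaining trees deviate too much from the best one
        if state_matches(intent):
            return (prob, intent)
    # no candidate matched the device state: keep the highest-probability tree
    return ranked[0]
```

The equal-probability case of step S102 falls out naturally: all deviations are zero, so every tree is considered in order and the first one matching the device state wins.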
  • a prompt may be displayed to the user, and the prompt is used to remind the user that the current device cannot perform the operation corresponding to the optimal user intention.
  • a manner of displaying the prompt to the user may be displaying the prompt on a display, or displaying the prompt by voice broadcast.
  • the traditional smart device media asset retrieval method relies on the user's explicit search intention. In some customized scenarios, if the user's clear search intention cannot be obtained, the smart device can only give the user a simple text reply, or even fail to reply at all. The traditional smart device media asset retrieval method therefore provides a poor user experience.
  • the application provides a media asset retrieval system, a frame diagram of which is shown in FIG. 3; the system includes a display device 200 and a server 400.
  • the display device 200 further includes a display, a communicator, a sound collector, and a controller.
  • the display is used to display the user interface.
  • the communicator is used for data communication with the server 400 .
  • the sound collector is used to collect voice information input by the user.
  • the server 400 is configured to provide various media resource information and media resource data to the display device.
  • the process for the user to use the media asset retrieval system of this embodiment to perform media asset retrieval is specifically:
  • the user inputs an instruction for waking up the voice interaction function of the display device, and drives the sound collector to start working according to the instruction.
  • the tool for waking up the semantic interaction function of the display device can be a built-in or installed application, such as a voice assistant.
  • the way to wake up the voice assistant may be to wake up through the first voice information input by the user in the far field.
  • the first voice information is a preset wake-up word, such as "Hello, Xiaodu" or "Hisense Small Gathering", so as to wake up the voice interaction function of the display device.
  • the wake-up word can be set by the user, such as "I love my home", “TV TV” and so on.
  • the user may also directly touch the voice key on the remote controller, and the display device starts the voice assistant service according to the key instruction.
  • After waking up the voice interaction function of the display device, the user performs voice interaction with the display device, and the sound collector collects further voice information input by the user. If no search keyword that can be used to search for media asset content is obtained from the sound collector, that is, no clear user intention can be obtained, a candidate media asset request is sent directly to the server.
  • the server searches for candidate media asset information according to the candidate media asset request and feeds the candidate media asset information back to the display device. After receiving the candidate media asset information, the display device displays it on the display.
  • the situation that the display device receives the voice instruction can be determined according to the situation that the voice collector collects the voice information.
  • the second voice information further input by the user is not received, or the search keyword cannot be recognized from the second voice information.
  • the process of identifying user intention from voice information is a related technology, and this application will not elaborate on it.
  • second voice information further input by the user is received, and search keywords are identified from the second voice information, but the identified search keywords cannot be used to search for media asset content.
  • the identified search keyword is not a preset keyword, that is, the search keyword is not a keyword indicating the business scope of the display device.
  • the server can also feed back corresponding media asset information according to the different scenes in which the display device is located, and the corresponding media asset information is displayed on the display, so as to avoid the situation of giving no reply.
  • the first scenario may be a scenario where there is no content input for a period of time after the user wakes up the voice assistant. For example, after the user enters the wake-up word "Hello, Xiaodu" and there is no content input, the search keyword for searching media asset content cannot be identified from the wake-up word. At this point, it may be determined that the current scene of the display device is the first scene, and the display device sends a media asset request to the server, where the media asset request carries information about the first scene. The server searches for corresponding first media asset information according to the first scene information, and feeds back the first media asset information.
  • the second scenario may be that the user further inputs voice information after waking up the voice assistant, and can identify search keywords from the input voice information.
  • this search keyword is not within the scope of the display device business. For example, after the user wakes up the voice assistant, and then enters the voice message "play XX game video".
  • the search keyword "XX game video" can be identified from the voice information, but "XX game video" is not a preset keyword; that is, XX game video is beyond the business scope of the display device.
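  • The two fallback scenes can be sketched as a small classification helper. This is illustrative only; the scene labels, keyword set, and function names are assumptions rather than the patent's implementation.

```python
PRESET_KEYWORDS = {"movie", "music", "weather", "recipe"}  # assumed business scope

def classify_scene(further_input, extract_keyword=None):
    """Return which fallback scene the display device is in, or None when
    the keyword is within the business scope and normal retrieval proceeds."""
    if not further_input:
        return "scene_1"  # wake-up word only, nothing further spoken
    keyword = (extract_keyword or (lambda t: t))(further_input)
    if keyword not in PRESET_KEYWORDS:
        return "scene_2"  # keyword outside the device's business scope
    return None
```

For example, silence after the wake-up word yields the first scene, "XX game video" yields the second scene, and an in-scope keyword such as "music" triggers no fallback at all.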
  • the specific process of receiving from the server the candidate media resource information searched according to the candidate media resource request may be:
  • It is judged whether voiceprint information can be determined from the first voice information; if voiceprint information can be determined from the first voice information, the voiceprint information is sent to the server.
  • the server determines the user profile based on the voiceprint information, and then searches for alternative media information based on the user profile.
  • Voiceprint information may include voiceprint ID and voiceprint attributes. If both the voiceprint ID and the voiceprint attribute can be determined from the first voice information, since each user has a unique voiceprint ID, the user profile is determined according to the voiceprint ID.
  • the voiceprint ID is sent to the server.
  • the server determines the user portrait uniquely corresponding to the voiceprint ID according to the voiceprint ID.
  • the server searches for candidate media resource information according to the determined user profile.
  • the display device may be a family TV, and at this time, the display device stores the voiceprint IDs of family members according to the voice access history.
  • the server stores the voiceprint IDs of grandpa, grandma, father, and mother.
  • the display device first sends the device ID of the display device to the server. The server searches for the voiceprint ID corresponding to the device according to the device ID.
  • Since the grandfather's voiceprint ID is stored in advance, it can be recognized from the input voice information according to the voiceprint characteristics. The corresponding user portrait is further determined according to the grandfather's voiceprint ID, and candidate media asset information is then searched based on that user portrait; in this way, the media asset information determined through the user portrait is related to the current user. If a guest uses the display device to input voice information, the display device first sends the device ID of the display device to the server; since the guest's voiceprint ID is not stored in advance, the server cannot determine a voiceprint ID from the voice information.
  • the voiceprint attribute is sent to the server.
  • the server determines the corresponding user portrait according to the voiceprint attribute, and searches for candidate media resource information according to the user portrait.
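  • The preference order described above, voiceprint ID first and voiceprint attribute as the fallback, can be sketched as follows. The function name and the dictionary-based profile tables are illustrative assumptions.

```python
def recommend_profile(voiceprint_id, voiceprint_attr, id_profiles, attr_profiles):
    """Prefer the unique per-user portrait keyed by voiceprint ID; fall back
    to the coarser portrait keyed by voiceprint attribute (e.g. age/gender)."""
    if voiceprint_id is not None and voiceprint_id in id_profiles:
        return id_profiles[voiceprint_id]
    if voiceprint_attr is not None and voiceprint_attr in attr_profiles:
        return attr_profiles[voiceprint_attr]
    return None  # guest with unknown voiceprint: no personalized portrait
```

A stored family member resolves to a unique portrait via the voiceprint ID, while a guest whose ID is unknown falls back to the attribute-level portrait (for example, "child" mapping to cartoon-related media assets).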
  • the voiceprint attribute here may be a user characteristic of a type of user. The user characteristics may include the user's gender, age and other physiological characteristics.
  • the voiceprint attribute determined from the voice information is a middle-aged male
  • the determined user portrait corresponds to a middle-aged male.
  • the media information searched based on user portraits may be related to finance, automobiles, etc. If the voiceprint attribute determined from the voice information is a child, the determined user portrait corresponds to the child.
  • the media resource information searched based on the user portrait may be media resource information related to cartoons.
  • The recognition history of the device can also be tallied by voiceprint attribute; that is, all the voiceprint attributes recognized by the device are counted. If the proportion of the recognition history records of a certain voiceprint attribute exceeds a preset threshold, that voiceprint attribute is sent to the server; an attribute whose proportion exceeds the threshold indicates that this type of user uses the display device the most.
  • For example, if the voiceprint attribute "child" accounts for more than 80% of the recognition history records, child users use the display device the most. The voiceprint attribute "child" is then sent to the server, so that the server can feed back media asset information corresponding to a child's user portrait.
  • The voiceprint ID or voiceprint attribute can also be determined from the voice information the user input last time. It should be noted that the interval between the moment the user last input voice information and the moment the voice assistant is awakened must not exceed a preset time; for example, the voice assistant is currently awakened within 30 seconds of the moment the voice information was last input.
  • the user portrait storage structure includes at least two preference fields, and each preference field includes at least two query dimensions.
  • each preference field is set with a preference field weight,
  • and each query dimension is set with a query dimension weight.
  • Different user portrait storage structures include different preference fields and query dimensions.
  • the user portrait includes tendency fields such as "movie", "music", "recipe", and "variety show".
  • the tendency field "movies" includes query dimensions such as "war movies" and "action movies".
  • the tendency field "music" includes query dimensions such as "pop".
  • the tendency field "recipes" includes query dimensions such as "Cantonese cuisine" and "Sichuan cuisine".
  • the tendency field "variety show" includes query dimensions such as "reality show" and "blind date".
  • the tendency fields in the above examples all have tendency-field weights, and the tendency-field weights can be set according to the user portrait, for example, according to the user's number of viewings.
  • Query dimensions likewise have query-dimension weights, which can also be set according to the user portrait.
  • the top-ranked tendency fields can be selected using a weighted random algorithm; for example, the top three tendency fields by weight are "movies", "music", and "recipes".
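A weighted random ranking of tendency fields might be sketched as follows. The field names and weights are hypothetical; `random.choices` draws with probability proportional to weight, and distinct top picks are obtained by removing each pick from the pool.

```python
import random

def weighted_pick(fields):
    """Pick one tendency field at random, with probability proportional
    to its weight (a weighted random algorithm)."""
    names = list(fields)
    weights = [fields[n] for n in names]
    return random.choices(names, weights=weights, k=1)[0]

def weighted_top(fields, k=3):
    """Draw k distinct tendency fields by repeated weighted picks."""
    pool = dict(fields)
    picks = []
    for _ in range(k):
        choice = weighted_pick(pool)
        picks.append(choice)
        del pool[choice]  # ensure distinct picks
    return picks

# Hypothetical weights, e.g. derived from the user's viewing counts.
fields = {"movies": 50, "music": 30, "recipes": 15, "variety show": 5}
print(weighted_top(fields))
```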
  • the media asset database in the embodiment of the present application is provided with at least two media asset cards, and the media asset cards correspond to the tendency fields.
  • media asset cards such as "movies", "music", and "recipes" are set in the media asset library.
  • each media asset card is also assigned a weight.
  • the final card is selected according to the weights of the media asset cards.
  • a weighted random algorithm can also be used here; for example, the selected final card is "music", that is, the finally determined tendency field is "music".
  • a weighted random algorithm is then used to determine the final query dimension; for example, the final query dimension is determined to be "pop".
  • the media asset query is performed based on the media asset card "music" and the query dimension "pop".
  • media asset information under the media asset card "music" and the query dimension "pop" can be randomly fed back to the user; for example, media asset information about pop songs sung by Xu Wei is fed back.
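The two-stage selection described above (weighted random choice of the final card, then of the final query dimension within that card) can be sketched as below. The cards, dimensions, and weights are made-up examples, not values from the original disclosure.

```python
import random

def weighted_choice(options):
    """One weighted random draw from a {name: weight} mapping."""
    names = list(options)
    return random.choices(names, weights=[options[n] for n in names], k=1)[0]

# Hypothetical card weights and per-card query-dimension weights.
card_weights = {"movies": 3, "music": 5, "recipes": 2}
dimension_weights = {
    "movies": {"war movies": 2, "action movies": 3},
    "music": {"pop": 4, "classical": 1},
    "recipes": {"Cantonese cuisine": 1, "Sichuan cuisine": 1},
}

# Stage 1: pick the final card; stage 2: pick the query dimension in it.
final_card = weighted_choice(card_weights)
final_dimension = weighted_choice(dimension_weights[final_card])
print(final_card, final_dimension)
```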
  • different media asset libraries (i.e., card pools) may be set for different scenarios.
  • the first scene is a scene in which no second voice information is input or a search keyword cannot be recognized from the second voice information, for example, it may be a scene in which no content is input for a period of time after the user wakes up the voice assistant.
  • the server stores the card pool shown in Table 1.
  • Table 1 (card name and card type): 1. educate (edu); 2. broadcast (fm); 3. game (game); 4. application (app); 5. music (client_music); 6. help information (client_helpinfo); 7. TV drama (tvplay); 8. movie (film)
  • the card pool stored by the server contains more guessed cards that the user may like.
  • the second scenario is that the search keyword can be recognized from the voice information input by the user, but the search keyword cannot be used to search for media content, that is, the user's intention is beyond the business scope of the display device.
  • the server stores the card pool shown in Table 2.
  • the embodiment shown in FIG. 4 is a scenario where there is no content input for a period of time after the user wakes up the voice assistant.
  • the display device can obtain three types of media asset cards from the server. All three types of cards are used to guide users to voice input.
  • the first card in Fig. 4 guides the user to input voice information such as "some nice music", "today's hot news", "today's weather", and so on.
  • this application can also set specific card pools for other scenarios, and other scenarios can be system-side custom scenarios.
  • the voice message "good morning” when the voice message "good morning” is received, it may be determined that the current scene of the display device is a morning greeting scene. After that, identify the voiceprint ID or voiceprint attribute from the voice information, and obtain the media resource card for the morning greeting scene from the server according to the voiceprint ID or voiceprint attribute.
  • the voice message "I'm home” When the voice message "I'm home” is received, it can be determined that the current scene of the display device is the home scene. According to the voiceprint ID or voiceprint attribute, the media resource card for the home scene is obtained from the server.
  • the media resource card used to guide the operation of the APP interface can be obtained from the server.
  • the media resource card used to guide how to eliminate the fault can be obtained from the server.
  • if the received voice message is a complaint, for example the voice message "I am very tired today", then after the scene is detected, media asset cards related to soothing music and funny movies can be obtained from the server.
  • different prompts may also be provided according to specific scenarios. For example, display the greeting “Good morning”, “Good evening”, etc. on the user interface according to the time. Or in the coming home scenario, display the greeting "Welcome home” on the user interface.
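The scene detection in the preceding examples could be sketched as a simple lookup from a received voice message to a scene and an optional greeting. The trigger phrases, scene names, and greetings here are illustrative assumptions.

```python
# Hypothetical mapping from trigger phrase to (scene, greeting).
SCENES = {
    "good morning": ("morning_greeting", "Good morning"),
    "i'm home": ("coming_home", "Welcome home"),
    "i am very tired today": ("complaint", None),
}

def detect_scene(message):
    """Return (scene, greeting) for a voice message, or a default scene
    when no known trigger phrase matches."""
    key = message.strip().lower()
    return SCENES.get(key, ("default", None))

scene, greeting = detect_scene("I'm home")
print(scene, greeting)  # coming_home Welcome home
```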
  • the embodiment of the present application provides a content recommendation method, as shown in the signaling diagram of the content recommendation method in Figure 5, the method includes the following steps:
  • Step 501: Receive an instruction input by the user for activating the voice interaction function, and drive the sound collector to start according to the instruction, where the instruction is input in the form of a first voice message or a button.
  • Step 502: After starting the sound collector, if no search keywords that can be used to search for media asset content are obtained, send the voiceprint information related to the first voice information to the server.
  • Step 503: After the display device receives the candidate media asset information fed back by the server, it displays the candidate media asset information on the display, where the candidate media asset information is determined by the server according to the voiceprint information associated with the first voice information.
  • the embodiment of the present application provides another content recommendation method, as shown in FIG. 6, the method includes the following steps:
  • Step 601: Receive an instruction for activating the voice listening function, and drive the sound collector to start, where the instruction for activating the voice listening function may be input in the form of a first voice message or a button.
  • Step 602: After starting the sound collector, if no search keywords that can be used to search for media asset content are obtained, extract voiceprint information from the first voice information.
  • Step 603: Send a candidate media asset request to the server, where the candidate media asset request carries the voiceprint information.
  • the server determines the corresponding user portrait according to the voiceprint information, searches for the corresponding candidate media asset information in the server's media asset database according to the user portrait, and feeds the candidate media asset information back to the display device. After the display device receives the fed-back candidate media asset information, it displays the candidate media asset information on the display.
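The server-side portion of steps 601 to 603 (resolving a voiceprint attribute to a user portrait and then to candidate media assets) might look like the following sketch. The portraits and media asset entries are hypothetical placeholders, not data from the original.

```python
# Hypothetical server-side tables: voiceprint attribute -> user portrait,
# and user portrait -> candidate media asset information.
USER_PORTRAITS = {"child": "cartoon fan", "middle-aged male": "finance & cars"}
MEDIA_DB = {
    "cartoon fan": ["cartoon A", "cartoon B"],
    "finance & cars": ["stock news", "auto show"],
}

def recommend(voiceprint_attribute):
    """Resolve a voiceprint attribute to candidate media asset information,
    returning an empty list when nothing matches."""
    portrait = USER_PORTRAITS.get(voiceprint_attribute)
    if portrait is None:
        return []
    return MEDIA_DB.get(portrait, [])

print(recommend("child"))  # ['cartoon A', 'cartoon B']
```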
  • the computer instructions for realizing the method of the present invention can be carried by any combination of one or more computer-readable storage media.
  • a non-transitory computer-readable storage medium may include any computer-readable medium except the transitory propagating signal itself.
  • a computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • Computer program code for carrying out the operations of the present invention may be written in one or more programming languages, or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages; in particular, the Python language, which is suited to neural network computing, and platform frameworks based on TensorFlow, PyTorch, etc. can be used.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer via any kind of network, including a local area network (LAN) or wide area network (WAN), or to an external computer (e.g., via an Internet connection using an Internet service provider).

Abstract

A display device, which may execute dependency syntactic parsing on speech text, so as to obtain at least two syntactic trees. If the probability values of the syntactic trees are equal to each other, then the syntactic tree which has a user intention that matches device state information of the current device is determined to be an optimal syntactic tree (S102). If the probability values of the syntactic trees are not equal to each other, and a user intention of the syntactic tree which has the maximum probability value matches the device state information of the current device, then the syntactic tree which has the maximum probability value is determined to be the optimal syntactic tree, and the user intention of the optimal syntactic tree is determined to be an optimal user intention (S103).

Description

Display Device

Cross-Reference to Related Applications

This application claims priority to the Chinese application No. 202110865048.9, filed on July 29, 2021, and the Chinese application No. 202110934690.8, filed on August 16, 2021, the entire contents of which are incorporated herein by reference.
Technical Field

The present application relates to the technical field of voice interaction, and in particular to a content recommendation method and apparatus.
Background

With the development of intelligent voice interaction technology, the voice interaction function has gradually become a standard feature of intelligent terminal products. Users can use the voice interaction function to control intelligent terminal products by voice and perform a series of operations such as watching videos, listening to music, checking the weather, and controlling the TV.

The process of controlling an intelligent terminal product by voice is usually as follows: a speech recognition module recognizes the voice input by the user as text; a semantic analysis module then performs lexical, syntactic, and semantic analysis on the text to understand the user's intention; finally, the control terminal controls the intelligent terminal product to perform the corresponding operation according to the understanding result.
Summary

An embodiment of the present application provides a user intention analysis method, the method including: acquiring voice text input by a user, performing semantic analysis on the voice text, and generating at least two syntax trees, where each syntax tree has a probability value and a user intention, the probability value being the probability that the system outputs the syntax tree; when the probability values of the syntax trees are all equal, determining the syntax tree whose user intention matches the device state information of the current device as the optimal syntax tree, and determining the user intention of the optimal syntax tree as the optimal user intention; and when the probability values of the syntax trees are not all equal and the user intention of the syntax tree with the largest probability value matches the device state information of the current device, determining the syntax tree with the largest probability value as the optimal syntax tree, and determining the user intention of the optimal syntax tree as the optimal user intention.
Brief Description of the Drawings

Fig. 1 is a schematic diagram of the principle of voice interaction according to some embodiments;

Fig. 2 is a schematic flowchart of a user intention analysis method according to some embodiments;

Fig. 3 is a block diagram of a media asset retrieval system according to some embodiments;

Fig. 4 is a schematic diagram of a user interface in the display device 200 according to some embodiments;

Fig. 5 is a signaling diagram of a content recommendation method according to some embodiments;

Fig. 6 is a signaling diagram of another content recommendation method according to some embodiments.
Detailed Description

In order to make the purpose and implementations of the present application clearer, the exemplary implementations of the present application will be described clearly and completely below with reference to the accompanying drawings of the exemplary embodiments. Obviously, the described exemplary embodiments are only some of the embodiments of the present application, not all of them.

It should be noted that the brief descriptions of terms in this application are only for the convenience of understanding the implementations described below, and are not intended to limit the implementations of this application. Unless otherwise stated, these terms should be understood according to their ordinary and usual meanings.

The terms "first", "second", "third", etc. in the specification, claims, and accompanying drawings of this application are used to distinguish similar or same-type objects or entities, and do not necessarily imply a specific order or sequence unless otherwise noted. It should be understood that terms so used are interchangeable where appropriate.

The terms "comprising" and "having", as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a product or device comprising a series of components is not necessarily limited to the components expressly listed, but may include other components not expressly listed or inherent to the product or device.

The term "module" refers to any known or later-developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code capable of performing the function associated with that element.
To clearly illustrate the embodiments of the present application, a speech recognition network architecture provided by an embodiment of the present application is described below with reference to Fig. 1.

Referring to Fig. 1, Fig. 1 is a schematic diagram of a speech recognition network architecture provided by an embodiment of the present application. In Fig. 1, the smart device is used to receive input information and output the processing results of that information. The speech recognition service device is an electronic device deployed with a speech recognition service, the semantic service device is an electronic device deployed with a semantic service, and the business service device is an electronic device deployed with a business service. The electronic devices here may include servers, computers, and the like. The speech recognition service, semantic service (also called a semantic engine), and business service here are web services that can be deployed on electronic devices, where the speech recognition service is used to recognize audio as text, the semantic service is used to perform semantic parsing on the text, and the business service is used to provide specific services such as the weather query service of Moji Weather or the music query service of QQ Music. In one embodiment, there may be multiple entity service devices deployed with different business services in the architecture shown in Fig. 1, or one or more functional services may be integrated in one or more entity service devices.
In some embodiments, the process of handling information input to the smart device based on the architecture shown in Fig. 1 is described below by example. Taking the information input to the smart device as a query sentence input by voice, the process may include the following three stages:

[Speech recognition]

After receiving the query sentence input by voice, the smart device may upload the audio of the query sentence to the speech recognition service device, so that the speech recognition service device recognizes the audio as text through the speech recognition service and returns the text to the smart device. In one embodiment, before uploading the audio of the query sentence to the speech recognition service device, the smart device may denoise the audio of the query sentence; the denoising here may include steps such as removing echo and environmental noise.

[Semantic understanding]

The smart device uploads the text of the query sentence recognized by the speech recognition service to the semantic service device, so that the semantic service device performs semantic parsing on the text through the semantic service to obtain the business field, intention, and so on of the text.

[Semantic response]

According to the semantic parsing result of the text of the query sentence, the semantic service device sends a query instruction to the corresponding business service device to obtain the query result given by the business service. The smart device may obtain the query result from the semantic service device and output it. As an embodiment, the semantic service device may also send the semantic parsing result of the query sentence to the smart device, so that the smart device outputs the feedback sentence in the semantic parsing result.

It should be noted that the architecture shown in Fig. 1 is only an example and does not limit the protection scope of the present application. In the embodiments of the present application, other architectures may also be used to implement similar functions; for example, all or part of the three stages may be completed by the smart terminal, which will not be described in detail here.

In some embodiments, the smart device shown in Fig. 1 may be a display device, such as a smart TV. The function of the speech recognition service device may be realized by the sound collector and the controller provided on the display device working together, and the functions of the semantic service device and the business service device may be realized by the controller of the display device or by the server of the display device.
With the development of intelligent voice interaction technology, the voice interaction function has gradually become a standard feature of intelligent terminal products. Users can use the voice interaction function to control intelligent terminal products by voice and perform a series of operations such as watching videos, listening to music, checking the weather, and controlling the TV.
To clearly illustrate the embodiments of the present application, some technical terms are explained below:

[Voiceprint]

A voiceprint is the sound wave spectrum carrying speech information displayed by electro-acoustic instruments. It is a biometric feature composed of more than a hundred characteristic dimensions such as wavelength, frequency, and intensity, and has characteristics of being undetermined, measurable, and unique.

The current mainstream speaker clustering algorithm builds on speaker segmentation: based on the Bayesian information criterion (BIC), an agglomerative hierarchical clustering algorithm is used to directly judge the speech segments after speaker segmentation, merging the segments belonging to the same speaker into one class. The basic idea is to extract feature parameters, such as Mel cepstrum parameters, from each speech segment, calculate the similarity of the feature parameters between every two speech segments, and use the BIC to judge whether the two most similar speech segments should be merged into the same class. This judgment is performed on every pair of speech segments until no segments can be merged.

[User portrait]

By collecting data in various dimensions such as a user's social attributes, consumption habits, and preference characteristics, the characteristic attributes of the user or product are depicted, and these characteristics are statistically analyzed to mine potential value information, thereby abstracting a complete picture of a user. User portraits are a prerequisite for targeted advertising and personalized recommendation.

The process of controlling an intelligent terminal product by voice is usually as follows: a speech recognition module recognizes the voice input by the user as text; a semantic analysis module then performs lexical, syntactic, and semantic analysis on the text to understand the user's intention; the recommended media asset information or media asset data is then fed back to the smart device according to the retrieval intention.

However, when understanding user intentions, current intelligent terminal products usually do not consider the current state and scene of the device, and parse user intentions only based on the user dimension or network popularity. For example, if the user inputs "天天向上", the search may return results such as displaying the encyclopedia introduction of "天天向上", fun quizzes about it, or playing the "天天向上" variety show. If the current scene of the smart terminal is ignored, the execution result may deviate from the user's actual intention. For example, when the smart terminal is currently in a video playback application, the user's intention is probably to watch the "天天向上" variety show; if the current device state is not considered, the execution result may be to display the encyclopedia introduction instead, which deviates from the user's actual intention.
To solve the above problems, this application provides a user intention analysis method that embeds not only the user dimension but also device-dimension information into the intention analysis process, making the intention analysis more accurate, so that the terminal device can ultimately execute the corresponding operation accurately and the user experience is improved.

Fig. 2 is a schematic flowchart of the semantic understanding method, which includes the following steps:

Step S101: acquire the voice text input by the user, analyze the voice text using dependency syntactic parsing, and generate at least two syntax trees.

It should be noted that when analyzing the voice text with dependency syntactic parsing, it is possible that only one syntax tree is generated; in that case, the uniquely corresponding result is simply executed. The embodiments of the present application describe the solution by taking the generation of at least two syntax trees as an example.

The voice text is obtained by parsing the voice signal input by the user. Specifically, the user inputs the voice signal within the range in which the terminal device can receive the signal. The terminal device may collect the voice signal input by the user through a microphone and then recognize the voice text from the voice signal.

In the embodiments of the present application, the voice text may be recognized by a speech recognition server, and semantic analysis of the voice text is performed by a semantic server. Specifically, word segmentation is first performed on the voice text. Based on a thesaurus, the forward maximum matching method may be used for word segmentation. For example, for "Andy Lau's movie New Shaolin Temple", word segmentation yields "Andy Lau / 's / movie / New Shaolin Temple".

Part-of-speech tagging is then performed on the segmented words. For example, the LAC (Lexical Analysis of Chinese) lexical analysis tool may be used to perform Chinese word segmentation and part-of-speech tagging on media asset titles. The LAC lexical analysis tool is a joint lexical analysis model that can perform Chinese word segmentation and part-of-speech tagging as a whole, and a custom dictionary can be added to recognize proper names. The input of the LAC lexical analysis task is a string, and the output is the word boundaries and parts of speech in the media asset title. Dependency syntactic parsing is then used to extract the user intention in the voice text according to the part-of-speech tagging results. Dependency syntactic parsing uses a global search to generate multiple syntax trees: each sentence corresponds to one or more syntax trees, and each syntax tree has a probability value and a user intention. The usual practice in the related art is for the system to output the syntax tree with the highest probability, and finally to determine the user intention of that highest-probability syntax tree as the user intention of the voice text.
需要说明的是,本申请所使用的分词和词性标注工具不限于LAC词法分析工具,还可以使用其他的词法分析工具。It should be noted that the word segmentation and part-of-speech tagging tools used in this application are not limited to the LAC lexical analysis tool, and other lexical analysis tools can also be used.
Step S102: determine, as the optimal syntax tree, the syntax tree whose user intent matches the device state information of the current device, and determine the user intent of the optimal syntax tree as the optimal user intent.

In the embodiments of this application, the device state information of the current device may include the device type, the device mode, and the terminal state. The device type may be a TV, a refrigerator, a speaker, and so on; the device mode may be a TV mode, a speaker mode, a children's mode, and so on; and the terminal state may be the application or interface the device is currently in. Both the device mode and the terminal state depend on the device type, so the three dimensions are interdependent. When determining the optimal syntax tree, all the syntax trees are matched against the device state information of the current device, and the matching syntax tree is the optimal syntax tree.

It should be noted that different devices support different skills, the same device supports different skills in different modes, and the same device in the same mode supports different skills on different interfaces.

In the embodiments of this application, after a voice command input by the user is received, the information corresponding to the voice command and the current device state are sent to the server together. After performing speech recognition and semantic parsing, the server jointly evaluates the resulting syntax trees against the device state to obtain the optimal syntax tree, and finally recommends media assets according to that optimal syntax tree.

An example of device state information at the device-mode level: the current device is a display device whose device mode is children's mode. The voice input "Play Mulan" is received, and two syntax trees are parsed out. The user intent of syntax tree A is to play the live-action movie Mulan; the user intent of syntax tree B is to play the cartoon Mulan. When the display device is in children's mode, it is not allowed to play live-action movies, so the user intent of syntax tree A does not match the device state information of the current device, and syntax tree A cannot be determined as the optimal syntax tree. In children's mode the display device is allowed to play cartoons, so the user intent of syntax tree B matches the device state information, and syntax tree B can be determined as the optimal syntax tree. Finally, the user intent of syntax tree B, playing the cartoon, is determined as the optimal user intent.

An example of device state information at the device-interface level: the user input "二两牛肉" ("two liang of beef") is received, and the syntax trees determined from the user's speech text are analyzed. The user intent of syntax tree A is to manage beef ingredients; the user intent of syntax tree B is to buy two catties of beef. The current device is a smart refrigerator. If the device interface of the current device is the ingredient management interface, the user intent of syntax tree A, "manage beef ingredients", matches the device state information of the current device; syntax tree A can be determined as the optimal syntax tree, and its user intent "manage beef ingredients" as the optimal user intent. If the device interface of the current device is the shopping interface, the user intent of syntax tree B, "buy two catties of beef", matches the device state information of the current device; syntax tree B can be determined as the optimal syntax tree, and its user intent "buy two catties of beef" as the optimal user intent.
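The matching in step S102 can be sketched as a lookup keyed on the three interdependent dimensions (device type, device mode, interface). All names below — `SyntaxTree`, `DeviceState`, the `ALLOWED_INTENTS` table and its entries — are illustrative assumptions for this sketch, not part of the patent:

```python
from dataclasses import dataclass

@dataclass
class SyntaxTree:
    intent: str
    probability: float

@dataclass
class DeviceState:
    device_type: str   # e.g. "tv", "fridge"
    mode: str          # e.g. "kids", "normal"
    interface: str     # e.g. "home", "shopping"

# Which intents each (type, mode, interface) combination supports (illustrative).
ALLOWED_INTENTS = {
    ("tv", "kids", "home"): {"play_cartoon"},
    ("tv", "normal", "home"): {"play_live_action", "play_cartoon"},
    ("fridge", "normal", "ingredient_management"): {"manage_ingredients"},
    ("fridge", "normal", "shopping"): {"purchase"},
}

def optimal_tree(trees, state):
    """Return the first syntax tree whose intent the device state supports."""
    allowed = ALLOWED_INTENTS.get(
        (state.device_type, state.mode, state.interface), set())
    for tree in trees:
        if tree.intent in allowed:
            return tree
    return None

trees = [SyntaxTree("play_live_action", 0.6), SyntaxTree("play_cartoon", 0.4)]
best = optimal_tree(trees, DeviceState("tv", "kids", "home"))
# best.intent == "play_cartoon": in kids mode the live-action intent is rejected
```

The table encodes the dependency noted above: the same device supports different skills per mode, and per interface within a mode.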
Step S103: if the probability values of all the syntax trees are not all equal, and the user intent of the syntax tree with the highest probability value matches the device state information of the current device, determine the syntax tree with the highest probability value as the optimal syntax tree, and determine its user intent as the optimal user intent. It should be noted that "not all equal" here may mean either that the probability values of all the syntax trees are pairwise unequal, or that at least two of the syntax trees have unequal probability values.

For example, semantic analysis of the speech text input by the user yields syntax tree A, syntax tree B, and syntax tree C, whose probability values are all unequal, with syntax tree A having the highest. If the user intent of syntax tree A matches the device state information of the current device, syntax tree A is determined as the optimal syntax tree, and its user intent as the optimal user intent.

In some embodiments, as shown in the flow chart of FIG. 3, the method of the embodiments of this application further includes sorting all the syntax trees by probability value in descending order. For example, for syntax trees A, B, and C of the above embodiment, the probability value of syntax tree A is 0.96, the probability value of syntax tree B is less than 0.96, and the probability value of syntax tree C is less than that of syntax tree B. Sorted in descending order of probability, they are: syntax tree A, syntax tree B, syntax tree C.

When determining the optimal syntax tree, it is first judged whether the user intent of syntax tree A matches the device state information of the current device. If it matches, syntax tree A is determined as the optimal syntax tree. If it does not match, it is further judged whether the deviation between the probability value of syntax tree B, ranked second, and that of syntax tree A is less than a deviation threshold.

If the deviation between the probability value of syntax tree B and that of syntax tree A is less than the deviation threshold, it is further judged whether the user intent of syntax tree B matches the device state information of the current device.

If the user intent of syntax tree B matches the device state information of the current device, syntax tree B is determined as the optimal syntax tree, and its user intent as the optimal user intent. If it does not match, the same judgment is then applied to syntax tree C.

If the deviation between the probability value of syntax tree B and that of syntax tree A is greater than the deviation threshold, syntax tree A is still determined as the optimal syntax tree. In this case, since the user intent of syntax tree A does not match the device state information of the current device, a prompt may be shown to the user indicating that the current device cannot perform the operation corresponding to the optimal user intent. The prompt may be shown on the display, or announced by voice broadcast.
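The selection loop of the preceding paragraphs can be condensed into one function. This is a minimal sketch, assuming trees are `(intent, probability)` pairs and `allowed` is the set of intents the current device state supports; the 0.1 deviation threshold is an illustrative value, as the patent leaves the threshold unspecified:

```python
def select_optimal(trees, allowed, threshold=0.1):
    """Walk the trees in descending probability order, stopping once a
    candidate falls behind the top probability by more than the deviation
    threshold. Return (tree, matched); matched=False means the top tree is
    kept but its intent is unsupported, so a prompt should be shown."""
    ranked = sorted(trees, key=lambda t: t[1], reverse=True)
    top = ranked[0]
    for tree in ranked:
        if top[1] - tree[1] > threshold:
            break  # deviation too large: ignore the remaining trees
        if tree[0] in allowed:
            return tree, True
    return top, False

trees = [("play_live_action", 0.96), ("play_cartoon", 0.92), ("play_song", 0.5)]
tree, matched = select_optimal(trees, allowed={"play_cartoon"})
# tree == ("play_cartoon", 0.92), matched is True: B is within 0.1 of A
```

When no candidate within the threshold matches (for example `allowed={"play_song"}` above), the top tree A is returned with `matched=False`, mirroring the fallback-with-prompt behaviour described in the text.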
In some embodiments, traditional media asset retrieval on smart devices relies on the user's explicit search intent. In some customized scenarios, if the user's explicit search intent cannot be obtained, the smart device can only give the user a simple text reply, or even no reply at all. The traditional smart-device media asset retrieval approach therefore gives users a poor experience.

This application provides a media asset retrieval system, shown in the framework diagram of FIG. 3, which includes a display device 200 and a server 400. The display device 200 in turn includes a display, a communicator, a sound collector, and a controller. The display is used to present the user interface; the communicator is used for data communication with the server 400; the sound collector collects voice information input by the user; and the server 400 is configured to provide the display device with various media asset information and media asset data.

In some embodiments, the process by which a user performs media asset retrieval with the system of this embodiment is as follows:

First, the user inputs an instruction for waking up the voice interaction function of the display device, and the sound collector is driven to start working according to the instruction. The tool that wakes up the voice interaction function of the display device may be a built-in or installed application, such as a voice assistant.

In some optional embodiments, the voice assistant may be woken up by first voice information input by the user in the far field. For example, the first voice information is a preset wake-up word: when the user says "小度，小度", "海信小聚", or another preset wake-up word, the voice interaction function of the display device is woken up. In some optional embodiments, the wake-up word may be set by the user, such as "我爱我家" ("I love my home") or "电视电视" ("TV, TV").

In some other optional implementations, the user may also directly press the voice key on the remote controller, and the display device starts the voice assistant service according to the key instruction.

After the voice interaction function of the display device is woken up, the user interacts with the display device by voice, and the sound collector collects the other voice information input by the user. If no search keyword usable for searching media asset content is then obtained from the sound collector, that is, no explicit user intent can be obtained, a candidate media asset request is sent directly to the server. The server looks up candidate media asset information according to the candidate media asset request and feeds it back to the display device. After receiving the candidate media asset information, the display device shows it on the display.

Specifically, how the display device has received voice instructions can be determined from how the sound collector has collected voice information.
In the first scenario, no second voice information further input by the user is received, or no search keyword can be recognized from the second voice information. The process of recognizing user intent from voice information belongs to the related art and is not elaborated in this application.

In the second scenario, second voice information further input by the user is received and a search keyword is recognized from it, but the recognized search keyword cannot be used to search for media asset content. For example, the recognized search keyword is not a preset keyword, that is, it is not a keyword indicating the business scope of the display device.

Through the media asset retrieval process of the above embodiments, even if no explicit user intent can be obtained, or the recognized user intent is not within the business scope of the display device, the server can still feed back corresponding media asset information according to the scenario the display device is in, and the corresponding media asset information is shown on the display, so that the situation of no reply is avoided.

Illustratively, the first scenario may be one in which, after the user wakes up the voice assistant, nothing is input for a period of time. For example, if the user says the wake-up word "你好，小度" ("Hello, Xiaodu") and inputs nothing further, no search keyword for searching media asset content can be recognized from the wake-up word. In this case it can be determined that the current scenario of the display device is the first scenario; the display device sends the server a media asset request carrying first-scenario information, and the server looks up the corresponding first media asset information according to the first-scenario information and feeds it back.

The second scenario may be one in which, after waking up the voice assistant, the user inputs further voice information from which a search keyword can be recognized, but the keyword is not within the business scope of the display device. For example, after waking up the voice assistant, the user says "play XX game video". Although the search keyword "XX game video" can be recognized from the voice information, it is not a preset keyword; that is, XX game video is beyond the business scope of the display device.
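One way to tell the two scenarios apart is a small classifier over whatever keywords the recognizer produced. The keyword set and scene labels below are illustrative assumptions, not values from the patent:

```python
# Keywords within the display device's business scope (illustrative).
PRESET_KEYWORDS = {"film", "tv_series", "music", "weather"}

def classify_scene(recognized_keywords):
    """Scene 1: nothing further was said, or nothing recognizable as a
    search keyword. Scene 2: a keyword was recognized but lies outside the
    business scope. Otherwise a normal media asset search can proceed."""
    if not recognized_keywords:
        return "scene_1"
    if not set(recognized_keywords) & PRESET_KEYWORDS:
        return "scene_2"
    return "search"

# classify_scene([]) == "scene_1"
# classify_scene(["XX_game_video"]) == "scene_2"
# classify_scene(["film"]) == "search"
```

The returned label would then select which card pool the server consults, as described further below in the text.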
In some embodiments, when no search keyword usable for searching media asset content is obtained from the sound collector, the specific process of receiving from the server the candidate media asset information looked up according to the candidate media asset request may be:

Judge whether voiceprint information can be determined from the first voice information; if so, send the voiceprint information to the server. The server determines a user profile according to the voiceprint information, and then looks up candidate media asset information according to the user profile. The voiceprint information may include a voiceprint ID and voiceprint attributes. If both a voiceprint ID and voiceprint attributes can be determined from the first voice information, the user profile is determined according to the voiceprint ID, since each user has a unique voiceprint ID.

If only a voiceprint ID can be determined from the first voice information, the voiceprint ID is sent to the server. The server determines the user profile uniquely corresponding to the voiceprint ID, and then looks up candidate media asset information according to the determined user profile.

It should be noted that the display device may be a family TV, in which case the display device has the voiceprint IDs of the family members stored according to the voice access history. For example, the server stores the voiceprint IDs of the grandfather, grandmother, father, and mother. When the grandfather uses the display device and inputs voice information, the display device first sends its device ID to the server, and the server looks up the voiceprint IDs corresponding to that device according to the device ID.

Since the grandfather's voiceprint ID is stored in advance, it can be determined from the voiceprint features that the grandfather's voiceprint ID can be recognized in the input voice information. The corresponding user profile is then determined according to the grandfather's voiceprint ID, and candidate media asset information is looked up according to the user profile. In this way, the media asset information determined through the user profile is relevant to the current user. If a guest uses the display device and inputs voice information, the display device first sends its device ID to the server; since the guest's voiceprint ID has not been stored in advance, the server cannot determine a voiceprint ID from the voice information.

In some embodiments, if a voiceprint ID cannot be determined from the voice information but a voiceprint attribute can, the voiceprint attribute is sent to the server. The server determines the corresponding user profile according to the voiceprint attribute, and looks up candidate media asset information according to the user profile. A voiceprint attribute here may be a user feature of a class of users; user features may include physiological features such as the user's gender and age.

For example, if the voiceprint attribute determined from the voice information is "middle-aged male", the determined user profile corresponds to a middle-aged male, and the media asset information looked up according to that profile may relate to finance, automobiles, and the like. If the voiceprint attribute determined from the voice information is "child", the determined user profile corresponds to a child, and the media asset information looked up may relate to cartoons.
In some embodiments, the recognition history of the display device may also be tallied by voiceprint attribute. That is, all the voiceprint attributes recognized by the display device are counted; if a certain voiceprint attribute's share of the recognition history exceeds a preset threshold, that voiceprint attribute is sent to the server. A share of the recognition history exceeding the preset threshold indicates that this class of users uses the display device most often.

For example, if the voiceprint attribute "child" accounts for more than 80% of the recognition history, child users use the display device most often. The voiceprint attribute "child" is sent to the server, so that the server feeds back media asset information corresponding to a child user profile.
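The tallying step can be sketched with a frequency count over the recognition history; the attribute labels are illustrative, and the 80% threshold mirrors the example above:

```python
from collections import Counter

def dominant_attribute(recognition_history, threshold=0.8):
    """Tally the voiceprint attributes the device has recognized; if one
    attribute's share of the history exceeds the threshold, it is the one
    to report to the server, otherwise nothing is reported."""
    if not recognition_history:
        return None
    attr, count = Counter(recognition_history).most_common(1)[0]
    return attr if count / len(recognition_history) > threshold else None

history = ["child"] * 9 + ["middle_aged_male"]
# dominant_attribute(history) == "child"   (share 0.9 > 0.8)
```

A balanced history (no attribute above the threshold) yields `None`, in which case the device would fall back to the per-utterance voiceprint handling described above.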
In some embodiments, if neither a voiceprint ID nor a voiceprint attribute can be determined from the first voice information, the voiceprint ID or voiceprint attribute is determined according to the voice information previously input by the user. It should be noted that the interval between the moment of the user's previous voice input and the moment the voice assistant is currently woken up must not exceed a preset time; for example, the current wake-up is no more than 30 seconds after the previous voice input.

In this way, it can be roughly determined that the user waking up the voice assistant this time is the same person who woke it up last time. When media assets are recommended according to the voiceprint ID determined from the previously input voice information, the user's habits, preferences, age, and other factors are taken into account, so the recommended content is more likely to prompt the user to interact further.
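The fallback chain of the last few paragraphs (voiceprint ID, then voiceprint attribute, then the previous utterance if recent enough) can be sketched as follows. The profile tables, attribute labels, and the `Utterance` type are illustrative assumptions; only the 30-second window comes from the text:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Utterance:
    voiceprint_id: Optional[str]
    attribute: Optional[str]
    timestamp: float  # seconds

PROFILES_BY_ID = {"vp_grandpa": "grandpa_profile"}          # illustrative
PROFILES_BY_ATTR = {"middle_aged_male": "finance_cars_profile",
                    "child": "cartoon_profile"}             # illustrative

def resolve_profile(current, previous=None, now=0.0, window=30.0):
    """Voiceprint ID -> voiceprint attribute -> the previous utterance,
    provided the previous input falls within the preset window (e.g. 30 s)."""
    if current.voiceprint_id in PROFILES_BY_ID:
        return PROFILES_BY_ID[current.voiceprint_id]
    if current.attribute in PROFILES_BY_ATTR:
        return PROFILES_BY_ATTR[current.attribute]
    if previous is not None and now - previous.timestamp <= window:
        return resolve_profile(previous, None, now, window)
    return None
```

For instance, a wake-up with no usable voiceprint at `now=25.0` resolves through a previous utterance made at `t=0.0`, because 25 s is within the window; at `now=100.0` it would not.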
In some embodiments, the user-profile storage structure includes at least two preference domains, and each preference domain in turn includes at least two query dimensions. Each preference domain is assigned a preference-domain weight, and each query dimension a query-dimension weight. Different user profiles store different preference domains and query dimensions. For example, a user profile includes the preference domains "film", "music", "recipe", "variety show", and so on. The preference domain "film" includes query dimensions such as "war film" and "action film"; "music" includes query dimensions such as "pop" and "light music"; "recipe" includes query dimensions such as "Cantonese cuisine" and "Sichuan cuisine"; and "variety show" includes query dimensions such as "reality show" and "dating show".

The preference domains in the above example all have preference-domain weights, which may be set according to the user profile, for example according to how many times the user has watched. The query dimensions likewise all have query-dimension weights, which may also be set according to the user profile. First, based on the preference-domain weights, a weighted random algorithm can be used to compute the top-ranked domains; for example, the top three preference domains obtained are "film", "music", and "recipe".

The media asset library in the embodiments of this application contains at least two media asset cards, and the media asset cards correspond to preference domains. For example, the library contains media asset cards such as "film", "music", and "recipe". In the media asset library, each media asset card is also assigned a weight. After the top three preference domains are computed from the preference-domain weights, the final card is selected according to the weights of the media asset cards, and a weighted random algorithm can likewise be used here. For example, the selected final card is "music", that is, the finally determined preference domain is "music".

After the final preference domain "music" is determined, a weighted random algorithm is applied to the query-dimension weights to determine the final query dimension; for example, the final query dimension is "pop". Finally, the music query service within the video query service is consulted to perform a media asset query based on the media asset card "music" and the query dimension "pop". The media asset information for the card "music" and the query dimension "pop" can then be fed back to the user at random, for example media asset information about popular songs sung by Xu Wei.
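The patent names a "weighted random algorithm" without specifying one; a common realization is weighted sampling without replacement, sketched below. The weights and domain names are illustrative, and the two-stage flow (top-3 domains by domain weight, then one card by card weight) follows the paragraphs above:

```python
import random

def weighted_sample(weights, k=1, rng=random):
    """Draw up to k distinct items, each draw proportional to its weight."""
    items = dict(weights)
    picked = []
    for _ in range(min(k, len(items))):
        total = sum(items.values())
        r = rng.uniform(0, total)
        acc = 0.0
        for name, w in items.items():
            acc += w
            if r <= acc:
                picked.append(name)
                del items[name]  # without replacement
                break
    return picked

domain_weights = {"film": 5.0, "music": 4.0, "recipe": 3.0, "variety": 1.0}
top3 = weighted_sample(domain_weights, k=3)         # stage 1: top-3 domains
card_weights = {d: 1.0 for d in top3}               # card weights (illustrative)
final_card = weighted_sample(card_weights, k=1)[0]  # stage 2: e.g. "music"
```

The same `weighted_sample` call can then be reused over the query-dimension weights of `final_card` to pick the final query dimension (e.g. "pop").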
In some embodiments, the server stores different media asset libraries, i.e. card pools, for different scenarios of the display device. The first scenario is one in which no second voice information is input, or no search keyword can be recognized from the second voice information; for example, it may be a scenario in which nothing is input for a period of time after the user wakes up the voice assistant. For this scenario, the server stores the card pool shown in Table 1.

Table 1. Card pool for the first scenario

  No.  Card name         Card type
  1    Education         edu
  2    Radio             fm
  3    Games             game
  4    Apps              app
  5    Music             client_music
  6    Help information  client_helpinfo
  7    TV series         tvplay
  8    Films             film

For the first scenario, the card pool stored by the server consists mostly of guesses at cards the user may like.
The second scenario is one in which a search keyword can be recognized from the voice information input by the user, but the search keyword cannot be used to search for media asset content; that is, the user intent is beyond the business scope of the display device. For this scenario, the server stores the card pool shown in Table 2.

Table 2. Card pool for the second scenario

  No.  Card name         Card type
  1    Apps              app
  2    News              client_news
  3    Music             client_music
  4    Help information  client_helpinfo
  8    TV series         tvplay

For the second scenario, the card pool stored by the server consists mostly of cards that guide the user in using the voice assistant.
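Tables 1 and 2 can be read as a scenario-keyed mapping on the server. A minimal sketch, with the card types taken from the two tables and the scene labels and function name as illustrative assumptions:

```python
CARD_POOLS = {
    "scene_1": ["edu", "fm", "game", "app", "client_music",
                "client_helpinfo", "tvplay", "film"],      # Table 1
    "scene_2": ["app", "client_news", "client_music",
                "client_helpinfo", "tvplay"],              # Table 2
}

def candidate_cards(scene):
    """Return the card pool for the detected scenario (empty if unknown)."""
    return CARD_POOLS.get(scene, [])

# candidate_cards("scene_2")[0] == "app"
```

Custom scenarios, such as those described below, would simply add further keys to the mapping.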
The embodiment shown in FIG. 4 is the scenario in which nothing is input for a period of time after the user wakes up the voice assistant. Through the above steps, the display device can obtain three media asset cards from the server, all of which are used to guide the user to make voice input. The first card in FIG. 4 guides the user to input voice information such as "play some nice music", "today's hot news", or "today's weather".

Besides the first and second scenarios of the above embodiments, this application may also set up specific card pools for other scenarios, which may be scenarios customized on the system side.

For example, when the voice message "Good morning" is received, it can be determined that the current scenario of the display device is a morning-greeting scenario. A voiceprint ID or voiceprint attribute is then recognized from the voice information, and a media asset card for the morning-greeting scenario is obtained from the server according to the voiceprint ID or voiceprint attribute.

When the voice message "I'm home" is received, it can be determined that the current scenario of the display device is a coming-home scenario, and a media asset card for the coming-home scenario is obtained from the server according to the voiceprint ID or voiceprint attribute.

When the user interface of the display device has stayed on an app's operation interface for a long time without receiving a user operation instruction, upon detecting this scenario, a media asset card guiding operation of the app interface can be obtained from the server.

When the display device encounters a failure while calling a system service, upon detecting this scenario, a media asset card guiding the user on how to clear the fault can be obtained from the server.

When the received voice information is a complaint, for example the voice input "I'm so tired today", upon detecting this scenario, media asset cards related to soothing music and comedy films can be obtained from the server.

In some embodiments, while the media asset cards obtained from the server are displayed, different prompts can also be provided according to the specific scenario. For example, greetings such as "Good morning" or "Good evening" are shown on the user interface according to the time of day; or, in the coming-home scenario, the greeting "Welcome home" is shown on the user interface.
An embodiment of this application provides a content recommendation method, shown in the signaling diagram of FIG. 5, which includes the following steps:

Step 501: receive an instruction input by the user for waking up the voice interaction function, and drive the sound collector to start according to the instruction, where the instruction is input as first voice information or by a key press.

Step 502: after the sound collector is started, if no search keyword usable for searching media asset content is obtained, send the voiceprint information related to the first voice information to the server.

Step 503: after receiving the candidate media asset information fed back by the server, the display device shows the candidate media asset information on the display, where the candidate media asset data is determined by the server according to the voiceprint information related to the first voice information.

Based on the above method embodiment, an embodiment of this application provides a further content recommendation method, as shown in FIG. 6, which includes the following steps:
步骤601、接收用于启动语音收听功能的指令,驱动声音采集器启动,其中所述用于启动语音收听功能的指令可以是通过。Step 601: Receive an instruction for activating the voice listening function, and drive the sound collector to start, wherein the instruction for activating the voice listening function may be pass.
步骤602、启动声音采集器之后,但是未从声音采集器获取到可用于搜索媒资内容的搜索关键词时,从第一语音信息中提取声纹信息。Step 602, after starting the voice collector, but when no search keywords that can be used to search for media content are obtained from the voice collector, voiceprint information is extracted from the first voice information.
步骤603、则向服务器发送备选媒资请求,备选媒资请求携带有声纹信息。In step 603, send a candidate media resource request to the server, and the candidate media resource request carries voiceprint information.
服务器根据声纹信息确定对应的用户画像。根据用户画像在服务器的媒资库中查找对应的备选媒资信息。服务器将备选媒资信息反馈至显示设备。显示设备接收到反馈的备选媒资信息后,在显示器上显示备选媒资信息。The server determines the corresponding user portrait according to the voiceprint information. According to the user portrait, the corresponding candidate media asset information is searched in the server's media asset database. The server feeds back the candidate media resource information to the display device. After the display device receives the fed back candidate media resource information, it displays the candidate media resource information on the display.
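The server-side lookup described above (voiceprint → user portrait → media asset library) might look like the following toy sketch. The data layout and the genre-based matching rule are illustrative assumptions, not details taken from the disclosure.

```python
# Hypothetical server-side sketch: voiceprint -> user portrait -> candidates.
USER_PORTRAITS = {
    "vp-001": {"age_group": "child", "favorite_genre": "animation"},
    "vp-002": {"age_group": "adult", "favorite_genre": "documentary"},
}

MEDIA_LIBRARY = [
    {"title": "Ocean Life", "genre": "documentary"},
    {"title": "Robot Friends", "genre": "animation"},
    {"title": "City News", "genre": "news"},
]

def find_candidates(voiceprint_id):
    """Resolve the portrait for a voiceprint, then filter the media library."""
    portrait = USER_PORTRAITS.get(voiceprint_id)
    if portrait is None:
        return MEDIA_LIBRARY        # unknown user: fall back to the whole library
    return [m for m in MEDIA_LIBRARY
            if m["genre"] == portrait["favorite_genre"]]

print([m["title"] for m in find_candidates("vp-001")])  # -> ['Robot Friends']
```

An unknown voiceprint falls back to an unfiltered list here; the patent leaves the behavior for unrecognized users unspecified.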
In general, the computer instructions for implementing the method of the present invention may be carried by any combination of one or more computer-readable storage media. A non-transitory computer-readable storage medium may include any computer-readable medium except the transitory propagating signal itself.
A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages; in particular, the Python language, which is well suited to neural network computing, and platform frameworks such as TensorFlow and PyTorch may be used. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Although the embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.
Obviously, those skilled in the art can make various changes and modifications to the present application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to encompass them.

Claims (8)

  1. A user intention analysis method, comprising:
    obtaining voice text input by a user and performing semantic analysis on the voice text to generate at least two syntax trees, wherein each syntax tree has a probability value and a user intention, the probability value being the probability with which the system outputs the syntax tree;
    when the probability values of the syntax trees are all equal, determining the syntax tree whose user intention matches the device state information of the current device as the optimal syntax tree, and determining the user intention of the optimal syntax tree as the optimal user intention;
    when the probability values of the syntax trees are not all equal, and the user intention of the syntax tree with the largest probability value matches the device state information of the current device, determining the syntax tree with the largest probability value as the optimal syntax tree, and determining the user intention of the optimal syntax tree as the optimal user intention.
  2. The user intention analysis method according to claim 1, further comprising:
    when the probability values of the syntax trees are all equal and there are multiple syntax trees whose user intentions match the device state information of the current device, determining the syntax tree whose corresponding media asset resource has the highest search popularity as the optimal syntax tree, and determining the user intention of the optimal syntax tree as the optimal user intention.
  3. The user intention analysis method according to claim 1, further comprising:
    when the probability values of the syntax trees are all equal and there are multiple syntax trees whose user intentions match the device state information of the current device, presenting to the user all the media asset resources corresponding to the user intentions of the syntax trees that match the device state information of the current device.
  4. The user intention analysis method according to claim 1, further comprising: sorting the syntax trees by probability value in descending order; when the probability values of the syntax trees are not all equal, the user intention of the syntax tree with the largest probability value does not match the device state information of the current device, and the deviation between the probability value of the syntax tree ranked second and the probability value of the syntax tree with the largest probability value is smaller than a deviation threshold, determining the syntax tree ranked second as the optimal syntax tree, and determining the user intention of the optimal syntax tree as the optimal user intention.
  5. The user intention analysis method according to claim 4, further comprising: when the user intention of the syntax tree ranked second by probability value matches the device state information of the current device, determining the syntax tree ranked second as the optimal syntax tree, and determining the user intention of the optimal syntax tree as the optimal user intention;
    when the user intention of the syntax tree ranked second by probability value does not match the device state information of the current device, not determining the syntax tree ranked second as the optimal syntax tree.
  6. The user intention analysis method according to claim 1, further comprising: sorting the syntax trees by probability value in descending order; when the probability values of the syntax trees are not all equal, the user intention corresponding to the syntax tree with the largest probability value does not match the device state information of the current device, and the deviation between the probability value of the syntax tree ranked second and the probability value of the syntax tree with the largest probability value is greater than the deviation threshold, determining the syntax tree with the largest probability value as the optimal syntax tree, and determining the user intention of the optimal syntax tree as the optimal user intention.
  7. The user intention analysis method according to claim 6, further comprising: presenting a prompt to the user, the prompt being used to inform the user that the current device cannot perform the operation corresponding to the optimal user intention.
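Taken together, claims 1–7 describe a single selection procedure over candidate syntax trees. A minimal sketch, assuming each tree reduces to a (probability, intention) pair and that the device-state matching predicate, the deviation threshold, and the search-popularity function are supplied by the caller (none of these interfaces are specified in the claims):

```python
def select_optimal(trees, matches_device, deviation_threshold, search_popularity=None):
    """Pick the optimal syntax tree per claims 1-7 (simplified, hypothetical sketch).

    trees: list of (probability, intention) pairs.
    matches_device: intention -> bool, device-state match test.
    search_popularity: intention -> number, tie-breaker of claim 2.
    Returns (intention, executable); executable=False means the device should
    show the claim-7 prompt that it cannot perform the operation.
    """
    probs = [p for p, _ in trees]
    if len(set(probs)) == 1:
        # Claim 1, equal case: pick the tree matching the device state.
        matching = [i for _, i in trees if matches_device(i)]
        if len(matching) > 1 and search_popularity:
            # Claim 2: several match -> highest media search popularity wins.
            return max(matching, key=search_popularity), True
        return (matching[0], True) if matching else (None, False)

    ranked = sorted(trees, key=lambda t: t[0], reverse=True)
    (p1, best), (p2, second) = ranked[0], ranked[1]
    if matches_device(best):
        return best, True                  # claim 1, unequal case
    if p1 - p2 < deviation_threshold and matches_device(second):
        return second, True                # claims 4 and 5
    if p1 - p2 > deviation_threshold:
        return best, False                 # claims 6 and 7: keep top tree, prompt user
    return None, False
```

Claim 5's negative branch (the second-ranked tree does not match the device state) leaves the outcome unspecified; the sketch returns `(None, False)` there as an assumption.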
  8. A content recommendation method, applied to a display device, comprising:
    receiving an instruction input by a user for waking up a voice interaction function, and driving a sound collector to start according to the instruction, wherein the instruction is input as first voice information or via a button press;
    when no search keyword usable for searching media asset content is obtained from the sound collector, sending a candidate media asset request to a server;
    receiving, from the server, candidate media asset information found according to the candidate media asset request, and displaying the candidate media asset information on the display.
PCT/CN2022/102456 2021-07-29 2022-06-29 Display device WO2023005580A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202280047134.1A CN117651943A (en) 2021-07-29 2022-06-29 Display apparatus

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110865048.9 2021-07-29
CN202110865048.9A CN113593559A (en) 2021-07-29 2021-07-29 Content display method, display equipment and server
CN202110934690.8 2021-08-16
CN202110934690.8A CN114281952A (en) 2021-08-16 2021-08-16 User intention analysis method and device

Publications (1)

Publication Number Publication Date
WO2023005580A1 true WO2023005580A1 (en) 2023-02-02

Family

ID=85087508

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/102456 WO2023005580A1 (en) 2021-07-29 2022-06-29 Display device

Country Status (2)

Country Link
CN (1) CN117651943A (en)
WO (1) WO2023005580A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012181676A (en) * 2011-03-01 2012-09-20 Nippon Telegr & Teleph Corp <Ntt> Base tree acquisition device, syntax analysis device, method, and program
CN107316643A (en) * 2017-07-04 2017-11-03 科大讯飞股份有限公司 Voice interactive method and device
US20180365228A1 (en) * 2017-06-15 2018-12-20 Oracle International Corporation Tree kernel learning for text classification into classes of intent
CN112396444A (en) * 2019-08-15 2021-02-23 阿里巴巴集团控股有限公司 Intelligent robot response method and device
CN112732951A (en) * 2020-12-30 2021-04-30 青岛海信智慧生活科技股份有限公司 Man-machine interaction method and device
CN113593559A (en) * 2021-07-29 2021-11-02 海信视像科技股份有限公司 Content display method, display equipment and server
CN114281952A (en) * 2021-08-16 2022-04-05 海信视像科技股份有限公司 User intention analysis method and device


Also Published As

Publication number Publication date
CN117651943A (en) 2024-03-05


Legal Events

121 (EP): The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 22848185; country of ref document: EP; kind code of ref document: A1.
NENP: Non-entry into the national phase. Ref country code: DE.