CN110634477A - Context judgment method, device and system based on scene perception - Google Patents

Context judgment method, device and system based on scene perception

Info

Publication number
CN110634477A
CN110634477A
Authority
CN
China
Prior art keywords
voice instruction
voice
scene
context
instruction
Prior art date
Legal status
Granted
Application number
CN201810646326.XA
Other languages
Chinese (zh)
Other versions
CN110634477B (en)
Inventor
任晓楠
李霞
崔保磊
Current Assignee
Hisense Group Co Ltd
Original Assignee
Hisense Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Hisense Group Co Ltd
Priority to CN201810646326.XA
Publication of CN110634477A
Application granted
Publication of CN110634477B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 - End-user applications
    • H04N 21/472 - End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of the invention provide a context judgment method, device and system based on scene perception. The context relationship of the voice instructions and the change of the application scene are used together as the basis for selecting the voice-instruction parsing mode, which better overcomes the limitations of contextual interaction and improves the accuracy of voice-instruction parsing when the application scene changes.

Description

Context judgment method, device and system based on scene perception
Technical Field
The present disclosure relates to the field of smart television technologies, and in particular, to a context determination method, device, and system based on scene awareness.
Background
With the rapid development of artificial intelligence, television services are being combined ever more deeply with intelligent voice and semantic understanding, and users can search for services more conveniently and quickly through voice instructions.
In a multi-round interaction scenario, consecutively input voice instructions are usually related, and the semantics of a later instruction can be inferred from the earlier instruction and the context. Existing contextual semantic understanding is mainly based on semantic rules: the instruction most recently input by the user is used to predict the instruction likely to follow. If two successively received instructions satisfy a preset grammar rule, the later instruction can be parsed through the context; if they do not, the two instructions are considered unrelated, and the meaning of the later instruction must be parsed on its own. For example, if a user inputs the voice instruction "weather in Qingdao" and then inputs "what about tomorrow", the two instructions are related, and the actual meaning of the second instruction, namely "Qingdao's weather tomorrow", is easily resolved through the context.
The interaction between the user and the television terminal is unpredictable. In most scenarios the voice search function lets the user say what they want directly, but in some scenarios the result executed by the television terminal does not match the intent of the user's instruction.
Disclosure of Invention
The embodiments of the invention provide a context judgment method, device and system based on scene perception, aiming to solve the prior-art problem that the accuracy of context perception and context prediction easily degrades when only the contextual information of the user's voice input is considered.
In a first aspect, the present invention provides a speech recognition method based on scene awareness, which is characterized by comprising:
analyzing a first voice instruction sent by an equipment end, and returning first semantic information obtained by analysis to the equipment end so that the equipment end can execute the first voice instruction according to the first semantic information;
detecting whether a second voice instruction sent by a device end has a context relationship with the first voice instruction or not, wherein the receiving time of the second voice instruction is later than that of the first voice instruction;
when the second voice instruction and the first voice instruction have a context relationship, detecting whether the first voice instruction and the second voice instruction are in the same application scene;
if the first voice instruction and the second voice instruction are in the same application scene, analyzing the second voice instruction through context, and returning second semantic information obtained through analysis to the equipment end;
and if the first voice instruction and the second voice instruction are in different application scenes, independently analyzing the second voice instruction, and returning third semantic information obtained by analysis to the equipment end so that the equipment end can execute the second voice instruction according to the second semantic information or the third semantic information.
In a second aspect, the present invention provides a context determination apparatus based on scene awareness, including:
the first analysis module is used for analyzing a first voice instruction sent by the equipment end and returning first semantic information obtained by analysis to the equipment end so that the equipment end can execute the first voice instruction according to the first semantic information;
the first detection module is used for detecting whether a second voice instruction sent by a device end has a context relationship with the first voice instruction or not, and the receiving time of the second voice instruction is later than that of the first voice instruction;
the second detection module is used for detecting whether the first voice instruction and the second voice instruction are in the same application scene or not when the second voice instruction and the first voice instruction have a context relationship;
the second analysis module is used for analyzing a second voice instruction through a context relationship when the first voice instruction and the second voice instruction are in the same application scene, and returning second semantic information obtained through analysis to the equipment terminal;
and the third analysis module is used for independently analyzing the second voice instruction when the first voice instruction and the second voice instruction are in different application scenes, and returning third semantic information obtained by analysis to the equipment terminal so that the equipment terminal can execute the second voice instruction according to the second semantic information or the third semantic information.
In a third aspect, the invention provides a context judgment system based on scene awareness, comprising a television terminal and a cloud server, wherein,
the cloud server comprises the context judgment device based on scene perception, and is used for analyzing semantic information of the voice instruction according to the voice instruction and the scene information sent by the television terminal, so that the television terminal can execute the voice instruction according to the semantic information.
The beneficial effects of this application are as follows:
The embodiments of the invention provide a context judgment method, device and system based on scene perception. The context relationship of the voice instructions and the change of the application scene are used together as the basis for selecting the voice-instruction parsing mode, which better overcomes the limitations of contextual interaction and improves the accuracy of voice-instruction parsing when the application scene changes.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
To illustrate the embodiments of the present invention or the prior-art technical solutions more clearly, the drawings needed to describe them are briefly introduced below; a person skilled in the art can derive other drawings from these drawings without inventive effort.
Fig. 1 is a flowchart of a method for context determination based on scene awareness according to an embodiment of the present disclosure;
fig. 2 is a flowchart of a method of step S210 according to an embodiment of the present disclosure;
fig. 3 is a flowchart of a method of step S310 according to an embodiment of the present disclosure;
fig. 4 is a flowchart of a method of step S110 according to an embodiment of the present disclosure;
fig. 5 is a flowchart of a method of step S214 according to an embodiment of the present disclosure;
fig. 6 is a flowchart of another context determination method based on scene awareness according to an embodiment of the present application;
Fig. 7-1 is a diagram illustrating the interaction result of a conventional context determination method;
Fig. 7-2 is a diagram of the interaction result of a context determination method based on scene awareness according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a context determining apparatus based on scene awareness according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a context determining apparatus based on scene sensing according to an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of a context determination system based on scene awareness according to an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiments of the present invention is described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
The interaction between the user and the television terminal is unpredictable. In most scenarios the voice search function lets the user say what they want directly, but in some scenarios voice operation is slightly cumbersome. For example, a user can close or exit an interface by simply pressing the return key, whereas doing the same by voice requires pressing the voice key and then speaking an "exit" or "return" instruction; interaction through the physical key is quicker and more convenient, so users switch between operation modes according to personal preference during voice interaction. In such scenarios, if context perception and prediction consider only the contextual information of the voice input and ignore the other interactive operations, the accuracy of context perception and context prediction decreases, and the result executed by the television terminal ends up inconsistent with the intent of the user's instruction.
To address the prior-art problem that the accuracy of context perception and context prediction degrades when only the contextual information of the user's voice input is considered, this application provides a context judgment method, device and system based on scene perception. The core idea is as follows: after determining that a context relationship exists between a later-received second voice instruction and an earlier-received first voice instruction, detect whether the two instructions are in the same application scene. If the application scene has not changed, the context relationship between the two voice instructions is valid, and the cloud server parses the second voice instruction using that context. If the application scene has changed, the context relationship is invalid, and the cloud server parses the second voice instruction independently, without combining it with the first voice instruction. Using the context relationship of the voice instructions and the change of the application scene together as the basis for selecting the parsing mode better overcomes the limitations of contextual interaction and improves parsing accuracy when the application scene changes. The embodiments provided in this application are described in detail below with reference to the accompanying drawings.
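Before turning to the drawings, the following is a minimal sketch of this core decision logic. It is illustrative only: the function name, the mode strings and the example inputs are assumptions made for this sketch, not identifiers or data taken from the patent.

```python
def choose_parsing_mode(has_context: bool, same_scene: bool) -> str:
    """Select how the cloud server parses the later (second) voice instruction."""
    if not has_context:
        # No context relationship at all: parse the second instruction on its own.
        return "parse independently"
    if same_scene:
        # A context relationship exists and the application scene is unchanged,
        # so the context is still valid and can be reused.
        return "parse with context"
    # A context relationship exists but the application scene changed,
    # so the context is treated as invalid.
    return "parse independently"


if __name__ == "__main__":
    # "weather in Qingdao" followed by "what about tomorrow" in the same weather scene
    print(choose_parsing_mode(has_context=True, same_scene=True))   # parse with context
    # "songs by Liu Dehua" followed by "Zhang Xueyou" after the music interface was closed
    print(choose_parsing_mode(has_context=True, same_scene=False))  # parse independently
```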
Referring to Fig. 1, a flowchart of a context determination method based on scene awareness according to an embodiment of the present application is shown. The method is applied to the cloud server side, as opposed to the device side. The cloud server has storage and computing capabilities: it can store the voice instruction information and application scene information reported by the device side, as well as scene database information and the like, and it can perform semantic parsing of instructions based on the received voice instruction information and application scene information. The voice instruction information and application scene information are generally text information generated by the television device. As shown in Fig. 1, the method includes the following steps:
step S110: and analyzing the first voice command sent by the equipment terminal, and returning the analyzed first semantic information to the equipment terminal so that the equipment terminal executes the first voice command according to the first semantic information.
In a real human-machine interaction scenario, a user and a television terminal usually engage in multiple rounds of interaction, during which the user issues several voice instructions. These instructions are often related to one another, that is, they have a context relationship, and the semantics of a later instruction can be inferred from an earlier one. In this embodiment, the first voice instruction is the first instruction of a multi-round interaction.
Step S210: Detect whether a second voice instruction sent by the device side has a context relationship with the first voice instruction, the second voice instruction being received later than the first voice instruction; if so, execute step S310.
Referring to fig. 2, a flowchart of a method of step S210 according to an embodiment of the present disclosure is shown. As can be seen from fig. 2, step S210 includes the following steps:
step S211: and receiving the first application scene information and the second application scene information sent by the equipment terminal.
The first application scene information includes the application program name, display interface name and interactive-client interface name running on the device side when the first voice instruction is received; the second application scene information includes the same items at the time the second voice instruction is received. In this embodiment, the device side sends the first voice instruction together with the first application scene information and the second voice instruction together with the second application scene information, and the cloud server stores the received scene information in memory for the subsequent scene-switching and context judgments. Of course, in other embodiments of the present application, the device side may instead send the first and second application scene information to the cloud server at the same time.
Step S212: Detect whether a preset scene database contains both the first application scene information and the second application scene information.
Some application scenes cannot support parsing the semantics of a voice instruction through context, so in this embodiment a preliminary judgment on whether context parsing is possible is made against a scene database pre-stored on the cloud server. The scene database contains the application scenes in which context parsing is possible; each scene entry records the corresponding scene information, namely the application program name (i.e., the package name), the display interface name, and the interactive-client interface name run by the device side. A specific form of the scene database is shown in Table 1.
Table 1: Storage form of the scene database

Package name (packagename) | Display interface name (classname) | Interface name (scene name)
com.tencent.qqmusic | com.tencent.qqmusic.playview | Song playing scene
com.tencent.qqmusic | com.tencent.qqmusic.listview | Song list scene
com.xiaomi.videochat | com.hisense.videochat.contactview | Video call contact scene
com.xiaomi.videochat | com.hisense.videochat.talkview | Video call scene
... | ... | ...
When the scene database does not contain both the first and second application scene information, step S213 is executed: the application run for at least one of the two voice instructions cannot have its semantics parsed through context, so the two instructions have no context relationship, and the multi-round interaction must be exited so that the second voice instruction is parsed on its own. When the scene database contains both, step S214 is executed: the applications run for the two voice instructions support parsing semantics through context, the two instructions may have a context relationship, and the semantic information of the second voice instruction can be parsed with the aid of a pre-constructed knowledge base.
Step S213: confirming that the second voice instruction does not have a contextual relationship with the first voice instruction.
Step S214: Confirm the context relationship between the second voice instruction and the first voice instruction according to a pre-constructed knowledge base. The knowledge base comprises a plurality of service modules, and each service module comprises semantic slots for a plurality of items of service dimension information.
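The preliminary check of steps S212 to S214 can be pictured with the short sketch below. It is a sketch under assumed data shapes: the dictionary keys mirror Table 1 (package name and display interface name), and the field names package_name and class_name are assumptions for illustration, not fields defined by the patent.

```python
# Scene database entries mirroring Table 1: (package name, display interface name) -> scene name.
SCENE_DATABASE = {
    ("com.tencent.qqmusic", "com.tencent.qqmusic.playview"): "Song playing scene",
    ("com.tencent.qqmusic", "com.tencent.qqmusic.listview"): "Song list scene",
    ("com.xiaomi.videochat", "com.hisense.videochat.contactview"): "Video call contact scene",
    ("com.xiaomi.videochat", "com.hisense.videochat.talkview"): "Video call scene",
}


def scene_supports_context(scene_info: dict) -> bool:
    """True if the reported application scene is one in which context parsing is possible."""
    key = (scene_info.get("package_name"), scene_info.get("class_name"))
    return key in SCENE_DATABASE


def may_have_context_relation(first_scene: dict, second_scene: dict) -> bool:
    # Steps S212/S213: if either scene is missing from the database, the two instructions
    # have no context relationship. Otherwise the relationship is confirmed against the
    # knowledge base (step S214), which is not modelled in this sketch.
    return scene_supports_context(first_scene) and scene_supports_context(second_scene)
```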
Step S310: detecting whether the first voice instruction and the second voice instruction are in the same application scene.
If the first voice instruction and the second voice instruction are in the same application scene, the context relationship between the two instructions is valid; the cloud server continues to use the context relationship to parse the second voice instruction, and step S410 is executed;
if the first voice instruction and the second voice instruction are in different application scenes, the context relationship between the two instructions is invalid; when parsing the second voice instruction, the cloud server does not combine it with the first voice instruction but parses it independently, and step S510 is executed. In this embodiment, the factor causing the application scene change may be a voice instruction or an operation instruction issued by the user through a remote controller.
In this embodiment, the criterion for judging whether the first voice instruction and the second voice instruction are in the same application scene depends on the nature of the application program. If the software running at the relevant time is third-party software (such as QQ, WeChat, or Weibo), whether the application scene has changed can be judged from the application program name (i.e., the package name): when the application names corresponding to the first and second voice instructions are the same (for example, both are com.tencent.qqmusic), the application scene is considered unchanged. If the software running at the relevant time is built-in software (such as the search services for nearby food, hospitals and the like), the scene change is judged from the change of the display interface name. Whether the running software is third-party software can be detected from the application scene information reported by the device side: if the application program name in the scene information is empty, the software is built into the device.
Specifically, please refer to fig. 3, which illustrates a flowchart of a method of step S310 according to an embodiment of the present application. As can be seen from fig. 3, step S310 includes the following steps:
step S311: detecting whether the application program operated by the equipment end is third-party software or not when the equipment end receives the second voice instruction; if the device end receives the second voice instruction, the application program run by the device end is third-party software, step S312 is executed, and if the device end receives the second voice instruction, the application program run by the device end is not third-party software, step S315 is executed.
Step S312: detecting whether the application program name in the first application scene information is the same as the application program name in the second application scene information; if the two signals are the same, step S313 is executed, and if the two signals are not the same, step S314 is executed.
Step S313: confirming that the first voice instruction and the second voice instruction are in the same application scene;
step S314: confirming that the first voice instruction and the second voice instruction are in different application scenes;
step S315: detecting whether the display interface name in the first application scene information is the same as the display interface name in the second application scene information; if the two signals are the same, step S313 is executed, and if the two signals are not the same, step S314 is executed.
Step S410: Parse the second voice instruction through the context relationship, and return the parsed second semantic information to the device side.
Step S510: Parse the second voice instruction independently, and return the parsed third semantic information to the device side, so that the device side executes the second voice instruction according to the second semantic information or the third semantic information.
Referring to fig. 4, a flowchart of a method of step S110 according to an embodiment of the present disclosure is shown. As shown in fig. 4, in other embodiments of the present application, step S110 may further include the following steps:
step S111: extracting a first keyword in the first voice instruction;
step S112: searching a first service module corresponding to the first voice instruction according to the first keyword;
step S113: dividing the service dimension of the first keyword according to the semantic slot corresponding to the first service module;
step S114: and analyzing first semantic information of the first voice command according to the service dimension of the first keyword.
Referring to fig. 5, a flowchart of a method of step S214 according to an embodiment of the present disclosure is shown. As shown in fig. 5, in other embodiments of the present application, step S214 may further include the following steps:
step S2141: extracting a second keyword in the second voice instruction;
step S2142: and when the second keyword and the first keyword have the same service dimension, determining whether the second voice instruction and the first voice instruction have a context relationship.
The knowledge base contains a number of service modules, such as video, music, television control, applications, shopping, ticketing, food, and stocks. Service dimension information is abstracted according to the domain characteristics of each service, and the data is stored in the knowledge base organized by these service dimensions; the knowledge base may be stored as a database, a knowledge graph, or the like. For example, the query dimensions that the movie service can support include the movie name, genre, year, region, language, awards, actors, director, and other service dimension information.
The construction of the knowledge base is briefly described below using the weather service as an example. The query dimension information of the weather service includes time, location, weather phenomenon, air quality, and weather index. The time dimension may be relative time (e.g., today, tomorrow, next week) or absolute time (e.g., March 8, Spring Festival); the location dimension may contain specific place information, such as Qingdao, Rizhao, or Hengshui; the weather phenomenon dimension may include sunny, cloudy, rain, temperature, and so on; the air quality dimension may include air quality, PM2.5, and the like; and the weather index dimension may include a dressing index, an exercise index, a car-wash index, and the like.
In this embodiment, semantic slots serve as the knowledge representation of the service query dimension information, and the natural language understanding process is the process of parsing user input into the predefined semantic slots. Taking the weather service as an example, the dimension information in the semantic slots may include the service classification, service target attribute, time, location, weather keyword, weather phenomenon word, air quality, and weather index; parsing with the context is essentially the process of parsing the user input into these predefined semantic slots. The context resolution process is described below with an example in which the user first inputs "weather in Qingdao" and then inputs "what about Beijing"; the specific analysis proceeds as follows:
First, for the semantic understanding of "weather in Qingdao", the location slot is filled with "Qingdao".
When the user then inputs "what about Beijing", analyzing this single-round input on its own cannot reveal the user's intent. Combining it with the analysis of the previous instruction locates the weather service; "Beijing" is a query dimension of the weather service and has the same dimension as "Qingdao", so the semantic slot is replaced, that is, the location slot is filled with "Beijing".
The service located at this point is the weather service, and the filling of the slot values is complete. If the user then inputs "tomorrow", the analysis proceeds in the same way, except that a date slot value is now supplemented, which completes the detailed analysis of the context.
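The slot replacement and supplementation just described can be sketched as follows; the slot names and example utterances are assumptions for this illustration.

```python
def apply_follow_up(slots: dict, dimension: str, value: str) -> dict:
    """Replace an existing slot of the same dimension, or supplement a new one."""
    updated = dict(slots)
    updated[dimension] = value
    return updated


slots = {"service": "weather", "location": "Qingdao"}   # "weather in Qingdao"
slots = apply_follow_up(slots, "location", "Beijing")   # "what about Beijing": replace the location slot
slots = apply_follow_up(slots, "time", "tomorrow")      # "tomorrow": supplement the date slot
print(slots)
# {'service': 'weather', 'location': 'Beijing', 'time': 'tomorrow'}
```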
Corresponding to the context judgment method on the cloud server side described above, an embodiment of the present application also provides a context judgment method applied to the device side. In this embodiment, the device side is a television terminal with voice recognition and remote control functions. The television terminal can receive voice instructions and remote control instructions issued by the user and collect the application scene information of the terminal device at different times. Before executing a voice instruction, it sends the received voice instruction together with the corresponding application scene information to the cloud server; the cloud server executes the context judgment method of the above embodiment, parses the received voice instruction into a semantic instruction, and returns the semantic instruction to the television terminal, which then performs the corresponding operation according to it.
Referring to fig. 6, a flowchart of another context determination method based on scene awareness according to an embodiment of the present application is shown. As can be seen from fig. 6, the method comprises the following steps:
step S120: the method comprises the steps of obtaining a first voice instruction sent by a user, and sending the first voice instruction and application scene information corresponding to the first voice instruction to a cloud server side.
Step S220: and executing the first voice instruction according to the first semantic information returned by the cloud server.
Step S320: and acquiring a second voice instruction sent by a user, and sending the second voice instruction and application scene information corresponding to the second voice instruction to a cloud server side.
Step S420: Execute the second voice instruction according to the second semantic information or the third semantic information returned by the cloud server: the cloud server returns the second semantic information when the first and second voice instructions are in the same application scene, and returns the third semantic information when they are in different application scenes. In this technical scheme, the context relationship of the voice instructions and the change of the application scene are used together as the basis for selecting the voice-instruction parsing mode, which better overcomes the limitations of contextual interaction and helps improve the accuracy of voice-instruction parsing when the application scene changes.
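A sketch of the device-side flow of steps S120 to S420 is shown below, assuming a simple JSON-over-HTTP interface to the cloud server. The endpoint URL, payload fields and helper names are hypothetical; the patent does not specify the transport format.

```python
import json
from urllib import request

CLOUD_URL = "https://cloud.example.com/parse"   # hypothetical cloud server endpoint


def request_semantics(voice_text: str, scene_info: dict) -> dict:
    """Send a voice instruction and its application scene information to the cloud server."""
    payload = json.dumps({"instruction": voice_text, "scene": scene_info}).encode("utf-8")
    req = request.Request(CLOUD_URL, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def execute(semantics: dict) -> None:
    print("executing:", semantics)   # placeholder for the television terminal's action


def handle_voice_instruction(voice_text: str, scene_info: dict) -> None:
    semantics = request_semantics(voice_text, scene_info)   # steps S120/S320
    execute(semantics)                                       # steps S220/S420
```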
Please refer to Figs. 7-1 and 7-2, which show the interaction result of a conventional context determination method and the interaction result of the context determination method based on scene perception provided by an embodiment of the present application, respectively. Comparing Figs. 7-1 and 7-2 shows that the human-machine interaction processes in the two figures are identical, but after several rounds of interaction the instruction execution results fed back in the last round differ considerably.
Referring to Fig. 7-1, the interaction process shown in this figure is as follows: 1. The user inputs a first voice instruction: "songs by Liu Dehua". 2. The television terminal opens the music playing interface and plays Liu Dehua's songs. 3. The user closes the Liu Dehua music interface by a manual operation such as the remote control and inputs a second voice instruction: "Zhang Xueyou"; the television terminal's execution result for the second instruction is: play Zhang Xueyou's songs. 4. The user switches from the Liu Dehua song interface to another singer's interface by a manual operation such as the remote control and inputs a third voice instruction: "who sings this song"; the television terminal's execution result for the third instruction is: Liu Dehua. 5. During song playback the user inputs a fourth voice instruction: "recite a Tang poem", and then exits the Tang poem interface. When the user inputs a fifth voice instruction: "change to another one", the television terminal's execution result for the fifth instruction is: change to another Tang poem.
Referring to Fig. 7-2, the interaction process shown in this figure is as follows: 1. The user inputs a first voice instruction: "songs by Liu Dehua". 2. The television terminal opens the music playing interface and plays Liu Dehua's songs. 3. The user closes the Liu Dehua music interface by a manual operation such as the remote control and inputs a second voice instruction: "Zhang Xueyou"; the television terminal's execution result for the second instruction is: a unified search for information about Zhang Xueyou. 4. The user switches from the Liu Dehua song interface to another singer's interface by a manual operation such as the remote control and inputs a third voice instruction: "who sings this song"; the television terminal's execution result for the third instruction is: identify the song being played and return the singer's name. 5. During song playback the user inputs a fourth voice instruction: "recite a Tang poem", and then exits the Tang poem interface. When the user inputs a fifth voice instruction: "change to another one", the television terminal's execution result for the fifth instruction is: change to another song.
As the two interaction processes and results shown in Figs. 7-1 and 7-2 indicate, when the user interposes interactions such as remote control operations between rounds of voice instructions, the application scene of the television terminal has in fact changed, yet the existing context judgment approach still treats the exchange as a continuing multi-round interaction, so the execution result of the television terminal deviates considerably from the user's actual intent. With the voice information identified by the context judgment method based on scene perception provided by this application, when the application scene of the television terminal changes, the multi-round context interaction scenario is exited at the right moment: second voice instructions such as "Zhang Xueyou", "who sings this song" and "change to another one" are parsed independently, and the parsed semantic information is returned to the device side, so that the user's instruction is executed correctly.
In addition, based on the method embodiment, the application also provides a context judgment device based on scene perception, and the context judgment device is applied to a cloud server side. Please refer to fig. 8, which is a schematic structural diagram of a context determination device based on scene awareness according to an embodiment of the present application. As can be seen from fig. 8, the apparatus comprises:
the first parsing module 100 is configured to parse a first voice instruction sent by the device end, and return the parsed first semantic information to the device end, where the first voice instruction does not have a context with a previous voice instruction.
The first detection module 200 is configured to detect whether a second voice instruction sent by a device end has a context relationship with the first voice instruction, where a receiving time of the second voice instruction is later than a receiving time of the first voice instruction.
A second detecting module 300, configured to detect whether the first voice instruction and the second voice instruction are in the same application scenario when the second voice instruction has a context relationship with the first voice instruction.
And a second parsing module 400, configured to parse the second voice instruction through a context relationship when the first voice instruction and the second voice instruction are in the same application scenario, and return the parsed second semantic information to the device side.
And a third parsing module 500, configured to, when the first voice instruction and the second voice instruction are in different application scenarios, parse the second voice instruction separately, and return third semantic information obtained through parsing to the device side.
In addition, based on the above method embodiment, the present application further provides a context determination device based on scene awareness, which is applied to an equipment side. Please refer to fig. 9, which is a schematic structural diagram of a context determining apparatus based on scene awareness according to an embodiment of the present application. As can be seen from fig. 9, the apparatus comprises:
the first obtaining module 10 is configured to obtain a first voice instruction sent by a user, and send the first voice instruction and application scenario information corresponding to the first voice instruction to a cloud server;
the first execution module 20 is configured to execute the first voice instruction according to the first semantic information returned by the cloud server;
the second obtaining module 30 is configured to obtain a second voice instruction sent by a user, and send the second voice instruction and application scenario information corresponding to the second voice instruction to a cloud server;
the second execution module 40 is configured to execute the second voice instruction according to second semantic information or third semantic information returned by the cloud server, where the cloud server returns the second semantic information when the first voice instruction and the second voice instruction are in the same application scenario, and the cloud server returns the third semantic information when the first voice instruction and the second voice instruction are in different application scenarios.
In addition, an embodiment of the present application further provides a context determination system based on scene awareness, and please refer to fig. 10, which is a schematic structural diagram of the context determination system based on scene awareness provided in the embodiment of the present application. As shown in fig. 10, the system includes a television terminal 1 and a cloud server 2, where the television terminal includes a context determining device 11 applied to an equipment side in the foregoing embodiment, and a parsing result of semantics by the context determining device may be presented in the form of an intelligent interactive client. The television terminal is used for acquiring a voice instruction sent by a user and scene information corresponding to the voice instruction.
The cloud server comprises a context judgment device 21 applied to the cloud server side in the above embodiment, and the cloud server is configured to analyze semantic information of the voice instruction according to the voice instruction and the scene information sent by the television terminal, so that the television terminal executes the voice instruction according to the semantic information. In addition, the system further includes an instruction input device 3, where the instruction input device is used to input a scene switching instruction, and in this embodiment, the instruction input device is a remote controller.
The embodiments in this specification are described in a progressive manner; for identical or similar parts the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments, being substantially similar to the method embodiments, are described relatively briefly; for relevant details, refer to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without inventive effort.
The foregoing is merely a detailed description of the invention, and it should be noted that modifications and adaptations by those skilled in the art may be made without departing from the principles of the invention, and should be considered as within the scope of the invention.

Claims (8)

1. A speech recognition method based on scene perception is characterized by comprising the following steps:
analyzing a first voice instruction sent by an equipment end, and returning first semantic information obtained by analysis to the equipment end so that the equipment end can execute the first voice instruction according to the first semantic information;
detecting whether a second voice instruction sent by a device end has a context relationship with the first voice instruction or not, wherein the receiving time of the second voice instruction is later than that of the first voice instruction;
when the second voice instruction and the first voice instruction have a context relationship, detecting whether the first voice instruction and the second voice instruction are in the same application scene;
if the first voice instruction and the second voice instruction are in the same application scene, analyzing the second voice instruction through context, and returning second semantic information obtained through analysis to the equipment end;
and if the first voice instruction and the second voice instruction are in different application scenes, independently analyzing the second voice instruction, and returning third semantic information obtained by analysis to the equipment end so that the equipment end can execute the second voice instruction according to the second semantic information or the third semantic information.
2. The method according to claim 1, wherein the detecting whether the second voice instruction sent by the device side has a context relationship with the first voice instruction comprises:
receiving first application scene information and second application scene information sent by a device end, wherein the application scene information comprises an application program name, a display interface name and an interface name of an interactive client end, which are operated by the device end when a voice instruction is received;
when the scene database does not contain the first application scene information and the second application scene information at the same time, confirming that the second voice instruction does not have a contextual relation with the first voice instruction;
when the scene database simultaneously contains the first application scene information and the second application scene information, confirming the context relationship between the second voice instruction and the first voice instruction according to a pre-constructed knowledge base, wherein the knowledge base comprises a plurality of service modules, and the service modules comprise a plurality of semantic slots of service dimension information.
3. The method of claim 2, wherein detecting whether the first voice instruction and the second voice instruction are in the same application scenario comprises:
when the device end receives the second voice instruction and the application program operated by the device end is third-party software, detecting whether the application program name in the first application scene information is the same as the application program name in the second application scene information;
if the application program names are the same, confirming that the first voice instruction and the second voice instruction are in the same application scene;
if the application program names are different, confirming that the first voice instruction and the second voice instruction are in different application scenes;
when the equipment end receives a second voice instruction and the application program operated by the equipment end is not third-party software, detecting whether the display interface name in the first application scene information is the same as the display interface name in the second application scene information;
if the display interface names are the same, confirming that the first voice instruction and the second voice instruction are in the same application scene;
and if the display interface names are different, confirming that the first voice instruction and the second voice instruction are in different application scenes.
4. The method according to claim 2, wherein the parsing the first voice command sent by the device side comprises:
extracting a first keyword in the first voice instruction;
searching a first service module corresponding to the first voice instruction according to the first keyword;
dividing the service dimension of the first keyword according to the semantic slot corresponding to the first service module;
and analyzing first semantic information of the first voice command according to the service dimension of the first keyword.
5. The method according to claim 4, wherein the detecting whether the second voice instruction sent by the device side has a context relationship with the first voice instruction comprises:
extracting a second keyword in the second voice instruction;
and when the second keyword and the first keyword have the same service dimension, determining whether the second voice instruction and the first voice instruction have a context relationship.
6. A context determination apparatus based on scene awareness, comprising:
the first analysis module is used for analyzing a first voice instruction sent by the equipment end and returning first semantic information obtained by analysis to the equipment end so that the equipment end can execute the first voice instruction according to the first semantic information;
the first detection module is used for detecting whether a second voice instruction sent by a device end has a context relationship with the first voice instruction or not, and the receiving time of the second voice instruction is later than that of the first voice instruction;
the second detection module is used for detecting whether the first voice instruction and the second voice instruction are in the same application scene or not when the second voice instruction and the first voice instruction have a context relationship;
the second analysis module is used for analyzing a second voice instruction through a context relationship when the first voice instruction and the second voice instruction are in the same application scene, and returning second semantic information obtained through analysis to the equipment terminal;
and the third analysis module is used for independently analyzing the second voice instruction when the first voice instruction and the second voice instruction are in different application scenes, and returning third semantic information obtained by analysis to the equipment terminal so that the equipment terminal can execute the second voice instruction according to the second semantic information or the third semantic information.
7. A context judgment system based on scene perception is characterized by comprising a television terminal and a cloud server, wherein,
the cloud server comprises the context judgment device based on scene perception as claimed in claim 6, and is configured to parse semantic information of the voice instruction according to the voice instruction and the scene information sent by the television terminal, so that the television terminal executes the voice instruction according to the semantic information.
8. The context determination system based on scene awareness according to claim 7, further comprising an instruction input device for inputting a scene switching instruction.
CN201810646326.XA 2018-06-21 2018-06-21 Context judgment method, device and system based on scene perception Active CN110634477B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201810646326.XA | 2018-06-21 | 2018-06-21 | Context judgment method, device and system based on scene perception (granted as CN110634477B)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201810646326.XA | 2018-06-21 | 2018-06-21 | Context judgment method, device and system based on scene perception (granted as CN110634477B)

Publications (2)

Publication Number | Publication Date
CN110634477A | 2019-12-31
CN110634477B | 2022-01-25

Family

ID=68966455

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201810646326.XA | Context judgment method, device and system based on scene perception | 2018-06-21 | 2018-06-21

Country Status (1)

Country Link
CN (1) CN110634477B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140025380A1 (en) * 2012-07-18 2014-01-23 International Business Machines Corporation System, method and program product for providing automatic speech recognition (asr) in a shared resource environment
CN104811777A (en) * 2014-01-23 2015-07-29 阿里巴巴集团控股有限公司 Smart television voice processing method, smart television voice processing system and smart television
US20180004729A1 (en) * 2016-06-29 2018-01-04 Shenzhen Gowild Robotics Co., Ltd. State machine based context-sensitive system for managing multi-round dialog
CN108022590A (en) * 2016-11-03 2018-05-11 谷歌有限责任公司 Focusing session at speech interface equipment
CN106792047A (en) * 2016-12-20 2017-05-31 Tcl集团股份有限公司 The sound control method and system of a kind of intelligent television

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111367407A (en) * 2020-02-24 2020-07-03 Oppo(重庆)智能科技有限公司 Intelligent glasses interaction method, intelligent glasses interaction device and intelligent glasses
CN111367407B (en) * 2020-02-24 2023-10-10 Oppo(重庆)智能科技有限公司 Intelligent glasses interaction method, intelligent glasses interaction device and intelligent glasses
CN113806503A (en) * 2021-08-25 2021-12-17 北京库睿科技有限公司 Dialog fusion method, device and equipment
CN115064167A (en) * 2022-08-17 2022-09-16 广州小鹏汽车科技有限公司 Voice interaction method, server and storage medium
CN115064167B (en) * 2022-08-17 2022-12-13 广州小鹏汽车科技有限公司 Voice interaction method, server and storage medium
CN117059074A (en) * 2023-10-08 2023-11-14 四川蜀天信息技术有限公司 Voice interaction method and device based on intention recognition and storage medium
CN117059074B (en) * 2023-10-08 2024-01-19 四川蜀天信息技术有限公司 Voice interaction method and device based on intention recognition and storage medium

Also Published As

Publication Number | Publication Date
CN110634477B | 2022-01-25

Similar Documents

Publication Publication Date Title
CN110634477B (en) Context judgment method, device and system based on scene perception
JP7335062B2 (en) Voice service providing method and apparatus
US10922355B2 (en) Method and apparatus for recommending news
CN110209843B (en) Multimedia resource playing method, device, equipment and storage medium
CN106098063B (en) Voice control method, terminal device and server
WO2016150083A1 (en) Information input method and apparatus
CN111522909B (en) Voice interaction method and server
CN110557659B (en) Video recommendation method and device, server and storage medium
CN109165302A (en) Multimedia file recommendation method and device
CN107785018A (en) More wheel interaction semantics understanding methods and device
CN113569037A (en) Message processing method and device and readable storage medium
CN109783656B (en) Recommendation method and system of audio and video data, server and storage medium
CN109670020B (en) Voice interaction method, system and device
CN108958503A (en) input method and device
US11756544B2 (en) Selectively providing enhanced clarification prompts in automated assistant interactions
CN110708607A (en) Live broadcast interaction method and device, electronic equipment and storage medium
CN110866200A (en) Service interface rendering method and device
CN111428512A (en) Semantic recognition method, device and equipment
CN111970525B (en) Live broadcast room searching method and device, server and storage medium
CN109275047A (en) Video information processing method and device, electronic equipment, storage medium
CN111105294A (en) VR navigation method, system, client, server and storage medium thereof
CN113132214B (en) Dialogue method, dialogue device, dialogue server and dialogue storage medium
JP2023506087A (en) Voice Wakeup Method and Apparatus for Skills
CN105404681A (en) Live broadcast sentiment classification method and apparatus
CN111954017B (en) Live broadcast room searching method and device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant