CN110634477A - Context judgment method, device and system based on scene perception - Google Patents
- Publication number
- CN110634477A, CN201810646326.XA
- Authority
- CN
- China
- Prior art keywords
- voice instruction
- voice
- scene
- context
- instruction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26—Speech to text systems
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/472—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
Abstract
Embodiments of the invention provide a context judgment method, device and system based on scene perception. The contextual relationship between voice instructions and changes in the application scene are jointly used as the basis for selecting the parsing mode of a voice instruction, which better overcomes the limitations of contextual interaction and improves the accuracy of voice instruction parsing when the application scene changes.
Description
Technical Field
The present disclosure relates to the field of smart television technologies, and in particular, to a context determination method, device, and system based on scene awareness.
Background
With the rapid development of artificial intelligence, television services are increasingly combined with intelligent voice and semantic understanding, allowing users to search for services more conveniently and quickly through voice instructions.
In multi-round interaction scenarios, successively input voice instructions are usually related, so the semantics of a later instruction can be inferred from the earlier instruction and the context. Existing contextual semantic understanding is mainly rule-based: the instruction last input by the user is used to predict the instruction likely to appear next. If the two successive instructions satisfy a preset grammar rule, the later instruction can be parsed through the context; if they do not, the instructions are treated as unrelated and the later instruction must be parsed on its own. For example, after a user inputs the voice instruction "weather in Qingdao" and then inputs "what about tomorrow", the two instructions are related, and the actual meaning of the second instruction, "weather in Qingdao tomorrow", is easily resolved through the context.
The interaction between a user and a television terminal is unpredictable. In most scenes the voice instruction search function works as spoken, but in some scenes the result executed by the television terminal is inconsistent with the intent of the user's instruction.
Disclosure of Invention
Embodiments of the invention provide a context judgment method, device and system based on scene perception, aiming to solve the prior-art problem that the accuracy of context perception and context prediction is easily reduced when judgment relies only on the context information of the user's voice input.
In a first aspect, the present invention provides a speech recognition method based on scene awareness, which is characterized by comprising:
analyzing a first voice instruction sent by an equipment end, and returning first semantic information obtained by analysis to the equipment end so that the equipment end can execute the first voice instruction according to the first semantic information;
detecting whether a second voice instruction sent by a device end has a context relationship with the first voice instruction or not, wherein the receiving time of the second voice instruction is later than that of the first voice instruction;
when the second voice instruction and the first voice instruction have a context relationship, detecting whether the first voice instruction and the second voice instruction are in the same application scene;
if the first voice instruction and the second voice instruction are in the same application scene, analyzing the second voice instruction through context, and returning second semantic information obtained through analysis to the equipment end;
and if the first voice instruction and the second voice instruction are in different application scenes, independently analyzing the second voice instruction, and returning third semantic information obtained by analysis to the equipment end so that the equipment end can execute the second voice instruction according to the second semantic information or the third semantic information.
In a second aspect, the present invention provides a context determination apparatus based on scene awareness, including:
the first analysis module is used for analyzing a first voice instruction sent by the equipment end and returning first semantic information obtained by analysis to the equipment end so that the equipment end can execute the first voice instruction according to the first semantic information;
the first detection module is used for detecting whether a second voice instruction sent by a device end has a context relationship with the first voice instruction or not, and the receiving time of the second voice instruction is later than that of the first voice instruction;
the second detection module is used for detecting whether the first voice instruction and the second voice instruction are in the same application scene or not when the second voice instruction and the first voice instruction have a context relationship;
the second analysis module is used for analyzing a second voice instruction through a context relationship when the first voice instruction and the second voice instruction are in the same application scene, and returning second semantic information obtained through analysis to the equipment terminal;
and the third analysis module is used for independently analyzing the second voice instruction when the first voice instruction and the second voice instruction are in different application scenes, and returning third semantic information obtained by analysis to the equipment terminal so that the equipment terminal can execute the second voice instruction according to the second semantic information or the third semantic information.
In a third aspect, the invention provides a context judgment system based on scene awareness, comprising a television terminal and a cloud server, wherein,
the cloud server comprises the context judgment device based on scene perception, and is used for analyzing semantic information of the voice instruction according to the voice instruction and the scene information sent by the television terminal, so that the television terminal can execute the voice instruction according to the semantic information.
The beneficial effects of this application are as follows:
the embodiment of the invention provides a context judgment method, a context judgment device and a context judgment system based on scene perception. The context relation of the voice command and the change of the application scene are jointly used as the selection basis of the voice command analysis mode, the limitation of context interaction can be better broken through, and the accuracy of voice command analysis is improved in the scene that the application scene changes.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a flowchart of a method for context determination based on scene awareness according to an embodiment of the present disclosure;
fig. 2 is a flowchart of a method of step S210 according to an embodiment of the present disclosure;
fig. 3 is a flowchart of a method of step S310 according to an embodiment of the present disclosure;
fig. 4 is a flowchart of a method of step S110 according to an embodiment of the present disclosure;
fig. 5 is a flowchart of a method of step S214 according to an embodiment of the present disclosure;
fig. 6 is a flowchart of another context determination method based on scene awareness according to an embodiment of the present application;
fig. 7-1 is a diagram illustrating an interaction result of a conventional context determination method;
fig. 7-2 is a diagram of an interaction result of a context determination method based on scene awareness according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a context determining apparatus based on scene awareness according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a context determining apparatus based on scene sensing according to an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of a context determination system based on scene awareness according to an embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
The interaction between a user and a television terminal is unpredictable. In most scenes the voice instruction search function works as spoken, but in some scenes voice operation is slightly cumbersome: for example, a user can close or exit an interface simply by pressing the return key, whereas with voice the user must first press the voice key and then speak an "exit" or "return" instruction. Because interaction through physical keys is quicker and more convenient, users switch operation modes according to personal preference during voice interaction. In such a scenario, if context sensing and prediction consider only the context of the user's voice input and ignore the other interactive operations, the accuracy of context perception and context prediction drops, and the result executed by the television terminal becomes inconsistent with the intent of the user's instruction.
To address the prior-art problem that the accuracy of context perception and context prediction is easily reduced when judgment relies only on the context of the user's voice input, this application provides a context judgment method, device and system based on scene perception. The core idea is as follows: after determining that a later-received second voice instruction has a contextual relationship with an earlier-received first voice instruction, detect whether the two instructions are in the same application scene. If the application scene has not changed, the contextual relationship between the two voice instructions is valid and the cloud server parses the second voice instruction using that context. If the application scene has changed, the contextual relationship is invalid, and the cloud server parses the second voice instruction independently, without combining it with the first voice instruction. Using the contextual relationship of voice instructions together with changes in the application scene as the basis for selecting the parsing mode better overcomes the limitations of contextual interaction and improves parsing accuracy when the application scene changes. The embodiments provided in this application are described in detail below with reference to the accompanying drawings.
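The selection basis can be condensed into the following minimal sketch (Python is used purely for illustration; the function and its boolean inputs are names introduced for this sketch, not part of the patented implementation, and the two detection steps are assumed to have been carried out already):

```python
# Illustrative condensation of the core idea: the parsing mode for the later
# voice instruction depends both on the contextual relationship and on whether
# the application scene changed between the two instructions.
def choose_parse_mode(has_context: bool, scene_changed: bool) -> str:
    if not has_context:
        return "parse independently"   # no contextual relationship at all
    if scene_changed:
        return "parse independently"   # context exists but expired with the scene change
    return "parse with context"        # contextual relationship still valid

print(choose_parse_mode(has_context=True, scene_changed=True))  # -> parse independently
```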
Referring to fig. 1, a flowchart of a context determination method based on scene awareness according to an embodiment of the present application is shown. The method is applied at the cloud server side, as opposed to the device side. The cloud server has storage and computing functions: it can store the voice instruction information and application scene information reported by the device side, along with the scene database information, and it parses the semantics of an instruction according to the received voice instruction information and application scene information. The voice instruction information and the application scene information are generally text information generated by the television device. Specifically, as can be seen from fig. 1, the method comprises the following steps:
step S110: and analyzing the first voice command sent by the equipment terminal, and returning the analyzed first semantic information to the equipment terminal so that the equipment terminal executes the first voice command according to the first semantic information.
In an actual human-computer interaction scenario, a user and a television terminal typically engage in multiple rounds of interaction, during which the user issues several voice instructions. These instructions usually have a certain relevance, that is, they have a context, so the semantics of a later instruction can be inferred from the instructions issued earlier. In this embodiment, the first voice instruction is the first voice instruction of a multi-round interaction.
Step S210: detecting whether a second voice instruction sent by a device end has a context relationship with the first voice instruction, if so, executing the step S310, wherein the receiving time of the second voice instruction is later than that of the first voice instruction.
Referring to fig. 2, a flowchart of a method of step S210 according to an embodiment of the present disclosure is shown. As can be seen from fig. 2, step S210 includes the following steps:
step S211: and receiving the first application scene information and the second application scene information sent by the equipment terminal.
The first application scene information comprises the application program name, the display interface name, and the interface name of the interactive client that the device end is running when the first voice instruction is received; the second application scene information comprises the same items at the moment the second voice instruction is received. In this embodiment, the device side sends the first voice instruction together with the first application scene information and the second voice instruction together with the second application scene information, and the cloud server stores the received first and second application scene information in memory for the subsequent scene-change and context judgments. Of course, in other embodiments of the present application, the device side may also send the first application scene information and the second application scene information to the cloud server at the same time.
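Purely as an illustration, the reported application scene information might take the following shape; the field names are assumptions of this sketch, mirroring the three items listed above, and the sample values follow Table 1 below.

```python
# Hypothetical shape of the application scene information the device side
# reports together with a voice instruction (field names are assumptions).
first_application_scene_info = {
    "package_name": "com.tencent.qqmusic",          # application program name run by the device side
    "class_name": "com.tencent.qqmusic.playview",   # display interface name
    "scene_name": "Song playing scene",             # interface name of the interactive client
}
# A second report of the same shape accompanies the second voice instruction;
# the cloud server keeps both for the scene-change and context judgments below.
```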
Step S212: detecting whether a preset scene database simultaneously contains the first application scene information and the second application scene information.
In this embodiment, some application scenes cannot support parsing the semantic information of a voice instruction through context, so a preliminary judgment on whether context parsing is possible is made against a scene database pre-stored on the cloud server. The scene database contains the application scenes that support context parsing; each scene entry includes the corresponding scene information, namely the application program name (i.e., the package name) run by the device side, the display interface name, and the interface name of the interactive client. The specific form of the scene database is shown in Table 1.
Table 1: storage form table of scene database
Package name (packagename) | Display interface name (classname) | Interface name (scenename)
---|---|---
com.tencent.qqmusic | com.tencent.qqmusic.playview | Song playing scene
com.tencent.qqmusic | com.tencent.qqmusic.listview | Song list scene
com.xiaomi.videochat | com.hisense.videochat.contactview | Video call contact scene
com.xiaomi.videochat | com.hisense.videochat.talkview | Video call scene
…… | …… | ……
When the scene database does not contain both the first application scene information and the second application scene information, step S213 is executed: the application program targeted by one or both of the voice instructions cannot have its semantics parsed through context, the two instructions have no contextual relationship, and the multi-round interaction must be exited so that the second voice instruction is parsed on its own. When the scene database contains both the first application scene information and the second application scene information, step S214 is executed: the application programs targeted by the first and second voice instructions support context parsing, the two instructions have a contextual relationship, and the semantic information of the second voice instruction can be parsed through the pre-built knowledge base.
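A minimal sketch of this preliminary check, assuming the scene database is keyed by package name and display interface name as in Table 1 (the data structure and function names are illustrative, not the patent's implementation):

```python
# Scene database of application scenes that support context parsing (Table 1).
SCENE_DB = {
    ("com.tencent.qqmusic", "com.tencent.qqmusic.playview"),
    ("com.tencent.qqmusic", "com.tencent.qqmusic.listview"),
    ("com.xiaomi.videochat", "com.hisense.videochat.contactview"),
    ("com.xiaomi.videochat", "com.hisense.videochat.talkview"),
}

def context_possible(first_info: dict, second_info: dict) -> bool:
    # Steps S212-S214: a contextual relationship is considered only when both
    # reported application scenes are present in the scene database.
    def in_db(info: dict) -> bool:
        return (info["package_name"], info["class_name"]) in SCENE_DB
    return in_db(first_info) and in_db(second_info)

print(context_possible(
    {"package_name": "com.tencent.qqmusic", "class_name": "com.tencent.qqmusic.playview"},
    {"package_name": "com.tencent.qqmusic", "class_name": "com.tencent.qqmusic.listview"},
))  # True: the knowledge base is then used to confirm the contextual relationship
```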
Step S213: confirming that the second voice instruction does not have a contextual relationship with the first voice instruction.
Step S214: and confirming the context relationship between the second voice instruction and the first voice instruction according to a pre-constructed knowledge base. The knowledge base comprises a plurality of business modules, and the business modules comprise a plurality of semantic slots of business dimension information.
Step S310: detecting whether the first voice instruction and the second voice instruction are in the same application scene.
If the first voice instruction and the second voice instruction are in the same application scene, it is indicated that the context relationship between the two voice instructions is valid, the cloud server continues to use the context relationship to analyze the second voice instruction, and step S410 is executed;
if the first voice instruction and the second voice instruction are in different application scenes, it indicates that the context relationship between the two previous and next voice instructions is invalid, and when the cloud server analyzes the second voice instruction, the cloud server does not combine the context relationship with the first voice instruction, but independently analyzes the second voice instruction, and step S510 is executed. In this embodiment, the factor causing the application scene change may be a voice instruction or an operation instruction sent by a user through a remote controller.
In this embodiment, the criterion for judging whether the first voice instruction and the second voice instruction are in the same application scene depends on the attributes of the application program. If the software running at the corresponding moment is third-party software (such as QQ, WeChat, or Weibo), whether the application scene has changed can be determined from the application name (i.e., the package name): if the application names corresponding to the first and second voice instructions are the same, the application scene is considered unchanged. If the software running at the corresponding moment is built-in software (such as search services for nearby food, hospitals, and the like), the change of the application scene can be judged from the change of the display interface name. Whether the running software is third-party software can be detected from the application scene information reported by the device side: if the application program name in the application scene information is empty, the software is built-in software of the local machine.
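A hedged sketch of this judgment, using the same illustrative dictionary shape as above (an empty application name stands for built-in software):

```python
def same_scene(first_info: dict, second_info: dict) -> bool:
    # Third-party software: compare by application (package) name.
    # Built-in software (empty package name): compare by display interface name.
    if second_info.get("package_name"):
        return first_info.get("package_name") == second_info.get("package_name")
    return first_info.get("class_name") == second_info.get("class_name")

# Both instructions arrive while the same third-party music app is running, so the
# application scene is judged unchanged even though the display interface differs.
print(same_scene(
    {"package_name": "com.tencent.qqmusic", "class_name": "com.tencent.qqmusic.playview"},
    {"package_name": "com.tencent.qqmusic", "class_name": "com.tencent.qqmusic.listview"},
))  # True
```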
Specifically, please refer to fig. 3, which illustrates a flowchart of the method of step S310 according to an embodiment of the present application. As can be seen from fig. 3, step S310 includes the following steps:
Step S311: detecting whether the application program run by the device end when it receives the second voice instruction is third-party software; if it is third-party software, step S312 is executed, and if it is not, step S315 is executed.
Step S312: detecting whether the application program name in the first application scene information is the same as the application program name in the second application scene information; if the two signals are the same, step S313 is executed, and if the two signals are not the same, step S314 is executed.
Step S313: confirming that the first voice instruction and the second voice instruction are in the same application scene;
step S314: confirming that the first voice instruction and the second voice instruction are in different application scenes;
step S315: detecting whether the display interface name in the first application scene information is the same as the display interface name in the second application scene information; if the two signals are the same, step S313 is executed, and if the two signals are not the same, step S314 is executed.
Step S410: and analyzing the second voice command through the context relationship, and returning the analyzed second semantic information to the equipment terminal.
Step S510: and independently analyzing the second voice command, and returning third semantic information obtained by analysis to the equipment terminal so that the equipment terminal executes the second voice command according to the second semantic information or the third semantic information.
Referring to fig. 4, a flowchart of a method of step S110 according to an embodiment of the present disclosure is shown. As shown in fig. 4, in other embodiments of the present application, step S110 may further include the following steps:
step S111: extracting a first keyword in the first voice instruction;
step S112: searching a first service module corresponding to the first voice instruction according to the first keyword;
step S113: dividing the service dimension of the first keyword according to the semantic slot corresponding to the first service module;
step S114: and analyzing first semantic information of the first voice command according to the service dimension of the first keyword.
Referring to fig. 5, a flowchart of a method of step S214 according to an embodiment of the present disclosure is shown. As shown in fig. 5, in other embodiments of the present application, step S214 may further include the following steps:
step S2141: extracting a second keyword in the second voice instruction;
step S2142: and when the second keyword and the first keyword have the same service dimension, determining whether the second voice instruction and the first voice instruction have a context relationship.
The knowledge base is provided with a plurality of service modules such as video, music, television control, applications, shopping, ticketing, food, and stocks. Service dimension information is abstracted according to the characteristics of each service field, and data is stored in the knowledge base according to the different service dimensions; the knowledge base may be stored as a database, a knowledge graph, or the like. For example, the movie service can support query dimensions including the movie name, genre, year, region, language, awards, actors, directors, and other service dimension information.
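Putting steps S111 to S114 together with these service modules, a toy sketch might look as follows; the keyword table and the crude tokenizer are assumptions of this sketch, not the knowledge base itself.

```python
# Toy service modules mapping keywords to service dimensions.
SERVICE_MODULES = {
    "weather": {"weather": "keyword", "qingdao": "location",
                "beijing": "location", "today": "time", "tomorrow": "time"},
    "music":   {"song": "keyword", "singer": "target_attribute"},
}

def parse_instruction(text: str) -> dict:
    words = text.lower().replace("?", "").split()              # step S111: extract keywords
    for service, dims in SERVICE_MODULES.items():              # step S112: locate the service module
        if any(w in dims for w in words):
            slots = {dims[w]: w for w in words if w in dims}   # step S113: divide by service dimension
            return {"service": service, **slots}               # step S114: first semantic information
    return {"service": None}

print(parse_instruction("weather in Qingdao"))
# -> {'service': 'weather', 'keyword': 'weather', 'location': 'qingdao'}
```

Steps S2141 and S2142 can reuse the same dimension table: if the keywords of the second instruction fall on a dimension already used by the first instruction, the two instructions are taken to have a contextual relationship.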
The method for constructing the knowledge base is briefly described below, taking the weather service as an example. The query dimension information of the weather service comprises time, location, weather phenomenon, air quality, and weather index. The time dimension can be relative time (such as today, tomorrow, or next week) or absolute time (such as March 8 or the Spring Festival); the location dimension may include specific place information such as Qingdao, Rizhao, or Hengshui; the weather phenomenon dimension may include sunny, cloudy, rain, temperature, and so on; the air quality dimension may include air quality, PM2.5, etc.; and the weather index dimension may include a dressing index, a sports index, a car wash index, and the like.
In this embodiment, the semantic slot is used as the knowledge representation of the service query dimension information, and the natural language understanding process is the process of parsing the user input into the predefined semantic slots. Taking the weather service as an example, the dimension information in the semantic slot may include service classification, service target attribute, time, location, weather keyword, weather phenomenon word, air quality, and weather index; parsing with the context is essentially the process of parsing the user input into these predefined semantic slots. The context resolution process is described below with an example. Suppose the user first enters "weather in Qingdao" and then enters "what about Beijing"; the specific parsing process is as follows:
First, for the semantic understanding of "weather in Qingdao", the location information in the semantic slot is filled with "Qingdao";
When the user then enters "what about Beijing", analyzing this single round of input on its own cannot reveal the user's intent. Combined with the analysis of the earlier input, the weather service is located: "Beijing" belongs to a query dimension of the weather service and has the same dimension as "Qingdao", so the semantic slot is replaced, that is, the location information in the slot is filled with "Beijing".
The service located at this point is the weather service, and the slot values have been filled. If the user then inputs "tomorrow", the analysis process is the same except that a date slot value is now added, completing the detailed context analysis.
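A minimal sketch of this slot replacement (the dictionaries stand in for the semantic slots, and the merge helper is an assumption of this sketch):

```python
def merge_with_context(previous_slots: dict, follow_up_slots: dict) -> dict:
    # Slots of the follow-up input overwrite slots of the same dimension filled
    # by the earlier instruction; all other previously filled slots are kept.
    merged = dict(previous_slots)
    merged.update(follow_up_slots)
    return merged

first = {"service": "weather", "location": "Qingdao"}       # "weather in Qingdao"
print(merge_with_context(first, {"location": "Beijing"}))    # "what about Beijing"
# -> {'service': 'weather', 'location': 'Beijing'}
print(merge_with_context(first, {"time": "tomorrow"}))       # "tomorrow"
# -> {'service': 'weather', 'location': 'Qingdao', 'time': 'tomorrow'}
```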
Corresponding to the context judgment method on the cloud server side, the embodiments of the application also provide a context judgment method applied at the device side. In this embodiment, the device side is a television terminal with voice recognition and remote control functions. The television terminal can receive voice instructions and remote control instructions from the user and acquire the application scene information of the terminal device at different moments. Before executing a voice instruction, it sends the received voice instruction and the corresponding application scene information to the cloud server; the cloud server performs the context judgment method of the above embodiments, parses the received voice instruction into a semantic instruction, and returns the semantic instruction to the television terminal, which then performs the related operation according to it.
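The device-side flow might be sketched as follows; the stubbed cloud call and the helper names are assumptions made purely for illustration and are not an interface defined by the patent.

```python
def send_to_cloud(instruction: str, scene_info: dict) -> dict:
    # Stub standing in for reporting the instruction and scene information to the
    # cloud server side, which chooses the parsing mode and returns the semantics.
    return {"service": "music", "action": "play", "query": instruction}

def handle_voice_instruction(instruction: str, scene_info: dict) -> dict:
    # Steps S120/S320: send the voice instruction together with the application
    # scene information; steps S220/S420: execute the returned semantic instruction.
    semantics = send_to_cloud(instruction, scene_info)
    return semantics

print(handle_voice_instruction(
    "songs by Liu Dehua",
    {"package_name": "com.tencent.qqmusic", "class_name": "com.tencent.qqmusic.playview"},
))
```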
Referring to fig. 6, a flowchart of another context determination method based on scene awareness according to an embodiment of the present application is shown. As can be seen from fig. 6, the method comprises the following steps:
step S120: the method comprises the steps of obtaining a first voice instruction sent by a user, and sending the first voice instruction and application scene information corresponding to the first voice instruction to a cloud server side.
Step S220: and executing the first voice instruction according to the first semantic information returned by the cloud server.
Step S320: and acquiring a second voice instruction sent by a user, and sending the second voice instruction and application scene information corresponding to the second voice instruction to a cloud server side.
Step S420: and executing the second voice instruction according to the second semantic information or the third semantic information returned by the cloud server. When the first voice instruction and the second voice instruction are in the same application scene, the cloud server end returns second semantic information, and when the first voice instruction and the second voice instruction are in different application scenes, the cloud server end returns third semantic information. According to the technical scheme, the context relation of the voice command and the change of the application scene are jointly used as the selection basis of the voice command analysis mode, the limitation of context interaction can be better broken through, and the accuracy of voice command analysis is favorably improved in the scene that the application scene changes.
Please refer to fig. 7-1 and fig. 7-2, which show the interaction result of a conventional context determination method and the interaction result of the context determination method based on scene perception according to an embodiment of the present application, respectively. Comparing fig. 7-1 and fig. 7-2, the human-computer interaction processes are identical, but after multiple rounds of interaction the instruction execution results fed back in the later rounds differ greatly.
Referring to fig. 7-1, the interaction process shown in this figure is as follows: 1. The user inputs a first voice instruction: "songs by Liu Dehua". 2. The television terminal opens a music playing interface and plays Liu Dehua's songs. 3. The user closes the Liu Dehua music interface through a manual operation such as the remote control and inputs a second voice instruction: "Zhang Xueyou"; the television terminal's execution result for the second voice instruction is: play Zhang Xueyou's songs. 4. The user switches from the Liu Dehua song interface to another singer's interface through a manual operation such as the remote control and inputs a third voice instruction: "who sings this song"; the television terminal's execution result for the third voice instruction is: Liu Dehua. 5. During song playback, the user inputs a fourth voice instruction: "recite a Tang poem", and then exits the Tang poetry interface. When the user inputs a fifth voice instruction: "change to another one", the television terminal's execution result for the fifth voice instruction is: change to another Tang poem.
Referring to fig. 7-2, the interaction process shown in this figure is as follows: 1. The user inputs a first voice instruction: "songs by Liu Dehua". 2. The television terminal opens a music playing interface and plays Liu Dehua's songs. 3. The user closes the Liu Dehua music interface through a manual operation such as the remote control and inputs a second voice instruction: "Zhang Xueyou"; the television terminal's execution result for the second voice instruction is: a unified search for Zhang Xueyou information. 4. The user switches from the Liu Dehua song interface to another singer's interface through a manual operation such as the remote control and inputs a third voice instruction: "who sings this song"; the television terminal's execution result for the third voice instruction is: identify the playing song and return the singer's name. 5. During song playback, the user inputs a fourth voice instruction: "recite a Tang poem", and then exits the Tang poetry interface. When the user inputs a fifth voice instruction: "change to another one", the television terminal's execution result for the fifth voice instruction is: change to another song.
As the two interaction processes and results shown in fig. 7-1 and fig. 7-2 demonstrate, when the user intervenes between multiple rounds of voice instructions with interactions such as remote control operations, the application scene of the television terminal has actually changed; the existing context determination approach still remains in the multi-round interaction scenario, so the execution result of the television terminal deviates considerably from the user's actual intent. With the context judgment method based on scene perception provided by this application, when the application scene of the television terminal changes, the multi-round context interaction can be exited at the right moment, follow-up voice instructions such as "Zhang Xueyou", "who sings this song", and "change to another one" are parsed independently, and the parsed semantic information is returned to the device side, so that the user's instruction is executed correctly.
In addition, based on the method embodiment, the application also provides a context judgment device based on scene perception, and the context judgment device is applied to a cloud server side. Please refer to fig. 8, which is a schematic structural diagram of a context determination device based on scene awareness according to an embodiment of the present application. As can be seen from fig. 8, the apparatus comprises:
the first parsing module 100 is configured to parse a first voice instruction sent by the device end, and return the parsed first semantic information to the device end, where the first voice instruction does not have a context with a previous voice instruction.
The first detection module 200 is configured to detect whether a second voice instruction sent by a device end has a context relationship with the first voice instruction, where a receiving time of the second voice instruction is later than a receiving time of the first voice instruction.
A second detecting module 300, configured to detect whether the first voice instruction and the second voice instruction are in the same application scenario when the second voice instruction has a context relationship with the first voice instruction.
And a second parsing module 400, configured to parse the second voice instruction through a context relationship when the first voice instruction and the second voice instruction are in the same application scenario, and return the parsed second semantic information to the device side.
And a third parsing module 500, configured to, when the first voice instruction and the second voice instruction are in different application scenarios, parse the second voice instruction separately, and return third semantic information obtained through parsing to the device side.
In addition, based on the above method embodiment, the present application further provides a context determination device based on scene awareness, which is applied to an equipment side. Please refer to fig. 9, which is a schematic structural diagram of a context determining apparatus based on scene awareness according to an embodiment of the present application. As can be seen from fig. 9, the apparatus comprises:
the first obtaining module 10 is configured to obtain a first voice instruction sent by a user, and send the first voice instruction and application scenario information corresponding to the first voice instruction to a cloud server;
the first execution module 20 is configured to execute the first voice instruction according to the first semantic information returned by the cloud server;
the second obtaining module 30 is configured to obtain a second voice instruction sent by a user, and send the second voice instruction and application scenario information corresponding to the second voice instruction to a cloud server;
the second execution module 40 is configured to execute the second voice instruction according to second semantic information or third semantic information returned by the cloud server, where the cloud server returns the second semantic information when the first voice instruction and the second voice instruction are in the same application scenario, and the cloud server returns the third semantic information when the first voice instruction and the second voice instruction are in different application scenarios.
In addition, an embodiment of the present application further provides a context determination system based on scene awareness, and please refer to fig. 10, which is a schematic structural diagram of the context determination system based on scene awareness provided in the embodiment of the present application. As shown in fig. 10, the system includes a television terminal 1 and a cloud server 2, where the television terminal includes a context determining device 11 applied to an equipment side in the foregoing embodiment, and a parsing result of semantics by the context determining device may be presented in the form of an intelligent interactive client. The television terminal is used for acquiring a voice instruction sent by a user and scene information corresponding to the voice instruction.
The cloud server comprises a context judgment device 21 applied to the cloud server side in the above embodiment, and the cloud server is configured to analyze semantic information of the voice instruction according to the voice instruction and the scene information sent by the television terminal, so that the television terminal executes the voice instruction according to the semantic information. In addition, the system further includes an instruction input device 3, where the instruction input device is used to input a scene switching instruction, and in this embodiment, the instruction input device is a remote controller.
The embodiments in this specification are described in a progressive manner; for the same or similar parts among the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively briefly since they are substantially similar to the method embodiments; for the relevant parts, refer to the descriptions of the method embodiments. The above-described apparatus and system embodiments are merely illustrative; the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
The foregoing is merely a detailed description of the invention, and it should be noted that modifications and adaptations by those skilled in the art may be made without departing from the principles of the invention, and should be considered as within the scope of the invention.
Claims (8)
1. A speech recognition method based on scene perception is characterized by comprising the following steps:
analyzing a first voice instruction sent by an equipment end, and returning first semantic information obtained by analysis to the equipment end so that the equipment end can execute the first voice instruction according to the first semantic information;
detecting whether a second voice instruction sent by a device end has a context relationship with the first voice instruction or not, wherein the receiving time of the second voice instruction is later than that of the first voice instruction;
when the second voice instruction and the first voice instruction have a context relationship, detecting whether the first voice instruction and the second voice instruction are in the same application scene;
if the first voice instruction and the second voice instruction are in the same application scene, analyzing the second voice instruction through context, and returning second semantic information obtained through analysis to the equipment end;
and if the first voice instruction and the second voice instruction are in different application scenes, independently analyzing the second voice instruction, and returning third semantic information obtained by analysis to the equipment end so that the equipment end can execute the second voice instruction according to the second semantic information or the third semantic information.
2. The method according to claim 1, wherein the detecting whether the second voice instruction sent by the device side has a context relationship with the first voice instruction comprises:
receiving first application scene information and second application scene information sent by a device end, wherein the application scene information comprises an application program name, a display interface name and an interface name of an interactive client end, which are operated by the device end when a voice instruction is received;
when the scene database does not contain the first application scene information and the second application scene information at the same time, confirming that the second voice instruction does not have a contextual relation with the first voice instruction;
when the scene database simultaneously contains the first application scene information and the second application scene information, confirming the context relationship between the second voice instruction and the first voice instruction according to a pre-constructed knowledge base, wherein the knowledge base comprises a plurality of service modules, and the service modules comprise a plurality of semantic slots of service dimension information.
3. The method of claim 2, wherein detecting whether the first voice instruction and the second voice instruction are in the same application scenario comprises:
when the device end receives the second voice instruction and the application program operated by the device end is third-party software, detecting whether the application program name in the first application scene information is the same as the application program name in the second application scene information;
if the first voice command and the second voice command are the same, confirming that the first voice command and the second voice command are in the same application scene;
if not, confirming that the first voice instruction and the second voice instruction are in different application scenes;
when the equipment end receives a second voice instruction and the application program operated by the equipment end is not third-party software, detecting whether the display interface name in the first application scene information is the same as the display interface name in the second application scene information;
if the first voice command and the second voice command are the same, confirming that the first voice command and the second voice command are in the same application scene;
and if the first voice command and the second voice command are different, confirming that the first voice command and the second voice command are in different application scenes.
4. The method according to claim 2, wherein the parsing the first voice command sent by the device side comprises:
extracting a first keyword in the first voice instruction;
searching a first service module corresponding to the first voice instruction according to the first keyword;
dividing the service dimension of the first keyword according to the semantic slot corresponding to the first service module;
and analyzing first semantic information of the first voice command according to the service dimension of the first keyword.
5. The method according to claim 4, wherein the detecting whether the second voice instruction sent by the device side has a context relationship with the first voice instruction comprises:
extracting a second keyword in the second voice instruction;
and when the second keyword and the first keyword have the same service dimension, determining whether the second voice instruction and the first voice instruction have a context relationship.
6. A context determination apparatus based on scene awareness, comprising:
the first analysis module is used for analyzing a first voice instruction sent by the equipment end and returning first semantic information obtained by analysis to the equipment end so that the equipment end can execute the first voice instruction according to the first semantic information;
the first detection module is used for detecting whether a second voice instruction sent by a device end has a context relationship with the first voice instruction or not, and the receiving time of the second voice instruction is later than that of the first voice instruction;
the second detection module is used for detecting whether the first voice instruction and the second voice instruction are in the same application scene or not when the second voice instruction and the first voice instruction have a context relationship;
the second analysis module is used for analyzing a second voice instruction through a context relationship when the first voice instruction and the second voice instruction are in the same application scene, and returning second semantic information obtained through analysis to the equipment terminal;
and the third analysis module is used for independently analyzing the second voice instruction when the first voice instruction and the second voice instruction are in different application scenes, and returning third semantic information obtained by analysis to the equipment terminal so that the equipment terminal can execute the second voice instruction according to the second semantic information or the third semantic information.
7. A context judgment system based on scene perception is characterized by comprising a television terminal and a cloud server, wherein,
the cloud server comprises the context judgment device based on scene perception as claimed in claim 6, and is configured to parse semantic information of the voice instruction according to the voice instruction and the scene information sent by the television terminal, so that the television terminal executes the voice instruction according to the semantic information.
8. The context determination system based on scene awareness according to claim 7, further comprising an instruction input device for inputting a scene switching instruction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810646326.XA CN110634477B (en) | 2018-06-21 | 2018-06-21 | Context judgment method, device and system based on scene perception |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110634477A true CN110634477A (en) | 2019-12-31 |
CN110634477B CN110634477B (en) | 2022-01-25 |
Family
ID=68966455
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810646326.XA Active CN110634477B (en) | 2018-06-21 | 2018-06-21 | Context judgment method, device and system based on scene perception |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110634477B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140025380A1 (en) * | 2012-07-18 | 2014-01-23 | International Business Machines Corporation | System, method and program product for providing automatic speech recognition (asr) in a shared resource environment |
CN104811777A (en) * | 2014-01-23 | 2015-07-29 | 阿里巴巴集团控股有限公司 | Smart television voice processing method, smart television voice processing system and smart television |
US20180004729A1 (en) * | 2016-06-29 | 2018-01-04 | Shenzhen Gowild Robotics Co., Ltd. | State machine based context-sensitive system for managing multi-round dialog |
CN108022590A (en) * | 2016-11-03 | 2018-05-11 | 谷歌有限责任公司 | Focusing session at speech interface equipment |
CN106792047A (en) * | 2016-12-20 | 2017-05-31 | Tcl集团股份有限公司 | The sound control method and system of a kind of intelligent television |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111367407A (en) * | 2020-02-24 | 2020-07-03 | Oppo(重庆)智能科技有限公司 | Intelligent glasses interaction method, intelligent glasses interaction device and intelligent glasses |
CN111367407B (en) * | 2020-02-24 | 2023-10-10 | Oppo(重庆)智能科技有限公司 | Intelligent glasses interaction method, intelligent glasses interaction device and intelligent glasses |
CN113806503A (en) * | 2021-08-25 | 2021-12-17 | 北京库睿科技有限公司 | Dialog fusion method, device and equipment |
CN115064167A (en) * | 2022-08-17 | 2022-09-16 | 广州小鹏汽车科技有限公司 | Voice interaction method, server and storage medium |
CN115064167B (en) * | 2022-08-17 | 2022-12-13 | 广州小鹏汽车科技有限公司 | Voice interaction method, server and storage medium |
CN117059074A (en) * | 2023-10-08 | 2023-11-14 | 四川蜀天信息技术有限公司 | Voice interaction method and device based on intention recognition and storage medium |
CN117059074B (en) * | 2023-10-08 | 2024-01-19 | 四川蜀天信息技术有限公司 | Voice interaction method and device based on intention recognition and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110634477B (en) | 2022-01-25 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | GR01 | Patent grant |