WO2019223351A1 - View-based voice interaction method, apparatus, server, terminal, and medium - Google Patents
View-based voice interaction method, apparatus, server, terminal, and medium
- Publication number
- WO2019223351A1 (PCT/CN2019/072339)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- instruction
- view
- information
- description information
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
- G10L2015/228—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context
Definitions
- the embodiments of the present application relate to the field of computer technology, for example, to a view-based voice interaction method, device, server, terminal, and medium.
- the embodiments of the present application provide a view-based voice interaction method, device, server, terminal, and medium to address the problems of high coupling between the voice interaction function and product business logic, and the lack of uniformity and poor generality in the development of voice interaction functions.
- An embodiment of the present application provides a view-based voice interaction method, which is applied to a server.
- the method includes:
- obtaining a user's voice information and the voice instruction description information of voice-operable elements in a current display view of a terminal, wherein the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and the voice instruction is set to describe a voice operation that can be performed on an element in the view;
- performing semantic recognition on the voice information according to the view description information of the voice-operable elements to obtain the user's operation intention;
- locating, from the voice instruction list according to the voice instruction description information, an instruction sequence matching the operation intention; and
- issuing the instruction sequence to the terminal for execution.
- An embodiment of the present application further provides a view-based voice interaction method applied to a terminal.
- the method includes:
- sending the monitored voice information of a user and the voice instruction description information of voice-operable elements in a current display view of the terminal to a server, wherein the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and the voice instruction is set to describe a voice operation that can be performed on an element in the view;
- receiving an instruction sequence determined by the server according to the voice information and the voice instruction description information, and executing the instruction processing logic corresponding to the voice instructions in the instruction sequence.
- An embodiment of the present application further provides a view-based voice interaction device configured on a server.
- the device includes:
- the voice and instruction information acquisition module is configured to obtain the user's voice information and the voice instruction description information of voice-operable elements in the current display view of the terminal, wherein the voice instruction description information includes a voice instruction list and the configuration information of each voice instruction in the list, and the voice instruction is set to describe a voice operation that can be performed on an element in the view;
- the semantic recognition module is configured to perform semantic recognition on the voice information according to the view description information of the voice-operable element to obtain a user's operation intention;
- An instruction sequence determination module configured to locate, from the voice instruction list, a to-be-executed instruction sequence that matches the operation intention according to the voice instruction description information
- the instruction issuing module is configured to issue the instruction sequence to be executed to a terminal for execution.
- An embodiment of the present application further provides a view-based voice interaction device configured on a terminal.
- the device includes:
- the voice and instruction information sending module is configured to send the monitored voice information of the user and the voice instruction description information of voice-operable elements in the current display view of the terminal to the server, wherein the voice instruction description information includes a voice instruction list and the configuration information of each voice instruction in the list, and the voice instruction is set to describe a voice operation that can be performed on an element in the view;
- the instruction sequence execution module is configured to receive an instruction sequence determined by the server according to the voice information and the voice instruction description information, and execute the instruction processing logic corresponding to the voice instruction in the instruction sequence.
- An embodiment of the present application further provides a server, including:
- one or more processors; and
- a storage device configured to store one or more programs,
- wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the view-based voice interaction method applied to the server according to any embodiment of the present application.
- An embodiment of the present application further provides a terminal, including:
- one or more processors; and
- a storage device configured to store one or more programs,
- wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the view-based voice interaction method applied to the terminal according to any embodiment of the present application.
- An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored.
- when the program is executed by a processor, the view-based voice interaction method applied to a server according to any embodiment of the present application is implemented.
- This embodiment of the present application also provides another computer-readable storage medium on which a computer program is stored.
- when the program is executed by a processor, the view-based voice interaction method applied to a terminal according to any embodiment of the present application is implemented.
- FIG. 1 is a flowchart of a view-based voice interaction method provided in Embodiment 1 of the present application;
- FIG. 2 is a flowchart of a view-based voice interaction method provided in Embodiment 2 of the present application;
- FIG. 3 is a flowchart of a view-based voice interaction method provided in Embodiment 3 of the present application.
- FIG. 4 is a flowchart of a view-based voice interaction method provided in Embodiment 4 of the present application.
- FIG. 5 is a schematic structural diagram of a view-based voice interaction device provided in Embodiment 5 of the present application.
- FIG. 6 is a schematic structural diagram of a view-based voice interaction device provided in Embodiment 6 of the present application.
- FIG. 7 is a schematic structural diagram of a server provided in Embodiment 7 of the present application.
- FIG. 8 is a schematic structural diagram of a terminal provided in Embodiment 8 of the present application.
- FIG. 1 is a flowchart of a view-based voice interaction method provided in Embodiment 1 of this application. This embodiment may be applicable to a case where an application implements view-based voice interaction in a server.
- the method may be implemented by a view-based voice interaction device.
- the device may be implemented in software and / or hardware, and may be integrated in a server. As shown in FIG. 1, the method specifically includes:
- S110 Obtain the user's voice information and the voice instruction description information of voice-operable elements in the current display view of the terminal, where the voice instruction description information includes the voice instruction list and the configuration information of each voice instruction in the list, and each voice instruction is set to describe the voice operations that can be performed on elements in the view.
- the view in this embodiment is a view on the terminal that the user can operate by voice, and the elements displayed on the view include both elements that can be operated by voice and elements that cannot. Voice operations are therefore directed at the voice-operable elements of the view, and the voice instruction is the core part that determines whether a view element can be operated by voice.
- the voice instruction description information is voice interaction configuration information preset by the developer, based on a standardized voice programming language, for the operations corresponding to the view elements of the terminal.
- each voice-operable element has a corresponding voice command and related information.
- the voice programming language in this embodiment is a computer program language developed specifically to standardize and generalize the voice interaction function; its main purpose is to separate the voice interaction function from the view presentation logic and to simplify the development of voice interaction through a unified Application Programming Interface (API).
- the voice instructions in the voice instruction description information exist as attributes of the view elements and are used to describe the voice operations that the user can perform; their functions can be extended through scripts. The voice instructions are also generic and can be flexibly combined with the components in the view.
- the related configuration information of the voice command can be configured through the voice attribute (voice-config).
- the voice command list in the voice command description information refers to all voice commands on the currently displayed view, and can be collected and organized in a list form.
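- As an illustrative sketch only, the voice instruction list could be collected from a view roughly as follows in TypeScript; the voice-config attribute name comes from this disclosure, while the voice-instruction attribute and all identifiers are assumptions:

```typescript
interface VoiceInstruction {
  id: string;                      // unique identifier of the instruction
  name: string;                    // e.g. "submit", "fast-forward", "select"
  config: Record<string, string>;  // parsed voice-config key/value pairs
}

// Collect the voice instruction list from every voice-operable element
// in the currently displayed view.
function collectVoiceInstructions(root: ParentNode): VoiceInstruction[] {
  const elements = root.querySelectorAll<HTMLElement>("[voice-instruction]");
  return Array.from(elements).map((el, i) => ({
    id: `vi-${i}`,
    name: el.getAttribute("voice-instruction") ?? "",
    config: JSON.parse(el.getAttribute("voice-config") ?? "{}"),
  }));
}
```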
- S120 Perform semantic recognition on the user's voice information according to the view description information of the voice-operable elements, and obtain the user's operation intention.
- the view description information of an element includes display scene information related to the specific composition of the view, such as the element's name, text label, and coordinate distribution on the view.
- the server performs semantic recognition on the user's voice information based on the view description information of the elements: key information identified in the user's voice is matched against the elements in the view to obtain an operation intention that is consistent with the currently displayed view.
- the sequence of instructions to be executed can be located by matching between the user's operation intention and the voice instruction description information of the voice-operable element.
- the voice instruction description information of the voice-operable element in the current display view of the terminal obtained by the server further includes a voice tag, where the voice tag is set to describe the voice-operable element information on the view.
- the voice tag can be set to assist in identifying and understanding the content of the view, and to find the corresponding voice command more accurately.
- developers using the voice programming language can set voice tags adaptively. For example, for simple views, where there is a one-to-one relationship between voice instructions and operations in the view, voice tags may be omitted; for complex views, the same voice instruction may correspond to different operation elements in the view, in which case a voice tag needs to be set.
- For example, suppose a user purchases a ticket by voice, and the user's voice information is "buy a ticket from place X to place Y". Whether it is a train ticket or a plane ticket, in the view the start and end points need to be entered in address input boxes, the departure time selected in a time box, and so on, and the voice instructions corresponding to these operations are the same. In this case, the to-be-executed voice instruction for buying a plane ticket from place X to place Y can be located according to the voice tag of the operable element associated with purchasing plane tickets.
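- A minimal sketch of such voice-tag disambiguation, under the assumption of hypothetical field and function names (voiceTag, disambiguate) not defined in this disclosure:

```typescript
interface TaggedInstruction {
  id: string;
  name: string;       // e.g. "input-address", shared by several elements
  voiceTag?: string;  // e.g. "train ticket" or "plane ticket"
}

// Prefer the candidate whose voice tag matches the recognized intent,
// e.g. "buy a plane ticket from X to Y" resolves to the element tagged
// "plane ticket" even though both forms expose the same instruction name.
function disambiguate(
  candidates: TaggedInstruction[],
  intentTag: string
): TaggedInstruction | undefined {
  return (
    candidates.find((c) => c.voiceTag === intentTag) ??
    (candidates.length === 1 ? candidates[0] : undefined)
  );
}
```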
- In related solutions, voice interaction is usually implemented by semantically matching the voice information entered by the user against the information of the controllable control objects on the displayed page in order to trigger the corresponding page operation.
- However, the controllable control objects are not themselves described by voice instructions, and no standardized voice programming language has been established. In particular, for the data description of the voice-operable elements of a view, front-end R&D engineers and strategy R&D engineers must jointly implement the corresponding data and voice operation content element by element in code, which makes subsequent upgrades and iterations very complicated.
- In this embodiment, the voice information input by the user is matched against the voice instruction description information of the view elements' voice instructions, where the voice instruction description information consists of voice instructions and configuration information defined in a standardized voice programming language.
- Because the view elements are described by instructions, the server can directly associate the voice information input by the user with a specific instruction and its configuration information.
- The developer therefore does not need to handle the components in the view individually, but only needs to attend to the voice instructions and instruction configuration information corresponding to the voice-operable elements on the view. This decouples the voice interaction function from the product's business logic and makes the two easy to maintain separately.
- Common solutions do not achieve this decoupling: when developing voice interaction, developers still have to process the view components at the same time as implementing the voice interaction function.
- the server sends the instruction sequence that matches the user's operation intention to the terminal, and the terminal performs the corresponding operation according to the received instruction sequence to complete the user's needs.
- the technical solution of this embodiment is implemented based on a standardized voice programming language.
- the voice information is semantically recognized to obtain the user's operation intention; further, an instruction sequence matching the user's operation intention is located from the voice instruction list; finally, the instruction sequence is issued to the terminal for execution.
- This embodiment solves the problems of high coupling between the voice interaction function and the business logic of the product, the lack of uniformity and poor generality in the development of the voice interaction function.
- With this solution, the developer only needs to configure the voice instruction description information for the voice-operable elements on the view, adding the voice instructions in the form of labels; this decouples the voice interaction function from the product's business logic and makes separate maintenance convenient.
- The unity and standardization of voice interaction development further improve the generality of the voice interaction function, which can therefore be quickly extended to different application scenarios.
- FIG. 2 is a flowchart of a view-based voice interaction method provided in Embodiment 2 of the present application. This embodiment is further optimized based on the foregoing embodiment. As shown in FIG. 2, the method specifically includes:
- performing voice recognition on the user's voice information according to the view description information of the voice-operable elements to obtain the corresponding query text includes:
- a pre-trained language model is used to dynamically decode the predicted acoustic features based on the view description information of speech-operable elements to generate corresponding query text.
- That is, the acoustic features are dynamically decoded in combination with the view description information, namely the view structure and the relationships between the elements in the view, so that the query text corresponding to the voice information is recognized in a view-specific way and the user's intention is identified more accurately.
- the server may use the acoustic model and the language model to generate the query text corresponding to the user's voice information through feature prediction and dynamic decoding, or may use other speech recognition methods in the art to obtain the query text, which is not limited in this embodiment.
- Acoustic models include, but are not limited to, hidden Markov models, and dynamic decoding can also be implemented using speech decoders.
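- The following simplified sketch conveys the idea of view-aware decoding by reranking n-best hypotheses against the words visible on the view; real decoders bias language-model scores internally, and every identifier here is an assumption:

```typescript
// Rerank n-best hypotheses by overlap with words visible on the view
// (assumes a non-empty hypothesis list).
function rerankByViewOverlap(
  hypotheses: { text: string; acousticScore: number }[],
  viewVocabulary: Set<string>
): string {
  const score = (h: { text: string; acousticScore: number }) =>
    h.acousticScore +
    h.text.split(/\s+/).filter((w) => viewVocabulary.has(w)).length;
  return hypotheses.reduce((a, b) => (score(a) >= score(b) ? a : b)).text;
}
```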
- View elements can be of multiple types, and the text label can be used to distinguish view elements.
- the server can simultaneously extract the text labels of the elements, so as to semantically mark the query text corresponding to the user's voice information, so as to better understand the user's intention in conjunction with the content displayed in the view.
- The execution order of operations S220 and S230 is not limited, as long as both the query text corresponding to the user's voice information and the text labels of the elements have been obtained before the semantic annotation is performed.
- Acoustic models, language models, and annotation models can be updated periodically during the semantic recognition process to ensure the accuracy of semantic recognition.
- the server uses the annotation model to obtain the user's operation intention, and then can determine the voice instruction in the voice instruction list.
- The developers of the voice programming language pre-configure the correspondence between semantics and voice instructions. After the user's operation intention is determined, the matching voice instructions are located step by step using this correspondence together with the voice instruction description information of the voice-operable elements on the view, forming an instruction sequence that includes the IDs of the voice instructions and the key values of their instruction configuration information.
- The ID of a voice instruction is its unique identifier and can, for example, identify the position of each voice instruction in the sequence; the key value identifies the specific execution characteristics corresponding to the voice instruction.
- For example, for a confirmation operation, the corresponding voice instruction is "submit" and the key value of the instruction configuration information is "OK" or "cancel"; for the playback operation of fast-forwarding to 2 minutes 30 seconds, the corresponding voice instruction is "fast forward" and the key value of the instruction configuration information is "2 minutes 30 seconds".
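- A sketch of the instruction sequence payload described above; the field names are assumptions consistent with the "ID plus key value" description:

```typescript
interface SequencedInstruction {
  id: string;   // unique instruction ID, also fixing its position in the sequence
  key: string;  // execution-specific key value from the configuration information
}

// e.g. a confirmation followed by a seek inside a player view:
const toExecute: SequencedInstruction[] = [
  { id: "submit-001", key: "OK" },
  { id: "fast-forward-002", key: "00:02:30" },
];
```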
- As another example, suppose the content displayed on the current view is the movie leaderboard of actor A, with ranking categories "hottest", "latest", and "most praised". The voice instructions in the view then include three list-switching instructions, whose configuration-information key values are "hottest", "latest", and "most praised" respectively.
- The process of locating the instruction sequence may include: using the pre-configured correspondence between semantics and voice instructions to determine a target voice instruction set from the voice instruction list; and then, according to the voice instruction description information, such as the voice tags and the key values in the instruction configuration information, locating an instruction sequence matching the user's operation intention from the target voice instruction set.
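- The two-stage lookup described above might be sketched as follows; all shapes and identifiers are illustrative assumptions:

```typescript
interface OperationIntent {
  action: string;                 // semantic action, e.g. "switch-list"
  slots: Record<string, string>;  // e.g. { category: "hottest" }
}

interface ListedInstruction {
  id: string;
  name: string;
  key: string;
  voiceTag?: string;
}

function locateSequence(
  intent: OperationIntent,
  semanticsToInstruction: Map<string, string>, // pre-configured correspondence
  instructionList: ListedInstruction[]
): { id: string; key: string }[] {
  // Stage 1: determine the target instruction set via the correspondence.
  const targetName = semanticsToInstruction.get(intent.action);
  const targets = instructionList.filter((i) => i.name === targetName);
  // Stage 2: narrow the set down by the key values carried in the intent.
  return targets
    .filter((i) => Object.values(intent.slots).includes(i.key))
    .map(({ id, key }) => ({ id, key }));
}
```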
- For example, according to the identified operation intention of the user, the server first determines that the voice instruction for the song-selecting operation in the current view is "select", and then determines a target voice instruction set containing multiple song names, where each song name corresponds to one "select" instruction.
- Alternatively, the target voice instruction set need not be determined: the voice instructions for selecting specific song names can be determined one by one directly from the song names in the user's voice information and then issued to the terminal in list form.
- The technical solution of this embodiment obtains the user's voice information and the voice instruction description information of the voice-operable elements in the current display view of the terminal; performs voice recognition and then semantic annotation to obtain the user's operation intention; locates the instruction sequence matching the user's operation intention from the voice instruction list; and finally issues the instruction sequence to the terminal for execution.
- This embodiment solves the problems of high coupling between the voice interaction function and the business logic of the product, the lack of uniformity and poor generality in the development of the voice interaction function.
- The developer only needs to configure the voice instruction description information for the voice-operable elements on the view, which decouples the voice interaction function from the product's business logic and makes separate maintenance convenient.
- The development of the voice interaction function is thereby unified and standardized, which improves the generality of the voice interaction function and allows it to be quickly extended to different application scenarios.
- FIG. 3 is a flowchart of a view-based voice interaction method provided in Embodiment 3 of the present application.
- This embodiment is applicable to the case where an application implements view-based voice interaction in a terminal, and it cooperates with the view-based voice interaction method applied to a server in the foregoing embodiments of the present application.
- The method can be executed by a view-based voice interaction device, which can be implemented in software and/or hardware and integrated in a terminal, such as a smart terminal like a mobile phone, tablet computer, or personal computer.
- the method specifically includes:
- S310 Send the monitored voice information of the user and the voice instruction description information of the voice-operable elements in the current display view of the terminal to the server, where the voice instruction description information includes the voice instruction list and the configuration information of each voice instruction in the list, and the voice instructions are set to describe the voice operations that can be performed on elements in the view.
- the voice instruction description information further includes a voice tag, where the voice tag is set to describe voice-operable element information on the view.
- The terminal monitors the user's voice information; specifically, the voice information can be collected through the terminal's microphone or an external sound collection device connected to the terminal and then sent to the server.
- When the terminal processor detects the user's voice input event, it simultaneously sends the voice instruction description information of the voice-operable elements in the current display view to the server. Information and data can be shared between the terminal and the server through network communication.
- the process can include three steps: voice object initialization, voice instruction initialization, and voice instruction data collection.
- voice object initialization includes monitoring the user's voice input, registering the voice object configuration, and initializing the view page voice object;
- the voice instruction initialization includes the document object model (DOM) analysis of the view, constructing the instruction configuration, and initializing the instruction configuration;
- Voice instruction data collection includes providing the instructions' configuration data, building instruction processors, and updating data information.
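- A hypothetical outline of these three terminal-side steps; every identifier below is an assumption introduced only to make the flow concrete:

```typescript
function prepareVoiceInteraction(view: Document) {
  // 1. Voice object initialization: listen for voice input and register
  //    the voice object configuration for the current view page.
  const voiceObject = { listening: true, config: {} as Record<string, unknown> };

  // 2. Voice instruction initialization: parse the view's DOM and build
  //    the instruction configuration for each voice-operable element.
  const instructions = Array.from(
    view.querySelectorAll<HTMLElement>("[voice-instruction]")
  ).map((el) => ({
    name: el.getAttribute("voice-instruction"),
    config: el.getAttribute("voice-config"),
  }));

  // 3. Voice instruction data collection: package the description info
  //    so it can be sent to the server together with the audio.
  return { voiceObject, instructions };
}
```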
- S320 Receive an instruction sequence determined by the server according to user voice information and voice instruction description information, and execute instruction processing logic corresponding to the voice instruction in the instruction sequence.
- The server performs instruction analysis according to the user's voice information and the voice instruction description information of the voice-operable elements, determines a matching instruction sequence, and sends the instruction sequence to the terminal.
- the terminal receives the response from the server and sends the corresponding command sequence to the command router of the terminal.
- the instruction router determines the voice instruction to be executed according to the instruction sequence, initializes the corresponding voice instruction, and then executes the corresponding instruction processing logic.
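- As a sketch only, assuming a hypothetical router API rather than the disclosed implementation, the dispatch could look like this:

```typescript
// Hypothetical instruction router: maps instruction IDs in the received
// sequence to locally registered handlers and runs them in order.
type Handler = (key: string) => void;

class InstructionRouter {
  private handlers = new Map<string, Handler>();

  register(id: string, handler: Handler): void {
    this.handlers.set(id, handler);
  }

  route(sequence: { id: string; key: string }[]): void {
    for (const { id, key } of sequence) {
      this.handlers.get(id)?.(key); // initialize and execute the instruction
    }
  }
}
```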
- In an embodiment, when executing the instruction processing logic corresponding to the voice instructions in the instruction sequence, voice events can be set to define personalized product logic on top of the specific instruction processing logic on the view, such as the way voice instructions are executed or the product's display effects.
- For example, suppose the content displayed on the current view is the "hottest" movie leaderboard of actor A, and the ranking categories also include "latest" and "most praised".
- The voice instructions in the view then include three list-switching instructions, whose configuration-information key values are "hottest", "latest", and "most praised".
- According to the received instruction sequence, the terminal switches the currently displayed "hottest" movie leaderboard to the "most praised" movie leaderboard and at the same time locks the second movie on that leaderboard for playback.
- The voice event associated with playing the second movie applies a special display effect to the movie's poster; specifically, the poster icon is enlarged and highlighted, and the movie is then played. Setting voice events in this way therefore makes the voice interaction function more diverse and engaging and gives users a better product experience.
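- A sketch of how such a voice event hook could wrap the default instruction processing, using the poster example above; the registry and handler names are assumptions:

```typescript
type VoiceEventHandler = (el: HTMLElement) => void;
const voiceEvents = new Map<string, VoiceEventHandler>();

// Product-specific logic: enlarge and highlight a movie poster before play.
voiceEvents.set("play-movie", (el) => {
  el.style.transform = "scale(1.2)";
  el.style.zIndex = "10";
});

// Run the personalized voice event (if any) around the default logic.
function executeInstruction(instructionId: string, target: HTMLElement): void {
  voiceEvents.get(instructionId)?.(target);
  // ...the instruction's default processing logic would follow here.
}
```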
- The technical solution of this embodiment sends the monitored voice information of the user and the voice instruction description information of the voice-operable elements in the current display view of the terminal to the server, then receives the instruction sequence determined by the server based on the user's voice information and the voice instruction description information, and executes the corresponding processing logic.
- This solves the problems of high coupling between the voice interaction function and the product's business logic and of the lack of uniformity and poor generality in voice interaction development; it decouples the voice interaction function from the product's business logic and standardizes the voice interaction function, improving its versatility so that it can be quickly extended to different application scenarios.
- FIG. 4 is a flowchart of a view-based voice interaction method provided in Embodiment 4 of the present application. This embodiment is further optimized based on the foregoing embodiment. As shown in FIG. 4, the method specifically includes:
- S410 Send the monitored voice information of the user and the voice instruction description information of the voice-operable elements in the current display view of the terminal to the server, where the voice instruction description information includes the voice instruction list and the configuration information of each voice instruction in the list, and the voice instructions are set to describe the voice operations that can be performed on elements in the view.
- S420 Receive an instruction sequence determined by the server according to user voice information and voice instruction description information, where the instruction sequence includes an ID of at least one voice instruction and a key value in its configuration information.
- If the instruction sequence includes the ID of a single voice instruction and the key value in its configuration information, the corresponding instruction processing logic is executed according to that voice instruction ID and key value.
- In this case, a unique voice instruction can be matched from the user's current voice information without multiple rounds of interaction. For example, when the user's voice information is "confirm", the corresponding voice instruction is "submit" with the key value "OK", and the terminal performs the confirmation operation according to the "submit" instruction.
- If the instruction sequence includes two or more voice instruction IDs and the key values in their configuration information, the target voice instruction in the sequence is determined through the user's interaction with the terminal, and the corresponding instruction processing logic is executed according to the target voice instruction's ID and key value.
- In this case, the final target voice instruction is determined through the user's interaction with the terminal. For example, in the currently displayed player view, if the voice information input by the user is "listen to a song", the corresponding voice instruction is "select". Based on the voice tag "song list", an instruction sequence containing play instructions for multiple songs can be determined. The user then continues by speaking the song name R, which determines the voice instruction for playing song R, and the terminal plays song R according to that instruction.
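- The single-instruction and multi-instruction cases might be dispatched as in the following sketch, where askUser stands in, as an assumption, for the terminal's real interaction flow:

```typescript
interface Instr { id: string; key: string }

function execute(instr: Instr): void {
  console.log(`executing ${instr.id} with key ${instr.key}`);
}

async function handleSequence(
  seq: Instr[],
  askUser: (options: string[]) => Promise<number>
): Promise<void> {
  if (seq.length === 1) {
    execute(seq[0]); // a unique match runs without further interaction
    return;
  }
  // Several candidates (e.g. many songs matched by a "select" instruction):
  // ask the user to narrow the choice, then execute the chosen target.
  const chosen = await askUser(seq.map((s) => s.key));
  execute(seq[chosen]);
}
```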
- The technical solution of this embodiment sends the monitored voice information of the user and the voice instruction description information of the voice-operable elements in the current display view of the terminal to the server, receives the instruction sequence determined by the server based on that information, determines the target voice instruction through interaction with the user when necessary, and executes the corresponding processing logic.
- This solves the problems of high coupling between the voice interaction function and the product's business logic and the lack of uniformity in voice interaction development; it decouples the voice interaction function from the business logic and standardizes the voice interaction function, improving its versatility so that it can be quickly extended to different application scenarios.
- The following are embodiments of the view-based voice interaction device provided by the embodiments of the present application.
- The device belongs to the same application concept as the view-based voice interaction method of the foregoing embodiments; for details not described in the device embodiments, reference may be made to the method embodiments above.
- FIG. 5 is a schematic structural diagram of a view-based voice interaction device provided in Embodiment 5 of the present application, which can be configured in a server, and this embodiment is applicable to a case of implementing view-based voice interaction.
- the view-based voice interaction device provided in the embodiment of the present application can execute the view-based voice interaction method applied to a server provided in any embodiment of the present application, and has corresponding function modules and beneficial effects of the execution method.
- the device specifically includes a voice and instruction information acquisition module 510, a semantic recognition module 520, an instruction sequence determination module 530, and an instruction issuing module 540, where:
- the voice and instruction information acquisition module 510 is configured to obtain the user's voice information and the voice instruction description information of the voice-operable elements in the current display view of the terminal, where the voice instruction description information includes the voice instruction list and the configuration information of each voice instruction in the list, and the voice instructions are set to describe the voice operations that can be performed on elements in the view.
- the voice instruction description information acquired in the voice and instruction information acquisition module 510 further includes a voice tag, where the voice tag is set to describe voice-operable element information on the view.
- the semantic recognition module 520 is configured to perform semantic recognition on the user's voice information according to the view description information of the voice-operable elements, and obtain the user's operation intention.
- the instruction sequence determination module 530 is configured to locate an instruction sequence that matches the user's operation intention from the voice instruction list according to the voice instruction description information of the voice-operable element.
- The instruction sequence determination module 530 is specifically configured to:
- locate, from the voice instruction list according to the voice instruction description information, an instruction sequence matching the user's operation intention, where the instruction sequence includes the ID of at least one voice instruction and the key value in its configuration information.
- The instruction issuing module 540 is configured to issue the located instruction sequence to the terminal for execution.
- the semantic recognition module 520 includes a query text determination unit, a text label extraction unit, and an operation intention determination unit, where:
- the query text determining unit is configured to perform voice recognition on the user's voice information according to the view description information of the voice-operable element to obtain a corresponding query text;
- the text label extraction unit is configured to extract a text label of a speech-operable element from view description information of the speech-operable element, where the text label includes a type and attribute of the speech-operable element;
- the operation intention determining unit is set to use a pre-trained annotation model to semantically annotate the query text according to the extracted text labels, and obtain the semantic annotation result of the query text, that is, the operation intention of the user.
- the query text determination unit includes an acoustic feature prediction subunit and a text generation subunit, where:
- An acoustic feature prediction subunit configured to predict an acoustic feature of an audio signal of a user's voice information using a pre-trained acoustic model
- the text generation subunit is set to use a pre-trained language model to dynamically decode the predicted acoustic features based on the view description information of the voice-operable elements to generate corresponding query text.
- the technical solution of this embodiment is implemented based on a standardized voice programming language.
- the voice information is semantically recognized to obtain the user's operation intention; further, an instruction sequence matching the user's operation intention is located from the voice instruction list; finally, the instruction sequence is issued to the terminal for execution.
- This embodiment solves the problems of high coupling between the voice interaction function and the business logic of the product, the lack of uniformity and poor generality in the development of the voice interaction function.
- The developer only needs to configure the voice instruction description information for the voice-operable elements on the view, which decouples the voice interaction function from the product's business logic and makes separate maintenance convenient.
- The development of the voice interaction function is thereby unified and standardized, which improves the generality of the voice interaction function and allows it to be quickly extended to different application scenarios.
- FIG. 6 is a schematic structural diagram of a view-based voice interaction device provided in Embodiment 6 of the present application, which can be configured in a terminal. This embodiment is applicable to a case where view-based voice interaction is implemented.
- the view-based voice interaction device provided in the embodiment of the present application can execute the view-based voice interaction method applied to a terminal provided in any embodiment of the present application, and has corresponding function modules and beneficial effects of the execution method.
- the device specifically includes a voice and instruction information sending module 610 and an instruction sequence execution module 620, where:
- the voice and instruction information sending module 610 is configured to send the monitored voice information of the user and the voice instruction description information of the voice-operable elements in the current display view of the terminal to the server, where the voice instruction description information includes the voice instruction list and the configuration information of each voice instruction in the list, and the voice instructions are set to describe the voice operations that can be performed on elements in the view.
- the voice instruction description information sent by the voice and instruction information sending module 610 further includes a voice tag, where the voice tag is set to describe voice-operable element information on the view.
- the instruction sequence execution module 620 is configured to receive an instruction sequence determined by the server based on user voice information and voice instruction description information, and execute instruction processing logic corresponding to the voice instruction in the instruction sequence.
- the instruction sequence execution module 620 includes a receiving unit and an execution unit, where:
- a receiving unit configured to receive an instruction sequence determined by the server based on the user's voice information and voice instruction description information
- the execution unit is configured to execute instruction processing logic corresponding to a voice instruction in the received instruction sequence.
- In an embodiment, the instruction sequence received by the instruction sequence execution module 620 includes the ID of at least one voice instruction and the key value in its configuration information.
- In an embodiment, the execution unit includes a first execution subunit and a second execution subunit, where:
- a first execution subunit configured to execute a corresponding instruction processing logic according to the voice instruction ID and the key value if the instruction sequence includes an ID of the voice instruction and a key value in its configuration information
- the second execution subunit is configured to, if the instruction sequence includes two or more voice instruction IDs and the key values in their configuration information, determine the target voice instruction in the sequence by interacting with the user through the terminal, and execute the corresponding instruction processing logic according to the target voice instruction's ID and key value.
- The technical solution of this embodiment sends the monitored voice information of the user and the voice instruction description information of the voice-operable elements in the current display view of the terminal to the server, then receives the instruction sequence determined by the server based on the user's voice information and the voice instruction description information, and executes the corresponding processing logic.
- This solves the problems of high coupling between the voice interaction function and the product's business logic and of the lack of uniformity and poor generality in voice interaction development; it decouples the voice interaction function from the product's business logic and standardizes the voice interaction function, improving its versatility so that it can be quickly extended to different application scenarios.
- FIG. 7 is a schematic structural diagram of a server provided in Embodiment 7 of the present application.
- FIG. 7 illustrates a block diagram of an exemplary server 712 suitable for use in implementing embodiments of the present application.
- the server 712 shown in FIG. 7 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
- the server 712 is expressed in the form of a general-purpose server.
- the components of the server 712 may include, but are not limited to, one or more processors 716, a storage device 728, and a bus 718 connecting different system components (including the storage device 728 and the processor 716).
- The bus 718 represents one or more of several types of bus structures, including a storage device bus or storage device controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
- These architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
- the server 712 typically includes a variety of computer system-readable media. These media can be any available media that can be accessed by the server 712, including volatile and non-volatile media, removable and non-removable media.
- the storage device 728 may include a computer system readable medium in the form of a volatile memory, such as a Random Access Memory (RAM) 730 and / or a cache memory 732.
- the server 712 may further include other removable / non-removable, volatile / nonvolatile computer system storage media.
- the storage system 734 may be configured to read and write non-removable, non-volatile magnetic media (not shown in FIG. 7 and is commonly referred to as a “hard drive”).
- each drive may be connected to the bus 718 through one or more data medium interfaces.
- the storage device 728 may include at least one program product having a set (for example, at least one) of program modules configured to perform the functions of the embodiments of the present application.
- a program / utility tool 740 having a set (at least one) of program modules 742 may be stored in, for example, a storage device 728.
- Such program modules 742 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
- the program module 742 generally performs functions and / or methods in the embodiments described in this application.
- The server 712 may also communicate with one or more external devices 714 (such as a keyboard, a pointing terminal, a display 724, etc.), with one or more terminals that enable a user to interact with the server 712, and/or with any terminal (such as a network card, a modem, etc.) that enables the server 712 to communicate with one or more other computing terminals. Such communication may occur through an input/output (I/O) interface 722.
- The server 712 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 720.
- As shown in FIG. 7, the network adapter 720 communicates with the other modules of the server 712 through the bus 718. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the server 712, including but not limited to: microcode, terminal drivers, redundant processors, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, and data backup storage systems.
- the processor 716 executes various functional applications and data processing by running a program stored in the storage device 728, for example, implementing a view-based voice interaction method applied to a server provided in the embodiment of the present application.
- the method includes:
- obtaining the user's voice information and the voice instruction description information of voice-operable elements in the current display view of the terminal, where the voice instruction description information includes the voice instruction list and the configuration information of each voice instruction in the list, and the voice instruction is set to describe the voice operations that can be performed on elements in the view;
- performing semantic recognition on the voice information according to the view description information of the voice-operable elements to obtain the user's operation intention;
- locating, from the voice instruction list according to the voice instruction description information of the voice-operable elements, the instruction sequence that matches the user's operation intention; and
- issuing the instruction sequence to the terminal for execution.
- FIG. 8 is a schematic structural diagram of a terminal provided in Embodiment 8 of the present application.
- FIG. 8 shows a block diagram of an exemplary terminal 812 suitable for implementing the embodiments of the present application.
- the terminal 812 shown in FIG. 8 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
- the terminal 812 is expressed in the form of a universal terminal.
- the components of the terminal 812 may include, but are not limited to, one or more processors 816, a storage device 828, and a bus 818 connecting different system components (including the storage device 828 and the processor 816).
- The bus 818 represents one or more of several types of bus structures, including a storage device bus or storage device controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
- These architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
- the terminal 812 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by the terminal 812, including volatile and non-volatile media, removable and non-removable media.
- the storage device 828 may include a computer system-readable medium in the form of a volatile memory, such as a Random Access Memory (RAM) 830 and / or a cache memory 832.
- the terminal 812 may further include other removable / non-removable, volatile / nonvolatile computer system storage media.
- the storage system 834 may be configured to read and write non-removable, non-volatile magnetic media (not shown in FIG. 8 and is commonly referred to as a “hard drive”).
- A disk drive configured to read from and write to a removable non-volatile magnetic disk (such as a "floppy disk") may also be provided, as well as an optical disk drive configured to read from or write to a removable non-volatile optical disk, such as a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc Read-Only Memory (DVD-ROM), or other optical media.
- each drive may be connected to the bus 818 through one or more data medium interfaces.
- the storage device 828 may include at least one program product having a set (for example, at least one) of program modules configured to perform the functions of the embodiments of the present application.
- a program / utility tool 840 having a set (at least one) of program modules 842 may be stored in, for example, a storage device 828.
- Such program modules 842 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
- the program module 842 generally performs functions and / or methods in the embodiments described in this application.
- The terminal 812 may also communicate with one or more external devices 814 (such as a keyboard, a pointing terminal, a display 824, etc.), with one or more terminals that enable a user to interact with the terminal 812, and/or with any terminal (such as a network card, a modem, etc.) that enables the terminal 812 to communicate with one or more other computing terminals. Such communication may occur through an input/output (I/O) interface 822.
- The terminal 812 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 820.
- As shown in FIG. 8, the network adapter 820 communicates with the other modules of the terminal 812 through the bus 818. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the terminal 812, including but not limited to: microcode, terminal drivers, redundant processors, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, and data backup storage systems.
- the processor 816 executes various functional applications and data processing by running a program stored in the storage device 828, for example, implementing a view-based voice interaction method applied to a terminal provided in the embodiment of the present application.
- the method includes:
- sending the monitored voice information of the user and the voice instruction description information of voice-operable elements in the current display view of the terminal to the server, where the voice instruction description information includes the voice instruction list and the configuration information of each voice instruction in the list, and the voice instruction is set to describe the voice operations that can be performed on elements in the view;
- receiving the instruction sequence determined by the server according to the voice information and the voice instruction description information, and executing the instruction processing logic corresponding to the voice instructions in the instruction sequence.
- Embodiment 9 of the present application further provides a computer-readable storage medium on which a computer program is stored.
- when the program is executed by a processor, the view-based voice interaction method applied to a server as provided in the embodiments of the present application is implemented.
- the method includes:
- obtaining the user's voice information and the voice instruction description information of voice-operable elements in the current display view of the terminal, where the voice instruction description information includes the voice instruction list and the configuration information of each voice instruction in the list, and the voice instruction is set to describe the voice operations that can be performed on elements in the view;
- performing semantic recognition on the voice information according to the view description information of the voice-operable elements to obtain the user's operation intention;
- locating, from the voice instruction list according to the voice instruction description information of the voice-operable elements, the instruction sequence that matches the user's operation intention; and
- issuing the instruction sequence to the terminal for execution.
- the computer storage medium in the embodiments of the present application may adopt any combination of one or more computer-readable media.
- the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
- the computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.
- a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
- the computer-readable signal medium may include a data signal propagated in baseband or transmitted as part of a carrier wave, which carries a computer-readable program code. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- the computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; such a computer-readable medium may send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- the computer program code for performing the operations of this application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
- the program code can execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or terminal.
- in the case involving a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider).
- An embodiment of the present application further provides another computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, a view-based voice interaction method applied to a terminal may be implemented.
- the method includes:
- sending monitored voice information of a user and voice instruction description information of voice-operable elements in the view currently displayed by the terminal to a server, where a voice instruction is set to describe a voice operation executable on an element in the view, and receiving and executing the instruction sequence determined by the server;
- the computer program of the computer-readable storage medium provided in the embodiments of the present application is not limited to the method operations described above and can also perform related operations of the view-based voice interaction method applied to a terminal provided in any embodiment of the present application. For a description of the storage medium, refer to the explanation in Embodiment 9.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Artificial Intelligence (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- User Interface Of Digital Computer (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A view-based voice interaction method, including: acquiring voice information of a user and voice instruction description information of voice-operable elements in a view currently displayed by a terminal; performing semantic recognition on the user's voice information according to view description information of the voice-operable elements to obtain the user's operation intention; locating, according to the voice instruction description information, an instruction sequence matching the user's operation intention from a voice instruction list; and delivering the instruction sequence to the terminal for execution. A view-based voice interaction apparatus, server, terminal, and medium are also provided.
Description
This application claims priority to Chinese Patent Application No. 201810501073.7, filed with the Chinese Patent Office on May 23, 2018, the entire contents of which are incorporated herein by reference.
The embodiments of the present application relate to the field of computer technology, for example to a view-based voice interaction method, apparatus, server, terminal, and medium.
The development of artificial intelligence has made voice interaction a highly competitive mode of interaction. For Internet products, combining traditional interaction modes with voice interaction can bring users a better experience.
However, existing voice interaction methods have the following drawbacks:
1) The voice interaction function is tightly coupled to the product's business logic, so the two cannot be maintained separately;
2) Every developer has to attend to the entire voice interaction pipeline and implement the relevant details and processes on their own, which makes it difficult to unify the functions implemented by different voice interaction programs;
3) The voice interaction development process cannot be standardized, so voice interaction cannot be quickly extended to different application scenarios.
Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.
The embodiments of the present application provide a view-based voice interaction method, apparatus, server, terminal, and medium, to solve the problems that the voice interaction function is tightly coupled to the product's business logic, that voice interaction function development lacks uniformity, and that its generality is poor.
An embodiment of the present application provides a view-based voice interaction method applied to a server, the method including:
acquiring voice information of a user and voice instruction description information of voice-operable elements in a view currently displayed by a terminal, where the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view;
performing semantic recognition on the voice information according to view description information of the voice-operable elements to obtain the user's operation intention;
locating, according to the voice instruction description information, an instruction sequence matching the operation intention from the voice instruction list; and
delivering the instruction sequence to the terminal for execution.
An embodiment of the present application further provides a view-based voice interaction method applied to a terminal, the method including:
sending monitored voice information of a user and voice instruction description information of voice-operable elements in a view currently displayed by the terminal to a server, where the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view; and
receiving, from the server, an instruction sequence determined according to the voice information and the voice instruction description information, and executing the instruction processing logic corresponding to the voice instructions in the instruction sequence.
An embodiment of the present application further provides a view-based voice interaction apparatus configured on a server, the apparatus including:
a voice and instruction information acquisition module, set to acquire voice information of a user and voice instruction description information of voice-operable elements in a view currently displayed by a terminal, where the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view;
a semantic recognition module, set to perform semantic recognition on the voice information according to view description information of the voice-operable elements to obtain the user's operation intention;
an instruction sequence determination module, set to locate, according to the voice instruction description information, an instruction sequence to be executed that matches the operation intention from the voice instruction list; and
an instruction delivery module, set to deliver the instruction sequence to be executed to the terminal for execution.
An embodiment of the present application further provides a view-based voice interaction apparatus configured on a terminal, the apparatus including:
a voice and instruction information sending module, set to send monitored voice information of a user and voice instruction description information of voice-operable elements in a view currently displayed by the terminal to a server, where the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view; and
an instruction sequence execution module, set to receive, from the server, an instruction sequence determined according to the voice information and the voice instruction description information, and to execute the instruction processing logic corresponding to the voice instructions in the instruction sequence.
An embodiment of the present application further provides a server, including:
one or more processors; and
a storage device, set to store one or more programs,
where, when the one or more programs are executed by the one or more processors, the one or more processors implement the view-based voice interaction method applied to a server according to any embodiment of the present application.
An embodiment of the present application further provides a terminal, including:
one or more processors; and
a storage device, set to store one or more programs,
where, when the one or more programs are executed by the one or more processors, the one or more processors implement the view-based voice interaction method applied to a terminal according to any embodiment of the present application.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the view-based voice interaction method applied to a server according to any embodiment of the present application.
An embodiment of the present application further provides another computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the view-based voice interaction method applied to a terminal according to any embodiment of the present application.
Other aspects will become apparent upon reading and understanding the drawings and the detailed description.
FIG. 1 is a flowchart of a view-based voice interaction method provided in Embodiment 1 of the present application;
FIG. 2 is a flowchart of a view-based voice interaction method provided in Embodiment 2 of the present application;
FIG. 3 is a flowchart of a view-based voice interaction method provided in Embodiment 3 of the present application;
FIG. 4 is a flowchart of a view-based voice interaction method provided in Embodiment 4 of the present application;
FIG. 5 is a structural schematic diagram of a view-based voice interaction apparatus provided in Embodiment 5 of the present application;
FIG. 6 is a structural schematic diagram of a view-based voice interaction apparatus provided in Embodiment 6 of the present application;
FIG. 7 is a structural schematic diagram of a server provided in Embodiment 7 of the present application;
FIG. 8 is a structural schematic diagram of a terminal provided in Embodiment 8 of the present application.
The present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are set only to explain the present application, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present application rather than the entire structure.
Embodiment 1
FIG. 1 is a flowchart of a view-based voice interaction method provided in Embodiment 1 of the present application. This embodiment is applicable to implementing view-based voice interaction on a server. The method may be performed by a view-based voice interaction apparatus, which may be implemented in software and/or hardware and may be integrated in the server. As shown in FIG. 1, the method specifically includes:
S110: acquire voice information of a user and voice instruction description information of voice-operable elements in the view currently displayed by a terminal, where the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view.
The view in this embodiment includes a view on the terminal that can be operated by the user's voice. The elements displayed on the view include voice-operable elements and non-voice-operable elements. A voice instruction (voice-action) is therefore directed at the elements in the view that can be operated by voice, and the voice instruction is the core part that determines whether a view element can be operated.
The voice instruction description information is voice interaction configuration information preset by developers, based on a standardized voice programming language, according to the operations corresponding to the view elements of the terminal. In a view, every voice-operable element has a corresponding voice instruction and related configuration information. The voice programming language is a computer programming language developed specifically in this embodiment to standardize and generalize the voice interaction function. Its main purposes are: to separate the voice interaction function from the view presentation logic, simplify the programming complexity of the voice interaction function, and generalize the voice interaction flow and the code logic of the voice interaction function; and, by encapsulating the core voice interaction technology, to provide product developers with a set of specifications and a basic framework, that is, a general processing flow based on a simple and practical high-level Application Programming Interface (API), so that product developers can quickly add rich voice interaction functions on views such as html views, xml views, or jsx views.
A voice instruction in the voice instruction description information exists as an attribute of a view element, describes the voice operations the user can perform, and can have its functions extended through scripts. Voice instructions are also generic and can be flexibly combined with components in the view. The related configuration information of a voice instruction can be configured through a voice attribute (voice-config). The voice instruction list in the voice instruction description information refers to all the voice instructions on the currently displayed view, which can be collected and organized in the form of a list.
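As a minimal illustration of how the voice instruction list might be collected from such a view, the following TypeScript sketch walks the DOM and gathers every element carrying a voice-action attribute. The attribute names voice-action, voice-config, and voice-label follow the terms used in this text; the VoiceInstruction shape and the function itself are illustrative assumptions, not the actual implementation of the voice programming language.

```typescript
// Sketch only: attribute names follow this text; data shapes are assumptions.
interface VoiceInstruction {
  id: string;                      // unique identifier of the instruction
  action: string;                  // e.g. "submit", "select", "listchange"
  config: Record<string, string>;  // parsed from the voice-config attribute
  label?: string;                  // optional voice-label for disambiguation
}

// Walk the currently displayed view and collect its voice instruction list.
function collectVoiceInstructions(root: ParentNode): VoiceInstruction[] {
  const instructions: VoiceInstruction[] = [];
  root.querySelectorAll("[voice-action]").forEach((el, index) => {
    instructions.push({
      id: `vi-${index}`,
      action: el.getAttribute("voice-action") ?? "",
      config: JSON.parse(el.getAttribute("voice-config") ?? "{}"),
      label: el.getAttribute("voice-label") ?? undefined,
    });
  });
  return instructions;
}
```

The resulting list, together with the per-instruction configuration, corresponds to the voice instruction description information that the terminal sends to the server.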
S120: perform semantic recognition on the user's voice information according to the view description information of the voice-operable elements, to obtain the user's operation intention.
The view description information of an element includes presentation scene information related to the concrete composition of the view, such as element names, text labels, and the coordinate distribution of elements on the view. The server performs semantic recognition on the user's voice information according to the view description information of the elements, so it can match key information in the user's voice against the elements in the view and obtain a user operation intention consistent with the currently displayed view.
S130: locate, according to the voice instruction description information of the voice-operable elements, an instruction sequence matching the user's operation intention from the voice instruction list.
Once the user's operation intention is determined, the instruction sequence to be executed can be located by matching the operation intention against the voice instruction description information of the voice-operable elements.
On the basis of the above technical solution, optionally, the voice instruction description information of the voice-operable elements in the currently displayed view acquired by the server further includes voice labels, where a voice label is set to describe information of a voice-operable element on the view.
Voice labels can be set to assist in recognizing and understanding the view content and in finding the corresponding voice instruction more accurately. Depending on the complexity of the view layout, developers of the voice programming language can set voice labels adaptively. For example, for a simple view where voice instructions are in a definite one-to-one relationship with the operations in the view, no voice label needs to be set; for a complex view, where identical voice instructions may in fact correspond to different operable elements in the view, voice labels are needed. For example, when a user buys tickets by voice and the user's voice information is "buy a ticket from place X to place Y", then whether for a train ticket or a plane ticket, the view requires entering the origin and destination in the address input boxes, selecting a departure time in the time box, and so on, and the voice instructions corresponding to these operations are identical; a voice label can then be used to distinguish them. When the user says "buy a plane ticket", the concrete voice instruction to execute, buying a plane ticket from place X to place Y, can be located according to the voice label of the operable element corresponding to plane ticket purchase.
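Continuing the sketch above, the plane-ticket example could be resolved roughly as follows; the disambiguateByLabel helper is an assumption, the only grounded idea being that a voice-label selects among instructions sharing the same action.

```typescript
// Assumed continuation of the earlier sketch: when several elements expose
// the same voice-action (e.g. the train-ticket and plane-ticket forms both
// expose "select"), the voice-label picks out the element the intent means.
function disambiguateByLabel(
  candidates: VoiceInstruction[],
  action: string,
  intentText: string,
): VoiceInstruction | undefined {
  return candidates
    .filter((c) => c.action === action)
    .find((c) => c.label !== undefined && intentText.includes(c.label));
}
```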
In addition, the same voice instruction may correspond to different operations in different views. For example, view B and view K both have a click operation whose voice instruction is "submit"; in view B, "submit" corresponds to a pause operation, while in view K it corresponds to a list selection operation. In this case, developers can, based on the voice programming language proposed in this embodiment, configure voice labels and add the corresponding voice instructions during voice interaction development, thereby distinguishing the functions of the same voice instruction in different views without developing the voice interaction function separately for view B and view K. This reduces the difficulty of voice interaction development, increases the generality of voice interaction, and allows quick extension to different application scenarios.
It should be noted that, at present, voice interaction is usually implemented by semantically recognizing the voice information input by the user and matching it against information of the controllable control objects on the displayed page to trigger the corresponding page operation. Those controllable control objects are not turned into voice instructions, and no standardized voice programming language has been formed. In particular, for the data description of the voice-operable elements of a view, front-end engineers and strategy engineers have to implement the corresponding data and voice operation content in code one element at a time, which makes subsequent upgrades and iterations very complicated.
In this embodiment, by contrast, the voice information input by the user is matched against voice instruction description information obtained by turning view elements into voice instructions, where the voice instruction description information consists of voice instructions and their configuration information set on the basis of the standardized voice programming language. In effect, during instruction matching, the server maps the voice information input by the user directly onto the concrete instructions and instruction configuration information into which the view elements have been turned. Developers do not need to handle the components in the view specifically; they only need to attend to the voice instructions and instruction configuration information corresponding to the voice-operable elements on the view. This decouples the voice interaction function from the product's business logic and makes separate maintenance convenient. Evidently, the usual solutions do not achieve this decoupling: in conventional voice interaction development, developers still have to handle the view components for each voice interaction function they implement.
S140: deliver the located instruction sequence to the terminal for execution.
The server delivers the instruction sequence matching the user's operation intention to the terminal, and the terminal performs the corresponding operations according to the received instruction sequence, fulfilling the user's request.
The technical solution of this embodiment is implemented on the basis of a standardized voice programming language. It first acquires the user's voice information and the voice instruction description information of the voice-operable elements in the view currently displayed by the terminal; performs semantic recognition on the user's voice information according to the view description information of the voice-operable elements to obtain the user's operation intention; then locates an instruction sequence matching the user's operation intention from the voice instruction list; and finally delivers that instruction sequence to the terminal for execution. This embodiment solves the problems that the voice interaction function is tightly coupled to the product's business logic and that voice interaction function development lacks uniformity and generality. During voice interaction development, developers only need to configure the voice instruction description information for the voice-operable elements on the view, in particular adding voice instructions in the form of labels. This decouples the voice interaction function from the product's business logic for separate maintenance, unifies and standardizes voice interaction development, and thereby improves the generality of the voice interaction function, allowing quick extension to different application scenarios.
Embodiment 2
FIG. 2 is a flowchart of a view-based voice interaction method provided in Embodiment 2 of the present application. This embodiment is further optimized on the basis of the above embodiment. As shown in FIG. 2, the method specifically includes:
S210: acquire voice information of a user and voice instruction description information of the voice-operable elements in the view currently displayed by the terminal.
S220: perform speech recognition on the user's voice information according to the view description information of the voice-operable elements, to obtain the corresponding query text.
Optionally, performing speech recognition on the user's voice information according to the view description information of the voice-operable elements to obtain the corresponding query text includes:
predicting acoustic features of the audio signal of the user's voice information using a pretrained acoustic model; and
dynamically decoding the predicted acoustic features based on the view description information of the voice-operable elements using a pretrained language model, to generate the corresponding query text.
Because of the richness of natural language, homophones with different meanings often occur. Dynamically decoding the acoustic features in combination with the view description information, that is, in combination with the view structure and the relationships among the elements in the view, makes it possible to recognize the query text corresponding to the voice information in a targeted way and thus identify the user's intention more precisely.
The server may generate the query text corresponding to the user's voice information through feature prediction and dynamic decoding using an acoustic model and a language model, or it may obtain the query text using other speech recognition methods in the art; this embodiment places no restriction on this. Acoustic models include, but are not limited to, hidden Markov models, and dynamic decoding may also be implemented with a speech decoder.
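The two-stage pipeline described above could be wired together as in the following schematic sketch. AcousticModel and LanguageModel are placeholder interfaces, since the embodiment fixes no concrete API (hidden Markov models are named only as one possible acoustic model); every name here is an assumption.

```typescript
// Placeholder interfaces: the embodiment only requires that some pretrained
// acoustic model predicts acoustic features and a language model decodes them.
interface AcousticModel {
  predictFeatures(audio: Float32Array): number[][];
}
interface LanguageModel {
  // Dynamic decoding biased by the view description information (element
  // names, text labels, on-screen coordinates), so that homophones resolve
  // toward what is actually displayed on the view.
  decode(features: number[][], viewDescription: string[]): string;
}

function recognizeQueryText(
  audio: Float32Array,
  viewDescription: string[],
  acoustic: AcousticModel,
  language: LanguageModel,
): string {
  const features = acoustic.predictFeatures(audio);
  return language.decode(features, viewDescription);
}
```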
S230: extract text labels of the voice-operable elements from the view description information of the voice-operable elements, where a text label includes the type and attributes of a voice-operable element.
Depending on how the view is constructed, view elements may be of multiple types, and text labels make it possible to distinguish them. While performing speech recognition on the user's voice information, the server can simultaneously extract the text labels of the elements in order to semantically annotate the query text corresponding to the user's voice information, and thereby better understand the user's intention in combination with the content displayed in the view.
It should be noted that the execution order of operations S220 and S230 is not limited, as long as the query text corresponding to the user's voice information and the text labels of the elements have been obtained before semantic annotation is performed.
S240: semantically annotate the query text according to the extracted text labels using a pretrained annotation model, to obtain the semantic annotation result of the query text, which serves as the user's operation intention.
The acoustic model, the language model, and the annotation model can be updated periodically during semantic recognition to ensure its accuracy. Once the server has obtained the user's operation intention with the annotation model, it can determine the voice instruction from the voice instruction list.
S250: locate, from the voice instruction list, an instruction sequence matching the user's operation intention according to a preconfigured correspondence between semantics and voice instructions and the voice instruction description information, where the instruction sequence includes the ID of at least one voice instruction and the key value in its configuration information.
Developers of the voice programming language preconfigure the correspondence between semantics and voice instructions during development of the voice function. Once the user's operation intention is determined, this correspondence and the voice instruction description information of the voice-operable elements on the view are used to progressively locate the voice instructions, forming an instruction sequence that includes the IDs of the voice instructions and the key values of the instruction configuration information. The ID of a voice instruction is its unique identifier and can, for example, be used to identify the position of each voice instruction in the sequence, while the key value identifies the concrete execution characteristic corresponding to the voice instruction. For example, a submit operation covers two cases, confirm and cancel; the corresponding voice instruction is "submit", and the key value in the instruction configuration information is confirm (OK) or cancel. For a playback operation of fast-forwarding to 2 minutes 30 seconds, the corresponding voice instruction is "fast-forward" and the key value in the instruction configuration information is 2 minutes 30 seconds. As another example, when the content displayed on the current view is the movie rankings of actor A, with ranking categories of hottest, latest, and top-rated, the voice instructions in the view include three list-switch (listchange) instructions whose configuration key values are, respectively, hottest, latest, and top-rated.
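Concretely, the submit and fast-forward examples above could be carried in an instruction sequence shaped like the following. The text only requires that each entry carry a voice instruction ID and the key value from its configuration information; the field names are assumptions.

```typescript
// Hypothetical wire format for an instruction sequence (field names assumed).
const instructionSequence = [
  // "submit" resolved to confirmation rather than cancellation
  { id: "submit-01", action: "submit", keyValue: "OK" },
  // "fast-forward" with the key value carrying the target position
  { id: "forward-02", action: "forward", keyValue: "2m30s" },
];
```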
Optionally, the process of locating the instruction sequence may include: determining a target voice instruction set from the voice instruction list using the preconfigured correspondence between semantics and voice instructions; and then locating the instruction sequence matching the user's operation intention from that target voice instruction set according to the voice instruction description information, for example the voice labels and the key values of the instruction configuration information.
Exemplarily, suppose the terminal currently displays the main interface of a music player and the user's voice input names several songs, "I want to listen to 勇气, 后来, 当爱已成往事". According to the recognized user operation intention, the server first determines that the voice instruction for the song selection operation in the current view is "select"; according to the voice label "playlist", it determines a target voice instruction set containing multiple song titles, in which each song title corresponds to a "select" voice sub-instruction. Then, according to the specific song titles in the user's voice information, the three song titles 勇气, 后来, and 当爱已成往事 are used as the key values of the configuration information of the voice instructions, and the voice instruction sequence for selecting these three songs is determined from the target voice instruction set. Alternatively, according to the voice label "playlist", the target voice instruction set need not be determined; instead, the voice instructions selecting the specific song titles can be determined one by one directly from the song titles in the user's voice information and then delivered to the terminal in the form of a list.
S260: deliver the located instruction sequence to the terminal for execution.
In the technical solution of this embodiment, the user's voice information and the voice instruction description information of the voice-operable elements in the view currently displayed by the terminal are acquired; speech recognition and then semantic annotation are performed in turn on the user's voice information according to the view description information of the voice-operable elements to obtain the user's operation intention; an instruction sequence matching the user's operation intention is then located from the voice instruction list; and finally that instruction sequence is delivered to the terminal for execution. This embodiment solves the problems that the voice interaction function is tightly coupled to the product's business logic and that voice interaction function development lacks uniformity and generality. During voice interaction development, developers only need to configure the voice instruction description information for the voice-operable elements on the view, which decouples the voice interaction function from the product's business logic for separate maintenance, unifies and standardizes voice interaction development, and improves the generality of the voice interaction function, allowing quick extension to different application scenarios.
Embodiment 3
FIG. 3 is a flowchart of a view-based voice interaction method provided in Embodiment 3 of the present application. This embodiment is applicable to implementing view-based voice interaction on a terminal, performed in cooperation with the view-based voice interaction method applied to a server in the above embodiments of the present application. The method may be performed by a view-based voice interaction apparatus, which may be implemented in software and/or hardware and may be integrated in a terminal, for example a smart terminal such as a mobile phone, a tablet computer, or a personal computer. As shown in FIG. 3, the method specifically includes:
S310: send the monitored voice information of the user and the voice instruction description information of the voice-operable elements in the view currently displayed by the terminal to the server, where the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view.
Optionally, the voice instruction description information further includes voice labels, where a voice label is set to describe information of a voice-operable element on the view.
The terminal monitors the user's voice information; specifically, it may collect the user's voice information through a microphone or an external sound collection apparatus connected to the terminal, and then send it to the server. When the terminal processor detects a voice input event of the user, it simultaneously sends the voice instruction description information of the voice-operable elements in the currently displayed view to the server. The terminal and the server can share information and data through network communication.
It should be noted that, after the terminal detects voice information, it needs to initialize the voice instructions on the currently displayed view. This flow may include three stages: voice object initialization, voice instruction initialization, and voice instruction data collection. Exemplarily, voice object initialization includes listening for the user's voice input, registering the voice object configuration, and initializing the voice objects of the view page; voice instruction initialization includes Document Object Model (DOM) parsing of the view, building the instruction configuration, and initializing the instruction configuration; voice instruction data collection includes configuring data-providing instructions, building instruction processors, and updating the data information.
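A hedged sketch of how those three stages might be strung together is given below, reusing the VoiceInstruction type and collectVoiceInstructions function from the earlier sketch; only the stage boundaries come from this text, and all function names and bodies are assumptions.

```typescript
// Hypothetical scaffolding for the three initialization stages (stubs).
function startVoiceInputListener(): void {
  // Stub: hook the microphone / voice input event of the page here.
}

function registerInstructionHandler(instruction: VoiceInstruction): void {
  // Stub: wire the instruction to the component that provides its data,
  // so the instruction can report up-to-date key values.
}

function initializeVoicePage(root: Document): VoiceInstruction[] {
  // 1. Voice object initialization: listen for voice input and register
  //    the voice object configuration of this view page.
  startVoiceInputListener();

  // 2. Voice instruction initialization: parse the view's DOM and build
  //    the per-element instruction configuration.
  const instructions = collectVoiceInstructions(root);

  // 3. Voice instruction data collection: attach data-providing handlers
  //    and update the data information.
  instructions.forEach(registerInstructionHandler);
  return instructions;
}
```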
S320: receive, from the server, the instruction sequence determined according to the user's voice information and the voice instruction description information, and execute the instruction processing logic corresponding to the voice instructions in the instruction sequence.
After the server analyzes the instructions according to the user's voice information and the voice instruction description information of the voice-operable elements and determines the matching instruction sequence, it delivers the instruction sequence to the terminal. On receiving the server's response, the terminal passes the corresponding instruction sequence to the terminal's instruction router. Based on the instruction sequence, the instruction router decides which voice instruction to execute, initializes the corresponding voice instruction, and then executes the corresponding instruction processing logic.
Optionally, executing the instruction processing logic corresponding to the voice instructions in the instruction sequence includes:
executing the instruction processing logic corresponding to the voice instructions in the instruction sequence and, during execution, acquiring the voice event corresponding to the instruction processing logic and executing that voice event, where a voice event is set to define the product logic that needs to be handled during the execution of a voice instruction.
A voice event can be set to define personalized product logic according to the concrete instruction processing logic on the view, for example the execution manner of a voice instruction or a product display effect. For example, suppose the content displayed on the current view is the hottest-movie ranking of actor A, with further ranking categories of latest and top-rated, and the view's voice instructions include three list-switch (listchange) instructions whose configuration key values are, respectively, hottest, latest, and top-rated. When the user's voice input is "I want to watch the second movie on actor A's top-rated ranking", the terminal, according to the received instruction sequence, switches the currently displayed hottest-movie ranking to the top-rated ranking and at the same time locks onto the second movie in the top-rated ranking for playback. Before playback, a voice event associated with playing the second movie can be executed, for example displaying the movie's poster with a special effect, concretely enlarging and highlighting the poster icon, and then starting playback of the movie. Setting voice events can thus make the voice interaction function more varied and engaging, giving users a better product experience.
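The terminal-side instruction router and voice events could look roughly like the following; apart from the notions of "instruction router" and "voice event", which come from this text, every name and shape here is an assumption.

```typescript
// Sketch of the instruction router: it runs the instruction logic for each
// entry in the sequence and fires any voice event bound to that logic.
type InstructionEntry = { id: string; action: string; keyValue: string };

class InstructionRouter {
  private handlers = new Map<string, (keyValue: string) => void>();
  private voiceEvents = new Map<string, (keyValue: string) => void>();

  // Register the instruction processing logic for an action.
  register(action: string, handler: (keyValue: string) => void): void {
    this.handlers.set(action, handler);
  }

  // Bind a voice event: extra product logic run during instruction execution
  // (e.g. enlarging and highlighting a movie poster before playback).
  onVoiceEvent(action: string, event: (keyValue: string) => void): void {
    this.voiceEvents.set(action, event);
  }

  execute(sequence: InstructionEntry[]): void {
    for (const instr of sequence) {
      this.handlers.get(instr.action)?.(instr.keyValue);    // instruction logic
      this.voiceEvents.get(instr.action)?.(instr.keyValue); // product logic
    }
  }
}
```

In the movie example, a "listchange" handler would switch the ranking tab, while the bound voice event would apply the poster highlight before playback begins.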
In the technical solution of this embodiment, the monitored voice information of the user and the voice instruction description information of the voice-operable elements in the view currently displayed by the terminal are sent to the server; the instruction sequence determined by the server according to the user's voice information and the voice instruction description information is then received, and the corresponding processing logic is executed. This solves the problems that the voice interaction function is tightly coupled to the product's business logic and that voice interaction function development lacks uniformity and generality; it decouples the voice interaction function from the product's business logic, standardizes the voice interaction function, improves its generality, and allows quick extension to different application scenarios.
Embodiment 4
FIG. 4 is a flowchart of a view-based voice interaction method provided in Embodiment 4 of the present application. This embodiment is further optimized on the basis of the above embodiment. As shown in FIG. 4, the method specifically includes:
S410: send the monitored voice information of the user and the voice instruction description information of the voice-operable elements in the view currently displayed by the terminal to the server, where the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view.
S420: receive, from the server, the instruction sequence determined according to the user's voice information and the voice instruction description information, where the instruction sequence includes the ID of at least one voice instruction and the key value in its configuration information.
S430: if the instruction sequence includes the ID of one voice instruction and the key value in its configuration information, execute the corresponding instruction processing logic according to that voice instruction ID and key value.
When the voice information input by the user is in a one-to-one correspondence with the voice instruction of a voice-operable element, a unique voice instruction can be matched from the user's current voice information, and no repeated interaction with the user is needed. For example, if the user's voice information is "confirm", the corresponding voice instruction is "submit" with key value confirm (OK), and the terminal performs the confirmation operation according to the submit instruction.
S440: if the instruction sequence includes the IDs of two or more voice instructions and the key values in their configuration information, determine the target voice instruction in the instruction sequence through interaction with the terminal, and execute the corresponding instruction processing logic according to the ID and key value of the target voice instruction.
When the voice information input by the user is in a one-to-many correspondence with the voice instructions of the voice-operable elements, the final target voice instruction needs to be determined through interaction between the user and the terminal. For example, in the currently displayed player view, if the user's voice input is "play a song", the corresponding voice instruction is "select"; according to the voice label "playlist", an instruction sequence containing playback voice instructions for multiple songs can be determined. The user then needs to input further voice information naming the song title R before the playback voice instruction for the specific song R can be determined, after which the terminal plays song R according to that voice instruction.
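The branch between S430 and S440 could be expressed roughly as follows, reusing the InstructionRouter and InstructionEntry from the earlier sketch; the askUser follow-up prompt is an assumed stand-in for the extra round of user interaction.

```typescript
// Sketch of S430/S440: a single-instruction sequence executes directly; a
// multi-instruction sequence triggers another round of user interaction.
async function handleSequence(
  router: InstructionRouter,
  sequence: InstructionEntry[],
  askUser: (options: string[]) => Promise<string>, // assumed follow-up prompt
): Promise<void> {
  if (sequence.length === 1) {
    router.execute(sequence); // S430: unique match, execute immediately
    return;
  }
  // S440: several candidates (e.g. several playable songs) — ask the user
  // which key value is meant, then execute only the matching instruction.
  const choice = await askUser(sequence.map((s) => s.keyValue));
  router.execute(sequence.filter((s) => s.keyValue === choice));
}
```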
In the technical solution of this embodiment, the monitored voice information of the user and the voice instruction description information of the voice-operable elements in the view currently displayed by the terminal are sent to the server, and the instruction sequence determined by the server according to the user's voice information and the voice instruction description information is received; finally, according to the number of instructions included in the instruction sequence, the target voice instruction is determined through interaction with the user and the corresponding processing logic is executed. This solves the problems that the voice interaction function is tightly coupled to the product's business logic and that voice interaction function development lacks uniformity and generality; it decouples the voice interaction function from the product's business logic, standardizes the voice interaction function, improves its generality, and allows quick extension to different application scenarios.
The following are embodiments of the view-based voice interaction apparatus provided in the embodiments of the present application. These apparatus embodiments belong to the same application concept as the view-based voice interaction methods of the above embodiments; for details not exhaustively described in the apparatus embodiments, reference may be made to the above embodiments of the view-based voice interaction method.
Embodiment 5
FIG. 5 is a structural schematic diagram of a view-based voice interaction apparatus provided in Embodiment 5 of the present application, which can be configured in a server; this embodiment is applicable to implementing view-based voice interaction. The view-based voice interaction apparatus provided in this embodiment can perform the view-based voice interaction method applied to a server provided in any embodiment of the present application, and has the functional modules and beneficial effects corresponding to that method. As shown in FIG. 5, the apparatus specifically includes a voice and instruction information acquisition module 510, a semantic recognition module 520, an instruction sequence determination module 530, and an instruction delivery module 540, where:
the voice and instruction information acquisition module 510 is set to acquire voice information of a user and voice instruction description information of voice-operable elements in the view currently displayed by a terminal, where the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view.
Optionally, the voice instruction description information acquired by the voice and instruction information acquisition module 510 further includes voice labels, where a voice label is set to describe information of a voice-operable element on the view.
the semantic recognition module 520 is set to perform semantic recognition on the user's voice information according to the view description information of the voice-operable elements, to obtain the user's operation intention.
the instruction sequence determination module 530 is set to locate, according to the voice instruction description information of the voice-operable elements, an instruction sequence matching the user's operation intention from the voice instruction list.
Optionally, the instruction sequence determination module 530 is specifically set to:
locate, from the voice instruction list, an instruction sequence matching the user's operation intention according to a preconfigured correspondence between semantics and voice instructions and the voice instruction description information, where the instruction sequence includes the ID of at least one voice instruction and the key value in its configuration information.
the instruction delivery module 540 is set to deliver the located instruction sequence to the terminal for execution.
On the basis of the above technical solution, optionally, the semantic recognition module 520 includes a query text determination unit, a text label extraction unit, and an operation intention determination unit, where:
the query text determination unit is set to perform speech recognition on the user's voice information according to the view description information of the voice-operable elements, to obtain the corresponding query text;
the text label extraction unit is set to extract the text labels of the voice-operable elements from the view description information of the voice-operable elements, where a text label includes the type and attributes of a voice-operable element; and
the operation intention determination unit is set to semantically annotate the query text according to the extracted text labels using a pretrained annotation model, to obtain the semantic annotation result of the query text as the user's operation intention.
Optionally, the query text determination unit includes an acoustic feature prediction subunit and a text generation subunit, where:
the acoustic feature prediction subunit is set to predict the acoustic features of the audio signal of the user's voice information using a pretrained acoustic model; and
the text generation subunit is set to dynamically decode the predicted acoustic features based on the view description information of the voice-operable elements using a pretrained language model, to generate the corresponding query text.
The technical solution of this embodiment is implemented on the basis of a standardized voice programming language. It first acquires the user's voice information and the voice instruction description information of the voice-operable elements in the view currently displayed by the terminal; performs semantic recognition on the user's voice information according to the view description information of the voice-operable elements to obtain the user's operation intention; then locates an instruction sequence matching the user's operation intention from the voice instruction list; and finally delivers that instruction sequence to the terminal for execution. This embodiment solves the problems that the voice interaction function is tightly coupled to the product's business logic and that voice interaction function development lacks uniformity and generality. During voice interaction development, developers only need to configure the voice instruction description information for the voice-operable elements on the view, which decouples the voice interaction function from the product's business logic for separate maintenance, unifies and standardizes voice interaction development, and improves the generality of the voice interaction function, allowing quick extension to different application scenarios.
Embodiment 6
FIG. 6 is a structural schematic diagram of a view-based voice interaction apparatus provided in Embodiment 6 of the present application, which can be configured in a terminal; this embodiment is applicable to implementing view-based voice interaction. The view-based voice interaction apparatus provided in this embodiment can perform the view-based voice interaction method applied to a terminal provided in any embodiment of the present application, and has the functional modules and beneficial effects corresponding to that method. As shown in FIG. 6, the apparatus specifically includes a voice and instruction information sending module 610 and an instruction sequence execution module 620, where:
the voice and instruction information sending module 610 is set to send the monitored voice information of the user and the voice instruction description information of the voice-operable elements in the view currently displayed by the terminal to the server, where the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view.
Optionally, the voice instruction description information sent by the voice and instruction information sending module 610 further includes voice labels, where a voice label is set to describe information of a voice-operable element on the view.
the instruction sequence execution module 620 is set to receive, from the server, the instruction sequence determined according to the user's voice information and the voice instruction description information, and to execute the instruction processing logic corresponding to the voice instructions in the instruction sequence.
Optionally, the instruction sequence execution module 620 includes a receiving unit and an execution unit, where:
the receiving unit is set to receive, from the server, the instruction sequence determined according to the user's voice information and the voice instruction description information; and
the execution unit is set to execute the instruction processing logic corresponding to the voice instructions in the received instruction sequence.
Optionally, the execution unit is specifically set to:
execute the instruction processing logic corresponding to the voice instructions in the instruction sequence and, during execution, acquire the voice event corresponding to the instruction processing logic and execute that voice event, where the voice event is set to define the product logic that needs to be handled during the execution of a voice instruction.
Optionally, the instruction sequence received by the instruction sequence execution module 620 includes the ID of at least one voice instruction and the key value in its configuration information;
correspondingly, the execution unit includes a first execution subunit and a second execution subunit, where:
the first execution subunit is set to, if the instruction sequence includes the ID of one voice instruction and the key value in its configuration information, execute the corresponding instruction processing logic according to that voice instruction ID and key value; and
the second execution subunit is set to, if the instruction sequence includes the IDs of two or more voice instructions and the key values in their configuration information, determine the target voice instruction in the instruction sequence through interaction with the terminal and execute the corresponding instruction processing logic according to the ID and key value of the target voice instruction.
In the technical solution of this embodiment, the monitored voice information of the user and the voice instruction description information of the voice-operable elements in the view currently displayed by the terminal are sent to the server; the instruction sequence determined by the server according to the user's voice information and the voice instruction description information is then received, and the corresponding processing logic is executed. This solves the problems that the voice interaction function is tightly coupled to the product's business logic and that voice interaction function development lacks uniformity and generality; it decouples the voice interaction function from the product's business logic, standardizes the voice interaction function, improves its generality, and allows quick extension to different application scenarios.
Embodiment 7
FIG. 7 is a structural schematic diagram of a server provided in Embodiment 7 of the present application. FIG. 7 shows a block diagram of an exemplary server 712 suitable for implementing the embodiments of the present application. The server 712 shown in FIG. 7 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in FIG. 7, the server 712 takes the form of a general-purpose server. The components of the server 712 may include, but are not limited to: one or more processors 716, a storage device 728, and a bus 718 connecting the different system components (including the storage device 728 and the processors 716).
The bus 718 represents one or more of several types of bus structures, including a storage device bus or storage device controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus structures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The server 712 typically includes a variety of computer-system-readable media. These media may be any available media accessible by the server 712, including volatile and non-volatile media and removable and non-removable media.
The storage device 728 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 730 and/or a cache memory 732. The server 712 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 734 may be set to read and write a non-removable, non-volatile magnetic medium (not shown in FIG. 7, commonly referred to as a "hard drive"). Although not shown in FIG. 7, a magnetic disk drive set to read and write a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive set to read and write a removable non-volatile optical disk (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc-Read Only Memory (DVD-ROM), or other optical media), may be provided. In these cases, each drive may be connected to the bus 718 through one or more data medium interfaces. The storage device 728 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the embodiments of the present application.
A program/utility 740 having a set (at least one) of program modules 742 may be stored in, for example, the storage device 728. Such program modules 742 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment. The program modules 742 generally perform the functions and/or methods of the embodiments described in this application.
The server 712 may also communicate with one or more external devices 714 (such as a keyboard, a pointing device, a display 724, etc.), with one or more devices that enable a user to interact with the server 712, and/or with any device (e.g., a network card, a modem, etc.) that enables the server 712 to communicate with one or more other computing devices. Such communication can be performed through an input/output (I/O) interface 722. In addition, the server 712 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 720. As shown in FIG. 7, the network adapter 720 communicates with the other modules of the server 712 through the bus 718. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the server 712, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, and data backup storage systems.
The processor 716 executes various functional applications and data processing by running programs stored in the storage device 728, for example implementing the view-based voice interaction method applied to a server provided in the embodiments of the present application. The method includes:
acquiring voice information of a user and voice instruction description information of voice-operable elements in the view currently displayed by a terminal, where the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view;
performing semantic recognition on the user's voice information according to the view description information of the voice-operable elements, to obtain the user's operation intention;
locating, according to the voice instruction description information of the voice-operable elements, an instruction sequence matching the user's operation intention from the voice instruction list; and
delivering the located instruction sequence to the terminal for execution.
Embodiment 8
FIG. 8 is a structural schematic diagram of a terminal provided in Embodiment 8 of the present application. FIG. 8 shows a block diagram of an exemplary terminal 812 suitable for implementing the embodiments of the present application. The terminal 812 shown in FIG. 8 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in FIG. 8, the terminal 812 takes the form of a general-purpose terminal. The components of the terminal 812 may include, but are not limited to: one or more processors 816, a storage device 828, and a bus 818 connecting the different system components (including the storage device 828 and the processors 816).
The bus 818 represents one or more of several types of bus structures, including a storage device bus or storage device controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus structures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The terminal 812 typically includes a variety of computer-system-readable media. These media may be any available media accessible by the terminal 812, including volatile and non-volatile media and removable and non-removable media.
The storage device 828 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 830 and/or a cache memory 832. The terminal 812 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 834 may be set to read and write a non-removable, non-volatile magnetic medium (not shown in FIG. 8, commonly referred to as a "hard drive"). Although not shown in FIG. 8, a magnetic disk drive set to read and write a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive set to read and write a removable non-volatile optical disk (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc-Read Only Memory (DVD-ROM), or other optical media), may be provided. In these cases, each drive may be connected to the bus 818 through one or more data medium interfaces. The storage device 828 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the embodiments of the present application.
A program/utility 840 having a set (at least one) of program modules 842 may be stored in, for example, the storage device 828. Such program modules 842 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment. The program modules 842 generally perform the functions and/or methods of the embodiments described in this application.
The terminal 812 may also communicate with one or more external devices 814 (such as a keyboard, a pointing device, a display 824, etc.), with one or more devices that enable a user to interact with the terminal 812, and/or with any device (e.g., a network card, a modem, etc.) that enables the terminal 812 to communicate with one or more other computing devices. Such communication can be performed through an input/output (I/O) interface 822. In addition, the terminal 812 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 820. As shown in FIG. 8, the network adapter 820 communicates with the other modules of the terminal 812 through the bus 818. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the terminal 812, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, and data backup storage systems.
The processor 816 executes various functional applications and data processing by running programs stored in the storage device 828, for example implementing the view-based voice interaction method applied to a terminal provided in the embodiments of the present application. The method includes:
sending the monitored voice information of the user and the voice instruction description information of the voice-operable elements in the view currently displayed by the terminal to the server, where the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view; and
receiving, from the server, the instruction sequence determined according to the user's voice information and the voice instruction description information, and executing the instruction processing logic corresponding to the voice instructions in the instruction sequence.
Embodiment 9
Embodiment 9 of the present application further provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, it implements the view-based voice interaction method applied to a server provided in the embodiments of the present application. The method includes:
acquiring voice information of a user and voice instruction description information of voice-operable elements in the view currently displayed by a terminal, where the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view;
performing semantic recognition on the user's voice information according to the view description information of the voice-operable elements, to obtain the user's operation intention;
locating, according to the voice instruction description information of the voice-operable elements, an instruction sequence matching the user's operation intention from the voice instruction list; and
delivering the located instruction sequence to the terminal for execution.
The computer storage medium of the embodiments of the present application may adopt any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more conductors, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program which can be used by or in combination with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; such a computer-readable medium can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
The program code contained on a computer-readable medium may be transmitted by any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, or any suitable combination of the foregoing.
The computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or terminal. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
An embodiment of the present application further provides another computer-readable storage medium on which a computer program is stored; when executed by a processor, the program can implement a view-based voice interaction method applied to a terminal. The method includes:
sending the monitored voice information of the user and the voice instruction description information of the voice-operable elements in the view currently displayed by the terminal to the server, where the voice instruction description information includes a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view; and
receiving, from the server, the instruction sequence determined according to the user's voice information and the voice instruction description information, and executing the instruction processing logic corresponding to the voice instructions in the instruction sequence.
Of course, the computer program of the computer-readable storage medium provided in the embodiments of the present application is not limited to the method operations described above and can also perform related operations of the view-based voice interaction method applied to a terminal provided in any embodiment of the present application. For a description of the storage medium, refer to the explanation in Embodiment 9.
Note that, although the present application has been described in some detail through the above embodiments, it is not limited to those embodiments; it may include more other equivalent embodiments without departing from the concept of the present application, and the scope of the present application is determined by the appended claims.
Claims (15)
- A view-based voice interaction method, applied to a server, the method comprising: acquiring voice information of a user and voice instruction description information of voice-operable elements in a view currently displayed by a terminal, wherein the voice instruction description information comprises a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view; performing semantic recognition on the voice information according to view description information of the voice-operable elements to obtain the user's operation intention; locating, according to the voice instruction description information, an instruction sequence matching the operation intention from the voice instruction list; and delivering the instruction sequence to the terminal for execution.
- The method according to claim 1, wherein the voice instruction description information further comprises a voice label, wherein the voice label is set to describe information of a voice-operable element on the view.
- The method according to claim 1 or 2, wherein performing semantic recognition on the voice information according to the view description information of the voice-operable elements to obtain the user's operation intention comprises: performing speech recognition on the voice information according to the view description information of the voice-operable elements to obtain a corresponding query text; extracting text labels of the voice-operable elements from the view description information of the voice-operable elements, wherein a text label comprises the type and attributes of a voice-operable element; and semantically annotating the query text according to the text labels using a pretrained annotation model, to obtain a semantic annotation result of the query text as the user's operation intention.
- The method according to claim 3, wherein performing speech recognition on the voice information according to the view description information of the voice-operable elements to obtain the corresponding query text comprises: predicting acoustic features of an audio signal of the voice information using a pretrained acoustic model; and dynamically decoding the acoustic features based on the view description information of the voice-operable elements using a pretrained language model, to generate the corresponding query text.
- The method according to claim 1 or 2, wherein locating, according to the voice instruction description information, an instruction sequence matching the operation intention from the voice instruction list comprises: locating, from the voice instruction list, an instruction sequence matching the operation intention according to a preconfigured correspondence between semantics and voice instructions and the voice instruction description information, wherein the instruction sequence comprises an ID of at least one voice instruction and a key value in its configuration information.
- A view-based voice interaction method, applied to a terminal, the method comprising: sending monitored voice information of a user and voice instruction description information of voice-operable elements in a view currently displayed by the terminal to a server, wherein the voice instruction description information comprises a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view; and receiving, from the server, an instruction sequence determined according to the voice information and the voice instruction description information, and executing instruction processing logic corresponding to the voice instructions in the instruction sequence.
- The method according to claim 6, wherein the voice instruction description information further comprises a voice label, wherein the voice label is set to describe information of a voice-operable element on the view.
- The method according to claim 6 or 7, wherein the instruction sequence comprises an ID of at least one voice instruction and a key value in its configuration information; and correspondingly, executing the instruction processing logic corresponding to the voice instructions in the instruction sequence comprises: if the instruction sequence comprises the ID of one voice instruction and the key value in its configuration information, executing the corresponding instruction processing logic according to the ID and the key value; and if the instruction sequence comprises the IDs of two or more voice instructions and the key values in their configuration information, determining a target voice instruction in the instruction sequence through interaction with the terminal, and executing the corresponding instruction processing logic according to the ID and key value of the target voice instruction.
- The method according to claim 6 or 7, wherein executing the instruction processing logic corresponding to the voice instructions in the instruction sequence comprises: executing the instruction processing logic corresponding to the voice instructions in the instruction sequence and, during execution, acquiring a voice event corresponding to the instruction processing logic and executing the voice event, wherein the voice event is set to define product logic that needs to be handled during execution of a voice instruction.
- A view-based voice interaction apparatus, configured on a server, the apparatus comprising: a voice and instruction information acquisition module, set to acquire voice information of a user and voice instruction description information of voice-operable elements in a view currently displayed by a terminal, wherein the voice instruction description information comprises a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view; a semantic recognition module, set to perform semantic recognition on the voice information according to view description information of the voice-operable elements to obtain the user's operation intention; an instruction sequence determination module, set to locate, according to the voice instruction description information, an instruction sequence to be executed that matches the operation intention from the voice instruction list; and an instruction delivery module, set to deliver the instruction sequence to be executed to the terminal for execution.
- A view-based voice interaction apparatus, configured on a terminal, the apparatus comprising: a voice and instruction information sending module, set to send monitored voice information of a user and voice instruction description information of voice-operable elements in a view currently displayed by the terminal to a server, wherein the voice instruction description information comprises a voice instruction list and configuration information of each voice instruction in the voice instruction list, and a voice instruction is set to describe a voice operation executable on an element in the view; and an instruction sequence execution module, set to receive, from the server, an instruction sequence determined according to the voice information and the voice instruction description information, and to execute instruction processing logic corresponding to the voice instructions in the instruction sequence.
- A server, comprising: one or more processors; and a storage device, set to store one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the view-based voice interaction method according to any one of claims 1 to 5.
- A terminal, comprising: one or more processors; and a storage device, set to store one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the view-based voice interaction method according to any one of claims 6 to 9.
- A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the view-based voice interaction method according to any one of claims 1 to 5.
- A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the view-based voice interaction method according to any one of claims 6 to 9.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2020502486A JP6952184B2 (ja) | 2018-05-23 | 2019-01-18 | View-based voice interaction method, apparatus, server, terminal and medium |
US16/888,426 US11727927B2 (en) | 2018-05-23 | 2020-05-29 | View-based voice interaction method, apparatus, server, terminal and medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810501073.7A CN108877791B (zh) | 2018-05-23 | 2018-05-23 | View-based voice interaction method, apparatus, server, terminal and medium |
CN201810501073.7 | 2018-05-23 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/888,426 Continuation US11727927B2 (en) | 2018-05-23 | 2020-05-29 | View-based voice interaction method, apparatus, server, terminal and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019223351A1 true WO2019223351A1 (zh) | 2019-11-28 |
Family
ID=64333119
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/072339 WO2019223351A1 (zh) | 2018-05-23 | 2019-01-18 | 基于视图的语音交互方法、装置、服务器、终端和介质 |
Country Status (4)
Country | Link |
---|---|
US (1) | US11727927B2 (zh) |
JP (1) | JP6952184B2 (zh) |
CN (1) | CN108877791B (zh) |
WO (1) | WO2019223351A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111767021A (zh) * | 2020-06-28 | 2020-10-13 | 广州小鹏车联网科技有限公司 | Voice interaction method, vehicle, server, system and storage medium |
EP3958110A1 (en) * | 2020-08-17 | 2022-02-23 | Beijing Xiaomi Pinecone Electronics Co., Ltd. | Speech control method and apparatus, terminal device, and storage medium |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108877791B (zh) | 2018-05-23 | 2021-10-08 | 百度在线网络技术(北京)有限公司 | View-based voice interaction method, apparatus, server, terminal and medium |
CN111383631B (zh) * | 2018-12-11 | 2024-01-23 | 阿里巴巴集团控股有限公司 | Voice interaction method, apparatus and system |
CN111415656B (zh) * | 2019-01-04 | 2024-04-30 | 上海擎感智能科技有限公司 | Speech semantic recognition method, apparatus and vehicle |
CN111427529B (zh) * | 2019-01-09 | 2023-05-30 | 斑马智行网络(香港)有限公司 | Interaction method, apparatus, device and storage medium |
CN109947252A (zh) * | 2019-03-21 | 2019-06-28 | 百度在线网络技术(北京)有限公司 | Method and apparatus for configuring interaction functions of a smart device |
CN111857635A (zh) * | 2019-04-30 | 2020-10-30 | 阿里巴巴集团控股有限公司 | Interaction method, storage medium, operating system and device |
CN110162176B (zh) * | 2019-05-20 | 2022-04-26 | 北京百度网讯科技有限公司 | Voice instruction mining method and apparatus, terminal, and computer-readable medium |
CN110290216B (zh) * | 2019-06-28 | 2022-05-13 | 百度在线网络技术(北京)有限公司 | Monitoring execution method, instruction delivery method, apparatus, device and storage medium |
EP4002087A4 (en) * | 2019-07-19 | 2023-04-12 | LG Electronics Inc. | DISPLAY DEVICE AND ARTIFICIAL INTELLIGENCE SERVER CAPABLE OF CONTROLLING A HOME APPLIANCE VIA A USER'S VOICE |
CN112306447A (zh) * | 2019-08-30 | 2021-02-02 | 北京字节跳动网络技术有限公司 | Interface navigation method, apparatus, terminal and storage medium |
CN110660391A (zh) * | 2019-09-29 | 2020-01-07 | 苏州思必驰信息科技有限公司 | Customization method and system for voice control of large-screen terminals based on an RPA interface |
CN112817553A (zh) * | 2019-11-15 | 2021-05-18 | 阿里巴巴集团控股有限公司 | Voice interaction method, apparatus and system |
CN112309388A (zh) * | 2020-03-02 | 2021-02-02 | 北京字节跳动网络技术有限公司 | Method and apparatus for processing information |
CN113571062B (zh) * | 2020-04-28 | 2024-05-24 | 中国移动通信集团浙江有限公司 | Customer label recognition method and apparatus based on voice data, and computing device |
CN111611468B (zh) * | 2020-04-29 | 2023-08-25 | 百度在线网络技术(北京)有限公司 | Page interaction method, apparatus and electronic device |
CN111917513B (zh) * | 2020-07-29 | 2022-11-22 | 上海海洋大学 | Data interaction method between a mobile terminal and a server |
CN114255745A (zh) * | 2020-09-10 | 2022-03-29 | 华为技术有限公司 | Human-computer interaction method, electronic device and system |
CN112163086B (zh) * | 2020-10-30 | 2023-02-24 | 海信视像科技股份有限公司 | Multi-intent recognition method and display device |
CN112487142B (zh) * | 2020-11-27 | 2022-08-09 | 易联众信息技术股份有限公司 | Conversational intelligent interaction method and system based on natural language processing |
CN112579036B (zh) * | 2020-12-17 | 2024-07-19 | 南方电网数字平台科技(广东)有限公司 | Method, system, device and storage medium for implementing a voice-input report designer |
CN112860866B (zh) * | 2021-02-09 | 2023-09-19 | 北京百度网讯科技有限公司 | Semantic retrieval method, apparatus, device and storage medium |
CN112885361A (zh) * | 2021-03-01 | 2021-06-01 | 长沙克莱自动化设备有限公司 | Voice control method, apparatus, electronic device and storage medium |
CN112905149A (zh) * | 2021-04-06 | 2021-06-04 | Vidaa美国公司 | Method for processing voice instructions on a display device, display device and server |
CN113379975A (zh) * | 2021-06-09 | 2021-09-10 | 中国银行股份有限公司 | Automatic teller machine interaction method and related device |
CN114047900A (zh) * | 2021-10-12 | 2022-02-15 | 中电金信软件有限公司 | Service processing method, apparatus, electronic device and computer-readable storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103377028A (zh) * | 2012-04-20 | 2013-10-30 | 纽安斯通讯公司 | Method and system for voice-enabling a human-machine interface |
CN104205010A (zh) * | 2012-03-30 | 2014-12-10 | 英特尔公司 | Voice-enabled touchscreen user interface |
CN105161106A (zh) * | 2015-08-20 | 2015-12-16 | 深圳Tcl数字技术有限公司 | Voice control method and apparatus for a smart terminal, and television system |
CN106486118A (zh) * | 2016-09-30 | 2017-03-08 | 北京奇虎科技有限公司 | Voice control method and apparatus for an application |
US20180096686A1 (en) * | 2016-10-04 | 2018-04-05 | Microsoft Technology Licensing, Llc | Combined menu-based and natural-language-based communication with chatbots |
CN107992587A (zh) * | 2017-12-08 | 2018-05-04 | 北京百度网讯科技有限公司 | Voice interaction method, apparatus, terminal and storage medium for a browser |
CN108877791A (zh) * | 2018-05-23 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | View-based voice interaction method, apparatus, server, terminal and medium |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1266625C (zh) | 2001-05-04 | 2006-07-26 | 微软公司 | Server for web-enabled recognition |
DE10207895B4 (de) * | 2002-02-23 | 2005-11-03 | Harman Becker Automotive Systems Gmbh | Method for speech recognition and speech recognition system |
JP2006330576A (ja) * | 2005-05-30 | 2006-12-07 | Sharp Corp | Apparatus operation system, speech recognition apparatus, electronic apparatus, information processing apparatus, program, and recording medium |
US8538757B2 (en) * | 2007-05-17 | 2013-09-17 | Redstart Systems, Inc. | System and method of a list commands utility for a speech recognition command system |
CN103544954A (zh) * | 2012-07-17 | 2014-01-29 | 北京千橡网景科技发展有限公司 | Method and apparatus for adding text labels to a voice message |
US10867597B2 (en) * | 2013-09-02 | 2020-12-15 | Microsoft Technology Licensing, Llc | Assignment of semantic labels to a sequence of words using neural network architectures |
US10203933B2 (en) * | 2014-11-06 | 2019-02-12 | Microsoft Technology Licensing, Llc | Context-based command surfacing |
US9966073B2 (en) * | 2015-05-27 | 2018-05-08 | Google Llc | Context-sensitive dynamic update of voice to text model in a voice-enabled electronic device |
US10261752B2 (en) * | 2016-08-02 | 2019-04-16 | Google Llc | Component libraries for voice interaction services |
CN107871496B (zh) * | 2016-09-23 | 2021-02-12 | 北京眼神科技有限公司 | Speech recognition method and apparatus |
CN108062212A (zh) * | 2016-11-08 | 2018-05-22 | 沈阳美行科技有限公司 | Scene-based voice operation method and apparatus |
US10467510B2 (en) * | 2017-02-14 | 2019-11-05 | Microsoft Technology Licensing, Llc | Intelligent assistant |
CN107180631A (zh) * | 2017-05-24 | 2017-09-19 | 刘平舟 | Voice interaction method and apparatus |
CN107507615A (zh) * | 2017-08-29 | 2017-12-22 | 百度在线网络技术(北京)有限公司 | Interface intelligent interaction control method, apparatus, system and storage medium |
US11182122B2 (en) * | 2017-12-08 | 2021-11-23 | Amazon Technologies, Inc. | Voice control of computing devices |
CN107910003A (zh) * | 2017-12-22 | 2018-04-13 | 智童时刻(厦门)科技有限公司 | Voice interaction method and voice control system for a smart device |
US10762900B2 (en) * | 2018-03-07 | 2020-09-01 | Microsoft Technology Licensing, Llc | Identification and processing of commands by digital assistants in group device environments |
-
2018
- 2018-05-23 CN CN201810501073.7A patent/CN108877791B/zh active Active
-
2019
- 2019-01-18 JP JP2020502486A patent/JP6952184B2/ja active Active
- 2019-01-18 WO PCT/CN2019/072339 patent/WO2019223351A1/zh active Application Filing
-
2020
- 2020-05-29 US US16/888,426 patent/US11727927B2/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104205010A (zh) * | 2012-03-30 | 2014-12-10 | 英特尔公司 | Voice-enabled touchscreen user interface |
CN103377028A (zh) * | 2012-04-20 | 2013-10-30 | 纽安斯通讯公司 | Method and system for voice-enabling a human-machine interface |
CN105161106A (zh) * | 2015-08-20 | 2015-12-16 | 深圳Tcl数字技术有限公司 | Voice control method and apparatus for a smart terminal, and television system |
CN106486118A (zh) * | 2016-09-30 | 2017-03-08 | 北京奇虎科技有限公司 | Voice control method and apparatus for an application |
US20180096686A1 (en) * | 2016-10-04 | 2018-04-05 | Microsoft Technology Licensing, Llc | Combined menu-based and natural-language-based communication with chatbots |
CN107992587A (zh) * | 2017-12-08 | 2018-05-04 | 北京百度网讯科技有限公司 | Voice interaction method, apparatus, terminal and storage medium for a browser |
CN108877791A (zh) * | 2018-05-23 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | View-based voice interaction method, apparatus, server, terminal and medium |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111767021A (zh) * | 2020-06-28 | 2020-10-13 | 广州小鹏车联网科技有限公司 | Voice interaction method, vehicle, server, system and storage medium |
CN113031905A (zh) * | 2020-06-28 | 2021-06-25 | 广州小鹏汽车科技有限公司 | Voice interaction method, vehicle, server, system and storage medium |
EP3958110A1 (en) * | 2020-08-17 | 2022-02-23 | Beijing Xiaomi Pinecone Electronics Co., Ltd. | Speech control method and apparatus, terminal device, and storage medium |
US11749273B2 (en) | 2020-08-17 | 2023-09-05 | Beijing Xiaomi Pinecone Electronics Co., Ltd. | Speech control method, terminal device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
JP2020527753A (ja) | 2020-09-10 |
US20200294505A1 (en) | 2020-09-17 |
US11727927B2 (en) | 2023-08-15 |
CN108877791A (zh) | 2018-11-23 |
CN108877791B (zh) | 2021-10-08 |
JP6952184B2 (ja) | 2021-10-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019223351A1 (zh) | View-based voice interaction method, apparatus, server, terminal and medium | |
JP7029613B2 (ja) | Interface smart interactive control method, apparatus, system and program | |
CN108133707B (zh) | Content sharing method and system | |
KR102429436B1 (ko) | Server for determining a target device based on a user's input and controlling the target device, and operating method thereof | |
JP6713034B2 (ja) | Voice interactive feedback method, system and computer program for a smart TV | |
TW201826112A (zh) | Voice-based interaction method, apparatus, electronic device and operating system | |
JP6848147B2 (ja) | Voice interaction implementation method, apparatus, computer device and program | |
JP2020004376A (ja) | Interaction method and system for third-party applications | |
CN1763842B (zh) | Method and system for verb error recovery in speech recognition | |
US11164571B2 (en) | Content recognizing method and apparatus, device, and computer storage medium | |
WO2019047878A1 (zh) | Method for voice-controlling a terminal, terminal, server and storage medium | |
CN108882101B (zh) | Playback control method, apparatus, device and storage medium for a smart speaker | |
JP2020038709A (ja) | Continuous conversation function in an artificial intelligence device | |
WO2023109129A1 (zh) | Method and apparatus for processing voice data | |
CN111539217B (zh) | Method, device and system for disambiguating natural language content titles | |
WO2021098175A1 (zh) | Guidance method, apparatus, device and computer storage medium for a voice-package recording function | |
WO2023184266A1 (zh) | Voice control method and apparatus, computer-readable storage medium, and electronic device | |
JP6944920B2 (ja) | Smart interaction processing method, apparatus, device and computer storage medium | |
CN111580766B (zh) | Information display method, apparatus and information display system | |
CN114694661A (zh) | First terminal device, second terminal device and voice wake-up method | |
JP2019091448A (ja) | Device presentation method, apparatus, device and program | |
CN112102820B (zh) | Interaction method, interaction apparatus, electronic device and medium | |
CN116226358A (zh) | Method, apparatus, device and medium for generating dialogue recommendation corpora | |
CN112450116A (zh) | Pet management method, apparatus, system, device and storage medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19808105 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2020502486 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19808105 Country of ref document: EP Kind code of ref document: A1 |