CN112102820A - Interaction method, interaction device, electronic equipment and medium - Google Patents

Interaction method, interaction device, electronic equipment and medium

Info

Publication number
CN112102820A
Authority
CN
China
Prior art keywords
information
system event
event
voice
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910527533.8A
Other languages
Chinese (zh)
Inventor
柳刘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huijun Technology Co.,Ltd.
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority claimed from application CN201910527533.8A
Publication of CN112102820A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 — Execution procedure of a spoken command

Abstract

The present disclosure provides an interaction method, an interaction apparatus, an electronic device, and a medium, the interaction method including: receiving voice information; determining whether information associated with a system event is included in the voice information, wherein the system event is an event corresponding to information inputtable by at least one input device supported by the electronic equipment; when it is determined that the information associated with the system event is included in the voice information, the electronic device distributes the system event to an object related to the system event.

Description

Interaction method, interaction device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular to an interaction method and an interaction apparatus suitable for an electronic device, an electronic device, and a medium.
Background
With the arrival of the artificial intelligence (AI) era, voice interaction has become an increasingly important user entry point. Operations such as device control and task execution, which previously had to be completed with input devices such as a mouse, keyboard, or touch pad, can now be completed through voice interaction alone. Voice interaction also solves the problem that some people are unable to use computers and intelligent terminals, and it is regarded as one of the important new changes following the application (APP) era.
In existing voice interaction technology, interaction is usually carried out in natural language against the functions of a specific application, for example "play a song for me" or "open WeChat" instructs a specific application to play a song or to open an application. This technology completely abandons the established habit of graphical operation interfaces on mobile terminal APPs and personal computers (PCs).
In the course of implementing the disclosed concept, the inventors found at least the following problems in the prior art: the existing voice interaction technology is tightly coupled with the application, so the application is not easy to reuse; in addition, for an existing application operated through input devices such as a keyboard and a mouse, users have to develop new usage habits when the operation mode is switched to voice operation.
Disclosure of Invention
In view of the above, the present disclosure provides an interaction method, an interaction apparatus, an electronic device, and a medium that decouple voice interaction technology from the application and do not require users to develop new usage habits.
One aspect of the present disclosure provides an interaction method performed by an electronic device, including the following operations: first, voice information is received; then, it is determined whether information associated with a system event is included in the voice information, where the system event is an event corresponding to information inputtable by at least one input device supported by the electronic device; then, when it is determined that the information associated with the system event is included in the voice information, the electronic device distributes the system event to an object related to the system event.
The interaction method provided by the present disclosure determines whether the voice information includes information of a system event corresponding to the inputtable information of an input device, such as clicking a mouse button, inputting the letter a, or tapping a touch screen, and if so, distributes the system event. In this way, input voice information can be directly converted into the instructions that would otherwise be entered by operating input devices such as a keyboard and a mouse; the voice information is thereby converted into system events, and the system distributes these events to applications, realizing human-machine interaction. The applicable applications are very wide, for example the various applications installed on an electronic device equipped with an input device, and application developers are spared the voice-interaction-specific technologies they would otherwise have to attend to.
According to an embodiment of the present disclosure, the determining whether the voice information includes information associated with a system event includes: matching the voice information against an acoustic feature library, and determining whether the voice information includes an acoustic feature corresponding to a system event, where the acoustic feature library is stored in the electronic device and includes correspondences between acoustic features and system events. Because the acoustic feature library can be trained offline in advance, whether the voice information includes information associated with a system event can be determined quickly and accurately based on the acoustic feature library.
According to an embodiment of the present disclosure, the method may further include constructing the acoustic feature library, wherein the constructing the acoustic feature library may include: first, the system events and corresponding acoustic features are obtained, and then a mapping model between the system events and the corresponding acoustic features is generated.
According to an embodiment of the present disclosure, the constructing the acoustic feature library may include the following operations: first, obtaining text information of a system event corresponding to inputtable information of the at least one input device; then, generating acoustic features of the text information based on acoustic features of speech units; and then storing the system event, the acoustic features, and the mapping model. The speech units are the basic units from which speech is composed, such as phonemes and words, so the acoustic features of a system event can be synthesized from them.
According to an embodiment of the present disclosure, the obtaining the system event and the corresponding acoustic feature may include the operations of: firstly, a system event corresponding to the inputtable information of the at least one input device is obtained, then, voice information of the system event corresponding to the inputtable information of the at least one input device is obtained, and then, acoustic feature extraction is carried out on the voice information of the system event. Thus, acoustic features of system events can be obtained using the speech modeling tool.
According to an embodiment of the present disclosure, the method may further include updating the acoustic feature library independently of an application installed in the electronic device. This eliminates the need to update the corresponding application at the same time as the acoustic feature library needs to be updated.
According to an embodiment of the present disclosure, the method may further include the following operations: when it is determined that the voice information does not include information associated with a system event, performing voice recognition on the voice information to obtain text information; then determining semantic information of the text information; then determining whether the semantic information includes semantic information associated with a system event; and when the semantic information includes the semantic information associated with the system event, the electronic device distributes the system event to an object related to the system event. When the voice information does not directly include voice corresponding to a system event, voice recognition and semantic analysis are performed to obtain semantic information, from which the system event can then be obtained. This effectively covers the many different voice expressions of the same meaning and helps improve user experience.
According to the embodiment of the present disclosure, the semantic information of the system event includes standard semantic information and extended semantic information, where the extended semantic information is obtained by extending the semantics of the standard semantic information.
Another aspect of the present disclosure provides an interaction apparatus executed by an electronic device, which may include an information receiving module, a first event determining module, and a first event distributing module. The information receiving module is used for receiving voice information; the first event determining module is used for determining whether information associated with a system event is included in the voice information, where the system event is an event corresponding to information inputtable by at least one input device supported by the electronic device; and the first event distributing module is used for, when it is determined that the information associated with the system event is included in the voice information, enabling the electronic device to distribute the system event to an object related to the system event.
According to an embodiment of the present disclosure, the first event determining module is specifically configured to match the speech information in an acoustic feature library, and determine whether the speech information includes an acoustic feature corresponding to a system event, where the acoustic feature library is stored in the electronic device, and the acoustic feature library includes a correspondence between the acoustic feature and the system event.
According to an embodiment of the present disclosure, the apparatus may further include a model library construction module, which comprises an acoustic feature obtaining unit and a mapping model generating unit, where the acoustic feature obtaining unit is used for obtaining the system event and the corresponding acoustic features, and the mapping model generating unit is used for generating a mapping model between the system event and the corresponding acoustic features.
According to an embodiment of the present disclosure, the acoustic feature acquisition unit includes an event obtaining subunit, a voice information obtaining subunit, and an acoustic feature obtaining subunit. The event obtaining subunit is used for obtaining system events corresponding to the inputtable information of the at least one input device, the voice information obtaining subunit is used for obtaining voice information of the system events corresponding to the inputtable information of the at least one input device, and the acoustic feature obtaining subunit is used for performing acoustic feature extraction on the voice information of the system events.
According to an embodiment of the present disclosure, the apparatus may further include an update module for updating the acoustic feature library independently of an application installed in the electronic device.
According to an embodiment of the present disclosure, the apparatus further comprises: a voice recognition module, a semantic information acquisition module, a second event determination module, and a second event distribution module. The voice recognition module is used for performing voice recognition on the voice information to obtain text information when it is determined that the voice information does not include information associated with a system event; the semantic information acquisition module is used for determining semantic information of the text information; the second event determination module is used for determining whether the semantic information includes semantic information associated with a system event; and the second event distribution module is used for, when the semantic information includes semantic information associated with a system event, enabling the electronic device to distribute the system event to an object related to the system event.
According to the embodiment of the present disclosure, the semantic information of the system event includes standard semantic information and extended semantic information, where the extended semantic information is obtained by extending the semantics of the standard semantic information.
Another aspect of the present disclosure provides an electronic device comprising one or more processors and a storage, wherein the storage is configured to store executable instructions that, when executed by the processors, implement the method as described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of the disclosure provides a computer program comprising computer executable instructions for implementing the method as described above when executed.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
fig. 1A schematically illustrates an application scenario of an interaction method, an interaction apparatus, an electronic device and a medium according to an embodiment of the present disclosure;
FIG. 1B schematically illustrates an exemplary system architecture to which the interaction method according to an embodiment of the present disclosure is applicable;
FIG. 2 schematically shows a flow chart of an interaction method performed by an electronic device according to an embodiment of the present disclosure;
FIG. 3A schematically illustrates a flow chart of an interaction method performed by an electronic device, according to another embodiment of the present disclosure;
FIG. 3B schematically illustrates a logic diagram for natural speech manipulation according to an embodiment of the present disclosure;
FIG. 4A schematically illustrates a block diagram of an interaction apparatus performed by an electronic device, in accordance with an embodiment of the present disclosure;
FIG. 4B schematically illustrates a block diagram of a speech manipulation module according to an embodiment of the present disclosure; and
fig. 5 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, such a construction is generally intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.). Where a convention analogous to "at least one of A, B or C, etc." is used, such a construction is likewise intended in the sense one having skill in the art would understand the convention. The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of the described features.
Voice interaction is usually realized at the application layer: speech recognition and analysis yield text or a control instruction, which is then converted into application logic to achieve control or interaction. For example, the user sends the voice message "Jingdong, I want to buy a television" to the current application, which, after receiving the voice, determines through algorithmic processing and semantic analysis that it is a purchase instruction. This kind of voice interaction is tightly coupled with the application and is not easy to reuse. A large amount of acoustic feature information, such as a voice model, and often a large semantic model as well, is embedded in the application layer, occupying considerable storage space (this is also described as making the application package heavy). Moreover, whenever the application's functions change and the application has to be reinstalled or upgraded, the voice model and the semantic model have to be reinstalled or upgraded at the same time, which is inconvenient for the popularization and use of the application.
Analysis shows that the key problem of the prior art is that the new voice interaction mode completes tasks, conversations and the like directly by parsing complete instructions (intentions) aimed at specific functions of an application. Where instructions used to be entered by operating input devices such as a keyboard and a mouse and are now entered by voice, users must develop new usage habits; this takes a long time, hinders the user experience, requires substantial integration with the existing system or application (such as the Excel program supported on a desktop computer), and is not conducive to the popularization of voice interaction. In addition, the training cost of the voice model and the semantic model is high, and they have to be retrained for newly developed functions of different applications, which further hampers the popularization of voice interaction.
The embodiment of the disclosure provides an interaction method, an interaction device, electronic equipment and a medium. The method comprises a system event identification process and a system event distribution process. In the system event recognition process, whether information related to a system event is included in the received voice information is determined, wherein the system event is an event corresponding to information that can be input by at least one input device supported by the electronic equipment. And after the system event recognition process is completed, entering a system event distribution process, and when the voice information is determined to include the information related to the system event, distributing the system event to an object related to the system event by the electronic equipment.
Fig. 1A schematically illustrates an application scenario of an interaction method, an interaction apparatus, an electronic device and a medium according to an embodiment of the present disclosure.
As shown in fig. 1A, a user 10 has used a notebook computer 20 for office work for a long time and is proficient in operating office software supported by the Windows operating system, such as Excel and Word (called applications in the Android and iOS operating systems). The user now wants to work by voice interaction; for example, with Word currently open, the user 10 wants to input the number 528 in the row where the cursor is located and input "bed" in the next row. The keyboard can input an almost uncountable number of character combinations in Word, such as input a, input b, input ab, input abc, input aa, input aca, input nice, input all normal, input work scheduling, and so on, and each such input is a complete piece of voice information. The prior art would have to train a voice model for each complete piece of voice information separately; since the combinations are inexhaustible, this is extremely difficult to realize. In the existing voice interaction technology, voice models are set separately for a particular application; because an application has a limited number of functions, a corresponding voice model can be trained for each function, for example "set the air-conditioner temperature to 28 degrees" (which corresponds to a specific function of the application), or a voice model can be trained for each song title. That is, voice models are trained for instructions aimed at specific functions of the application, but a corresponding voice model library has to be placed in each application, which increases the size of the software package. To address this, the prior art also performs voice recognition on the received voice information to obtain text information and then determines the corresponding instruction from the text. However, the voice model library involved in such voice recognition is very large and demands great computing power, so it is basically impossible to implement on a personal terminal; usually a server performs the voice recognition and sends the recognized text back to the personal terminal, which is not applicable in many scenarios, for example when the personal terminal is not networked or cannot conveniently be networked. In addition, because speech contains many words with similar pronunciations, such as "bad" and "bed", whole-utterance recognition is prone to errors.
To solve the above problems, the embodiments of the present disclosure determine whether information associated with a system event is included in the received voice information, where the system event is an event corresponding to inputtable information of at least one input device supported by the electronic device, and the number of system events corresponding to the inputtable information of at least one input device is limited. For example, the information that the keyboard can input includes input a, input b, input enter (line break), input F4, input A, input caps lock, and so on. This inputtable information is small in quantity and differs greatly in pronunciation, so the information of the system event corresponding to the inputtable information can be accurately recognized with a small voice model package, which makes the approach easy to popularize across the software of various operating systems. As shown in fig. 1A, the user 10 only needs to open the Word software and then say "input 5, 2, 8" (where there may be pauses between 5, 2 and 8, or "input 5, input 2, input 8"), "input line break", "input b, e, d", and so on.
Fig. 1B schematically illustrates an exemplary system architecture to which the interaction method according to an embodiment of the present disclosure is applicable. It should be noted that fig. 1B is only an example of a system architecture 100 to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1B, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or transmit information or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as shopping applications, web browser applications, office applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices that support at least one input means, including but not limited to smart phones, tablets, laptop and desktop computers, smart televisions, and the like.
The server 105 may be a server providing various services, such as a background management server (for example only) providing support for application updates, downloads and speech recognition requested by the user with the terminal devices 101, 102, 103. The background management server may analyze and otherwise process the received data such as the user request, and feed back a processing result (for example, information or data obtained or generated according to the user request) to the terminal device.
It should be noted that the interaction method provided by the embodiments of the present disclosure may be generally executed by the terminal devices 101, 102, and 103.
It should be understood that the number of terminal devices, networks, and servers are merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically shows a flow chart of an interaction method performed by an electronic device according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S201 to S205.
In operation S201, voice information is received.
In this embodiment, the voice information may be collected by a sound sensor of the electronic device, such as a microphone, or may be voice information, such as audio data, received from another electronic device.
In operation S203, it is determined whether information associated with a system event is included in the voice information, wherein the system event is an event corresponding to inputtable information of at least one input device supported by the electronic apparatus.
Specifically, the determining whether the voice information includes information associated with a system event may specifically include the following operations.
And matching the voice information in an acoustic feature library, and determining whether the voice information comprises acoustic features corresponding to system events. Wherein the acoustic feature library is stored in the electronic device and includes a correspondence between acoustic features and system events.
For example, the keyboard of the electronic device has a PgUp key and a PgDn key. The voice information corresponding to the input information of the PgUp key may be "previous page" and the like, and the voice information corresponding to the input information of the PgDn key may be "page turning", "next page", and the like. If the pronunciation of "previous page" is detected in the voice information, the corresponding system event may be determined to include one of: WM_KEYDOWN, WM_CHAR (KeyPageDown), WM_KEYUP, and the like. The instructions for the same system event may differ between operating systems, so the instruction for a specific system event needs to be determined according to the system in use.
In operation S205, when it is determined that the information associated with the system event is included in the voice information, the electronic device distributes the system event to an object related to the system event.
For example, when a user is using the Word application and utters voice information such as "page down" or "input a", the electronic device has the system send the instruction of the corresponding system event to the Word software, thereby realizing voice interaction between the user and the electronic device.
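For example, a minimal sketch of this dispatch step on a Windows system might look as follows (the phrase-to-key mapping, the foreground-window lookup, and the simplified message parameters are illustrative assumptions, not the actual implementation of the disclosure):

```python
# Sketch (assumption): turn a matched voice phrase into a Windows key event and
# post it to the foreground window. A real dispatcher would also fill in the
# scan-code/repeat-count bits of lParam and handle keyboard focus explicitly.
import ctypes

WM_KEYDOWN = 0x0100
WM_KEYUP = 0x0101
VK_PRIOR = 0x21   # Page Up
VK_NEXT = 0x22    # Page Down

# Hypothetical mapping from matched speech expressions to virtual-key codes.
PHRASE_TO_VKEY = {
    "previous page": VK_PRIOR,
    "page down": VK_NEXT,
    "next page": VK_NEXT,
}

def dispatch_system_event(phrase: str) -> None:
    """Post the key-down/key-up pair for the matched phrase to the active window."""
    vkey = PHRASE_TO_VKEY[phrase]
    user32 = ctypes.windll.user32
    hwnd = user32.GetForegroundWindow()            # e.g. the Word window
    user32.PostMessageW(hwnd, WM_KEYDOWN, vkey, 0)
    user32.PostMessageW(hwnd, WM_KEYUP, vkey, 0)

dispatch_system_event("page down")
```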
In another embodiment, the acoustic feature library may be constructed in the following manner.
For example, constructing the acoustic feature library may include the following operations.
First, the system events and corresponding acoustic features are obtained. For example, the inputtable information of various keyboards can be obtained through statistics, and the names of the corresponding system events in different operating systems and the instructions for the inputtable information can also be obtained through statistics. In addition, the acoustic features corresponding to the system events can be obtained by means of speech synthesis.
After the inputtable information of the input device is determined, the common voice expressions of the inputtable information and the corresponding operations are collected, and feature extraction is then performed on these common voice expressions to obtain the acoustic features corresponding to the system events.
Then, a mapping model between the system events and the corresponding acoustic features is generated.
For example, a mapping model between the acoustic features of "previous page" and WM_KEYDOWN, WM_CHAR, WM_KEYUP, and the like. The system events (e.g., instructions), the corresponding acoustic features, and the mapping models may then be stored in the acoustic feature library.
The acoustic feature library can be constructed off-line, is suitable for systems with the same system events, and can be optimized and updated in the using process.
In a particular embodiment, the obtaining the system events and corresponding acoustic features may include the following operations.
First, system events corresponding to the inputtable information of the at least one input device are obtained.
Then, voice information of a system event corresponding to the inputtable information of the at least one input device is obtained.
Then, the voice information of the system event is subjected to acoustic feature extraction, for example, the acoustic feature of the voice information corresponding to the system event is obtained by using a voice model tool. The method of feature extraction is not limited.
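For example, the feature-extraction step might be sketched as follows (librosa and MFCC features are used here only as an assumed example toolchain, and the file name is hypothetical; the disclosure does not prescribe a particular extractor):

```python
# Sketch (assumption): extract MFCC features from one recorded utterance of a
# system-event phrase, to be added to the acoustic feature library for that event.
import librosa

def extract_acoustic_feature(wav_path: str):
    """Return MFCC features for a recorded utterance such as "previous page"."""
    signal, sr = librosa.load(wav_path, sr=16000)             # mono, 16 kHz
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)   # shape (13, frames)
    return mfcc

# Utterances from several speakers would be collected per system event and their
# features stored under that event in the acoustic feature library.
feature = extract_acoustic_feature("previous_page_speaker01.wav")
```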
For example, the input data is trained using speech modeling tools and algorithms such as conditional random fields (CRF) or deep learning, so as to arrive at a mapping table from speech expressions to system events for the Windows system, as shown in Table 1.
Table 1: Schematic mapping table from speech expressions to system events for the Windows system
Different operating systems may differ in the name and definition of the same system event (e.g., a right mouse click). Thus, the mapping from speech expressions to system events may differ between operating systems.
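For example, a minimal sketch of such per-system mapping tables might look as follows (the Windows message names and X11 event names are real, but the table layout is only an illustration, not the original Table 1):

```python
# Sketch (assumption): one speech-expression-to-system-event table per operating system.
SPEECH_TO_EVENT = {
    "windows": {
        "previous page": ["WM_KEYDOWN", "WM_KEYUP"],          # VK_PRIOR key code
        "right click":   ["WM_RBUTTONDOWN", "WM_RBUTTONUP"],
    },
    "x11": {
        "previous page": ["KeyPress", "KeyRelease"],          # keysym Prior
        "right click":   ["ButtonPress", "ButtonRelease"],    # pointer button 3
    },
}
```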
Next, a model mapping speech to system events is generated. The model is independent of the running software and can be updated separately. A voice control software package is then generated: a "voice decoding library" is applied to the generated model to develop extended functions, producing a "voice control software module".
In another particular embodiment, the obtaining the system events and corresponding acoustic features may include the following operations.
First, text information of the speech expressions of the system events corresponding to the inputtable information of the at least one input device is obtained.
Then, acoustic features of the textual information are generated based on the acoustic features of the speech units. Among them, the voice unit includes but is not limited to: consonants, vowels, initials, finals, characters, words, phrases, and the like.
Next, the system events, the acoustic features, and the mapping model are stored. Wherein the mapping model is a model between the system events and the corresponding acoustic features.
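For example, the synthesis of a phrase feature from speech-unit features might be sketched as follows (the per-unit features below are random placeholders and the word-level units are an assumption; phonemes, initials/finals, or characters would work the same way):

```python
# Sketch (assumption): compose the acoustic feature of a system-event phrase by
# concatenating pre-stored features of smaller speech units along the time axis.
import numpy as np

UNIT_FEATURES = {
    "previous": np.random.randn(13, 40),   # e.g. 13 MFCC coefficients x 40 frames
    "page":     np.random.randn(13, 35),
}

def synthesize_phrase_feature(phrase_text: str) -> np.ndarray:
    """Concatenate unit features to approximate the feature of the whole phrase."""
    return np.concatenate([UNIT_FEATURES[u] for u in phrase_text.split()], axis=1)

# The synthesized feature is then stored together with the system event and the
# mapping model in the acoustic feature library.
feature = synthesize_phrase_feature("previous page")   # shape (13, 75)
```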
It should be noted that, because the acoustic feature library is decoupled from the applications installed in the electronic device (for example, the acoustic feature library is not encapsulated in an application but in the system, and is called by the system), it is possible to update the acoustic feature library independently of the applications installed in the electronic device.
The interaction method provided by the present disclosure can directly convert the input voice information into the instructions that would otherwise be entered by operating input devices such as a keyboard and a mouse, and thereby convert the voice information into system events. The system distributes the system events to applications, realizing human-machine interaction. The applicable applications are very wide, such as the various applications installed on an electronic device equipped with an input device, and application developers are spared the voice-interaction-specific technologies they usually have to attend to.
Fig. 3A schematically shows a flow chart of an interaction method performed by an electronic device according to another embodiment of the present disclosure.
Because the same meaning can be expressed in speech in many different forms, for example different local expression habits and especially dialects, it is difficult for the acoustic feature library to contain the acoustic features of every dialect for each system event. Therefore, the semantic information of the voice information can be obtained through voice recognition, and the system event corresponding to the voice information can then be determined based on the semantic information.
As shown in fig. 3A, the method may further include the following operations.
In operation S301, when it is determined that information associated with a system event is not included in the voice information, voice recognition is performed on the voice information to obtain text information.
The speech recognition may adopt, for example, automatic speech recognition (ASR) technology, by which the text information corresponding to the voice information is obtained. This operation may be implemented locally or via a server.
In operation S303, semantic information of the text information is determined.
In operation S305, it is determined whether the semantic information includes semantic information associated with a system event.
In operation S307, when the semantic information includes semantic information associated with a system event, the electronic device distributes the system event to an object related to the system event.
It should be noted that if the semantic information does not include semantic information associated with a system event, the system layer has determined that there is no information associated with a system event in the voice information. However, if a voice interaction function is provided in the application, voice interaction may still continue with respect to that function; for example, the application layer may detect whether the voice information contains a voice instruction that the application can recognize.
In another embodiment, the semantic information of the system event includes standard semantic information and extended semantic information, where the extended semantic information is obtained by extending the semantics of the standard semantic information. For example, expressions for the same object differ greatly between age groups: a young child tends to pronounce "X" as "cha", and since the keyboard can only input "X", the child's real meaning is to input "X". In this way, the situation where voice information with the same meaning differs across scenarios can be handled conveniently, improving the user experience.
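A minimal sketch of how standard and extended semantic entries might sit side by side is shown below (the event identifiers and the extended phrases are illustrative assumptions, not taken from the disclosure):

```python
# Illustrative (hypothetical) semantic table: standard entries plus extended
# entries that map alternative expressions of the same intent to the same event.
SEMANTIC_TO_EVENT = {
    # standard semantic information
    "input X": "KEY_X",
    "input O": "KEY_O",
    # extended semantic information derived from the standard entries
    "input cha": "KEY_X",      # e.g. a child pronouncing "X" as "cha"
    "input circle": "KEY_O",
}
```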
The following description will be given with reference to a specific example.
FIG. 3B schematically illustrates a logic diagram for natural speech manipulation according to an embodiment of the present disclosure.
As shown in fig. 3B, after the user turns on the voice manipulation function (by means of a switch, a wakeup word, or the like), the voice information is matched in the acoustic feature library.
If the matching succeeds, the system event corresponding to the matched acoustic feature is directly encoded, obtaining the system event.
If the matching fails, the voice information is processed by ASR technology to obtain its text information. Semantic analysis is then performed on the text information, and the result is matched against the instructions corresponding to system events (for example, the text "page turning" can be matched to the PAGEDOWN system event).
After obtaining the system event, the operating system distributes the system event. The distribution of the system events may employ existing techniques, which are not limited herein.
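For example, the overall control flow of fig. 3B might be sketched as follows (the four callables stand in for the acoustic matcher, ASR engine, semantic parser, and event dispatcher described above; they are placeholders, not real library calls):

```python
# Sketch (assumption) of the natural speech manipulation flow: acoustic matching
# first, ASR plus semantic matching as the fallback, then system event dispatch.
from typing import Callable, Optional

def handle_voice(
    audio_frames: bytes,
    match_acoustic_library: Callable[[bytes], Optional[str]],
    asr_to_text: Callable[[bytes], str],
    semantic_to_event: Callable[[str], Optional[str]],
    dispatch: Callable[[str], None],
) -> None:
    event = match_acoustic_library(audio_frames)   # fast path: acoustic feature match
    if event is None:
        text = asr_to_text(audio_frames)           # fallback: speech recognition
        event = semantic_to_event(text)            # e.g. "page turning" -> "PAGEDOWN"
    if event is not None:
        dispatch(event)                            # hand the system event to the OS
    # otherwise an application-level voice feature may still handle the utterance
```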
In another embodiment, attribute information of the user, for example the user's age, may also be determined based on the user's acoustic features, and the accuracy of the obtained semantic information may then be improved based on this attribute information. For example, after receiving the voice information, it is determined that the utterance comes from a young child asking the computer to input a "circle" (using a reduplicated, child-style expression). If the computer only has a mouse and a keyboard as input devices, that is, "circle" is not among the inputtable information of the input devices, then based on the extended semantic information the semantics of the voice information is most likely "input OO"; and since reduplicating syllables is merely the young child's speaking habit, the child's real intention can be determined to be to input "O". In this way, the user's real intention can be determined based on the user's attribute information, the inputtable information of the input devices, and the extended semantic information, effectively improving the interaction experience.
Fig. 4A schematically illustrates a block diagram of an interaction device executed by an electronic device according to an embodiment of the present disclosure.
As shown in fig. 4A, the interaction apparatus 400 may include an information receiving module 410, a first event determining module 430, and a first event distributing module 450.
The information receiving module 410 is configured to receive voice information.
The first event determining module 430 is configured to determine whether information associated with a system event is included in the voice information, where the system event is an event corresponding to information that can be input by at least one input device supported by the electronic device.
The first event distribution module 450 is configured to, when it is determined that the information associated with the system event is included in the voice information, distribute the system event to an object related to the system event by the electronic device.
Specifically, the first event determining module 430 may be specifically configured to match the speech information in an acoustic feature library, and determine whether an acoustic feature corresponding to a system event is included in the speech information, where the acoustic feature library is stored in the electronic device, and the acoustic feature library includes a correspondence between the acoustic feature and the system event.
Furthermore, the apparatus 400 may further include: and a model library construction module.
The model library construction module comprises an acoustic feature acquisition unit and a mapping model generation unit.
The acoustic feature acquisition unit is used for acquiring the system event and corresponding acoustic features.
The mapping model generating unit is used for generating a mapping model between the system event and the corresponding acoustic feature.
Wherein the acoustic feature acquisition unit may include: the device comprises an event obtaining subunit, a voice information obtaining subunit and an acoustic feature obtaining subunit.
The event obtaining subunit is used for obtaining a system event corresponding to the inputtable information of the at least one input device.
The voice information obtaining subunit is used for obtaining the voice information of the system event corresponding to the inputtable information of the at least one input device.
The acoustic feature acquisition subunit is configured to perform acoustic feature extraction on the voice information of the system event.
FIG. 4B schematically illustrates a block diagram of a speech manipulation module according to an embodiment of the present disclosure.
As shown in fig. 4B, the voice manipulation module 4000 is a runtime system for voice interaction and includes: a voice receiving sub-module, a voice model management sub-module, a matching sub-module, and an ASR sub-module.
The voice receiving submodule is used for receiving voice information and coding the voice information after the operating system enters a voice control mode.
The voice model management submodule is used for storing and managing acoustic characteristics related to voice control, such as a voice model. The acoustic features may be generated by training and may be updated individually.
And the matching submodule is used for matching the input voice information with the acoustic characteristics, and if the matching is successful, converting to obtain a system event. If the matching fails, the text information obtained by voice recognition can be subjected to semantic matching, and if the semantic matching succeeds, the text information can be converted into a system event.
And the ASR submodule is used for converting the voice information into text information.
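A minimal skeleton of the voice manipulation module 4000 and its four sub-modules might look as follows (the class and method names are assumptions, and the method bodies are placeholders for the behaviour described above):

```python
# Sketch (assumption): structural skeleton of the voice manipulation module.
class VoiceManipulationModule:
    def receive(self, audio):
        """Voice receiving sub-module: capture and encode voice information."""
        ...

    def manage_models(self):
        """Voice model management sub-module: store and independently update acoustic features."""
        ...

    def match(self, encoded_audio):
        """Matching sub-module: acoustic match first; semantic match on failure."""
        ...

    def recognize(self, encoded_audio) -> str:
        """ASR sub-module: convert voice information into text information."""
        ...
```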
In another embodiment, the apparatus 400 may further include an update module for updating the acoustic feature library independently of an application installed in the electronic device.
In order to increase the success rate of voice interaction, the apparatus 400 further comprises: the system comprises a voice recognition module, a semantic information acquisition module, a second event determination module and a second event distribution module.
The voice recognition module is used for performing voice recognition on the voice information to obtain text information when the voice information is determined not to include information related to the system event.
The semantic information acquisition module is used for determining semantic information of the text information.
The second event determination module is to determine whether the semantic information includes semantic information associated with a system event.
The second event distribution module is used for distributing the system event to an object related to the system event by the electronic equipment when the semantic information comprises semantic information related to the system event.
In order to increase applicable population, the semantic information of the system event may include standard semantic information and extended semantic information, where the extended semantic information semantics is obtained by extending the semantics of the standard semantic information.
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
For example, any plurality of the information receiving module 410, the first event determining module 430, and the first event distributing module 450 may be combined in one module to be implemented, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the information receiving module 410, the first event determining module 430, and the first event distributing module 450 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in any one of three implementations of software, hardware, and firmware, or in any suitable combination of any of them. Alternatively, at least one of the information receiving module 410, the first event determining module 430 and the first event distributing module 450 may be at least partially implemented as a computer program module, which when executed, may perform a corresponding function.
Fig. 5 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, an electronic device 500 according to an embodiment of the present disclosure includes a processor 501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. The processor 501 may comprise, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 501 may also include onboard memory for caching purposes. Processor 501 may include a single processing unit or multiple processing units for performing different actions of a method flow according to embodiments of the disclosure.
In the RAM 503, various programs and data necessary for the operation of the system 500 are stored, for example, an acoustic feature library is stored. The processor 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. The processor 501 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 502 and/or the RAM 503. Note that the programs may also be stored in one or more memories other than the ROM 502 and the RAM 503. The processor 501 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, system 500 may also include an input/output (I/O) interface 505, input/output (I/O) interface 505 also being connected to bus 504. The system 500 may also include one or more of the following components connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program, when executed by the processor 501, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include ROM 502 and/or RAM 503 and/or one or more memories other than ROM 502 and RAM 503 described above.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or combinations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not expressly recited in the present disclosure. In particular, various combinations and/or combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (15)

1. An interaction method performed by an electronic device, comprising:
receiving voice information;
determining whether information associated with a system event is included in the voice information, wherein the system event is an event corresponding to information inputtable by at least one input device supported by the electronic device; and
when it is determined that the information associated with the system event is included in the voice information, the electronic device distributes the system event to an object related to the system event.
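By way of illustration only, and not as part of the claim language, the flow recited in claim 1 might be sketched in Python as follows; handle_voice, feature_library and dispatcher are hypothetical names introduced for this sketch:

    # Hypothetical sketch of the claim 1 flow: receive voice information, decide
    # whether it carries information associated with a system event, and, if so,
    # distribute that event to the object related to it.
    def handle_voice(voice_info, feature_library, dispatcher):
        system_event = feature_library.match(voice_info)   # any system event in the voice?
        if system_event is not None:
            dispatcher.dispatch(system_event)               # distribute to the related object
            return True
        return False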
2. The method of claim 1, wherein:
the determining whether the voice information includes information associated with a system event comprises: matching the voice information against an acoustic feature library, and determining whether the voice information includes an acoustic feature corresponding to a system event;
wherein the acoustic feature library is stored in the electronic device and includes a correspondence between acoustic features and system events.
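A minimal sketch of such an acoustic feature library, assuming purely for illustration that feature vectors are compared by cosine similarity against a fixed threshold (the application does not specify a matching metric), could look like the following; AcousticFeatureLibrary and all of its members are hypothetical:

    import numpy as np

    # Hypothetical store of (acoustic feature vector, system event) correspondences.
    class AcousticFeatureLibrary:
        def __init__(self):
            self._entries = []   # list of (feature_vector, system_event) pairs

        def add(self, feature_vector, system_event):
            self._entries.append((np.asarray(feature_vector, dtype=float), system_event))

        def match(self, voice_features, threshold=0.9):
            # Return the system event whose stored feature is most similar to the
            # query features, or None when nothing clears the threshold.
            query = np.asarray(voice_features, dtype=float)
            best_event, best_score = None, threshold
            for stored, event in self._entries:
                denom = np.linalg.norm(stored) * np.linalg.norm(query) + 1e-9
                score = float(np.dot(stored, query) / denom)
                if score > best_score:
                    best_event, best_score = event, score
            return best_event

Returning None rather than raising an error lets a caller fall through to the semantic path described in claim 6.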
3. The method of claim 2, further comprising constructing the acoustic feature library, wherein constructing the acoustic feature library comprises:
obtaining the system events and corresponding acoustic features; and
generating a mapping model between the system events and the corresponding acoustic features.
4. The method of claim 3, wherein the obtaining the system events and corresponding acoustic features comprises:
obtaining system events corresponding to inputtable information of the at least one input device;
obtaining voice information of a system event corresponding to the inputtable information of the at least one input device; and
extracting acoustic features from the voice information of the system event.
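Taken together, claims 3 and 4 describe enumerating the system events an input device can produce, collecting spoken voice information for each event, and extracting acoustic features. A hypothetical sketch follows, with extract_features standing in for any real acoustic front end and library for a store such as the one sketched after claim 2:

    # Hypothetical construction of the acoustic feature library: enumerate system
    # events, collect spoken samples for each, extract features, and store the
    # event/feature correspondence.
    def build_feature_library(library, input_device_events, recorded_samples, extract_features):
        """library:             object exposing add(feature_vector, system_event)
        input_device_events: iterable of system event identifiers
        recorded_samples:    dict mapping event identifier -> list of audio clips
        extract_features:    callable turning one audio clip into a feature vector"""
        for event in input_device_events:
            for clip in recorded_samples.get(event, []):
                library.add(extract_features(clip), event)
        return library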
5. The method of claim 2, further comprising updating the acoustic feature library independently of an application installed in the electronic device.
6. The method of claim 1, further comprising:
when it is determined that the voice information does not include information associated with a system event, performing voice recognition on the voice information to obtain text information;
determining semantic information of the text information;
determining whether the semantic information includes semantic information associated with a system event; and
when the semantic information includes semantic information associated with a system event, the electronic device distributes the system event to an object related to the system event.
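A hypothetical sketch of this fallback path, with recognize and parse_semantics standing in for real speech-recognition and semantic-parsing components and semantic_table for the mapping from semantic information to system events:

    # Hypothetical fallback path: try the acoustic-feature match first; if it
    # fails, recognise the voice as text, derive its semantics, and look the
    # semantics up against the semantic descriptions of system events.
    def handle_voice_with_fallback(voice_info, feature_library, semantic_table,
                                   recognize, parse_semantics, dispatcher):
        event = feature_library.match(voice_info)
        if event is None:
            text = recognize(voice_info)           # voice information -> text information
            semantics = parse_semantics(text)      # text information -> semantic information
            event = semantic_table.get(semantics)  # semantic information -> system event
        if event is not None:
            dispatcher.dispatch(event)
        return event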
7. The method of claim 6, wherein:
the semantic information of the system event comprises standard semantic information and extended semantic information, wherein the extended semantic information is obtained by extending the semantics of the standard semantic information.
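As a purely illustrative example of this distinction (the phrases and event names below are invented, not taken from the application), a standard phrasing and its extended phrasings can all map to the same system event:

    # Invented phrases and event names: one standard phrasing plus extended
    # phrasings that all resolve to the same system event.
    SEMANTIC_TABLE = {
        "scroll down": "EVENT_SCROLL_DOWN",      # standard semantic information
        "page down": "EVENT_SCROLL_DOWN",        # extended semantic information
        "go further down": "EVENT_SCROLL_DOWN",  # extended semantic information
        "double click": "EVENT_DOUBLE_CLICK",    # standard semantic information
        "open it": "EVENT_DOUBLE_CLICK",         # extended semantic information
    }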
8. An interaction apparatus performed by an electronic device, comprising:
an information receiving module, configured to receive voice information;
a first event determination module, configured to determine whether the voice information includes information associated with a system event, wherein the system event is an event corresponding to information inputtable by at least one input device supported by the electronic device; and
a first event distribution module, configured to distribute, by the electronic device, the system event to an object related to the system event when it is determined that the voice information includes the information associated with the system event.
9. The apparatus of claim 8, wherein:
the first event determination module is specifically configured to match the voice information against an acoustic feature library and determine whether the voice information includes an acoustic feature corresponding to a system event, wherein the acoustic feature library is stored in the electronic device and includes a correspondence between acoustic features and system events.
10. The apparatus of claim 8, further comprising: a model library construction module, the model library construction module comprising:
an acoustic feature acquisition unit, configured to obtain the system events and corresponding acoustic features; and
a mapping model generation unit, configured to generate a mapping model between the system events and the corresponding acoustic features.
11. The apparatus of claim 10, wherein the acoustic feature acquisition unit comprises:
an event obtaining subunit, configured to obtain a system event corresponding to the inputtable information of the at least one input device;
a voice information obtaining subunit, configured to obtain voice information of the system event corresponding to the inputtable information of the at least one input device; and
an acoustic feature acquisition subunit, configured to extract acoustic features from the voice information of the system event.
12. The apparatus of claim 8, further comprising:
a voice recognition module, configured to perform voice recognition on the voice information to obtain text information when it is determined that the voice information does not include information associated with a system event;
a semantic information acquisition module, configured to determine semantic information of the text information;
a second event determination module, configured to determine whether the semantic information includes semantic information associated with a system event; and
a second event distribution module, configured to distribute, by the electronic device, the system event to an object related to the system event when the semantic information includes the semantic information associated with the system event.
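Purely as an editorial sketch, the modules of claims 8 to 12 could be wired together as follows; every field, type, and name below is hypothetical:

    from dataclasses import dataclass
    from typing import Callable, Optional

    # Hypothetical wiring of the claimed modules; each field stands in for one
    # of the modules named in claims 8 to 12.
    @dataclass
    class InteractionApparatus:
        receive: Callable[[], bytes]                        # information receiving module
        determine_event: Callable[[bytes], Optional[str]]   # first event determination module
        dispatch: Callable[[str], None]                     # event distribution modules
        recognize: Callable[[bytes], str]                   # voice recognition module
        get_semantics: Callable[[str], str]                 # semantic information acquisition module
        semantics_to_event: Callable[[str], Optional[str]]  # second event determination module

        def run_once(self) -> None:
            voice = self.receive()
            event = self.determine_event(voice)
            if event is None:
                event = self.semantics_to_event(self.get_semantics(self.recognize(voice)))
            if event is not None:
                self.dispatch(event)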
13. An electronic device, comprising:
one or more processors;
a storage device for storing executable instructions which, when executed by the one or more processors, implement the method of any one of claims 1 to 7.
14. The electronic device of claim 13, wherein:
the storage device is further configured to store an acoustic feature library.
15. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, implement a method according to any one of claims 1 to 7.
CN201910527533.8A 2019-06-18 2019-06-18 Interaction method, interaction device, electronic equipment and medium Pending CN112102820A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910527533.8A CN112102820A (en) 2019-06-18 2019-06-18 Interaction method, interaction device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN112102820A true CN112102820A (en) 2020-12-18

Family

ID=73749362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910527533.8A Pending CN112102820A (en) 2019-06-18 2019-06-18 Interaction method, interaction device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN112102820A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100217604A1 (en) * 2009-02-20 2010-08-26 Voicebox Technologies, Inc. System and method for processing multi-modal device interactions in a natural language voice services environment
CN103778915A (en) * 2012-10-17 2014-05-07 三星电子(中国)研发中心 Speech recognition method and mobile terminal
KR20140089863A (en) * 2013-01-07 2014-07-16 삼성전자주식회사 Display apparatus, Method for controlling display apparatus and Method for controlling display apparatus in Voice recognition system thereof
US20150088524A1 (en) * 2013-09-24 2015-03-26 Diotek Co., Ltd. Apparatus and method for generating an event by voice recognition
CN104715752A (en) * 2015-04-09 2015-06-17 刘文军 Voice recognition method, voice recognition device and voice recognition system
CN107978313A (en) * 2016-09-23 2018-05-01 苹果公司 Intelligent automation assistant
CN108231063A (en) * 2016-12-13 2018-06-29 中国移动通信有限公司研究院 A kind of recognition methods of phonetic control command and device
CN108320742A (en) * 2018-01-31 2018-07-24 广东美的制冷设备有限公司 Voice interactive method, smart machine and storage medium
US20190080685A1 (en) * 2017-09-08 2019-03-14 Amazon Technologies, Inc. Systems and methods for enhancing user experience by communicating transient errors
CN109656512A (en) * 2018-12-20 2019-04-19 Oppo广东移动通信有限公司 Exchange method, device, storage medium and terminal based on voice assistant

Similar Documents

Publication Publication Date Title
US11727927B2 (en) View-based voice interaction method, apparatus, server, terminal and medium
US10614803B2 (en) Wake-on-voice method, terminal and storage medium
CN111033492B (en) Providing command bundle suggestions for automated assistants
JP6960006B2 (en) How and system to handle unintentional queries in conversational systems
CN107210035B (en) Generation of language understanding systems and methods
JP5142720B2 (en) Interactive conversational conversations of cognitively overloaded users of devices
CN102915733A (en) Interactive speech recognition
JP7113047B2 (en) AI-based automatic response method and system
CN114830139A (en) Training models using model-provided candidate actions
US11810553B2 (en) Using backpropagation to train a dialog system
KR20200080400A (en) Method for providing sententce based on persona and electronic device for supporting the same
US11403462B2 (en) Streamlining dialog processing using integrated shared resources
EP3550449A1 (en) Search method and electronic device using the method
CN112487790A (en) Improved semantic parser including coarse semantic parser and fine semantic parser
WO2020052060A1 (en) Method and apparatus for generating correction statement
JP7044856B2 (en) Speech recognition model learning methods and systems with enhanced consistency normalization
CN116601648A (en) Alternative soft label generation
JP7182584B2 (en) A method for outputting information of parsing anomalies in speech comprehension
JP2022512271A (en) Guide method, device, device, program and computer storage medium for voice packet recording function
JP2022088586A (en) Voice recognition method, voice recognition device, electronic apparatus, storage medium computer program product and computer program
US20220180865A1 (en) Runtime topic change analyses in spoken dialog contexts
CN114047900A (en) Service processing method and device, electronic equipment and computer readable storage medium
CN112102820A (en) Interaction method, interaction device, electronic equipment and medium
JP2022121386A (en) Speaker dialization correction method and system utilizing text-based speaker change detection
CN109036379B (en) Speech recognition method, apparatus and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210525

Address after: 100176 room 1004, 10th floor, building 1, 18 Kechuang 11th Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing

Applicant after: Beijing Huijun Technology Co.,Ltd.

Address before: 100086 8th Floor, 76 Zhichun Road, Haidian District, Beijing

Applicant before: BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY Co.,Ltd.

Applicant before: BEIJING JINGDONG CENTURY TRADING Co.,Ltd.
