WO2022057870A1 - Human-computer interaction method, device and system - Google Patents

Human-computer interaction method, device and system

Info

Publication number
WO2022057870A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
voiceprint
output
content
output position
Prior art date
Application number
PCT/CN2021/118906
Other languages
English (en)
French (fr)
Inventor
黄胜森
陈显义
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP21868692.1A (published as EP4209864A4)
Publication of WO2022057870A1
Priority to US18/185,203 (published as US20230224181A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0487 Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser
    • G06F 3/0488 Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser, using a touch-screen or digitiser, e.g. input of commands through traced gestures
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 12/00 Data switching networks
    • H04L 12/02 Details
    • H04L 12/16 Arrangements for providing special services to substations
    • H04L 12/18 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L 12/1813 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast, for computer conferences, e.g. chat rooms
    • H04L 12/1831 Tracking arrangements for later retrieval, e.g. recording contents, participants activities or behavior, network status
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/03 Arrangements for converting the position or the displacement of a member into a coded form
    • G06F 3/041 Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means
    • G06F 3/0416 Control or interface arrangements specially adapted for digitisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G10L 17/10 Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2203/00 Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F 2203/038 Indexing scheme relating to G06F3/038
    • G06F 2203/0381 Multimodal input, i.e. interface arrangements enabling the user to issue commands by simultaneous use of input devices of different nature, e.g. voice plus gesture on digitizer

Definitions

  • the present application relates to the technical field of human-computer interaction, and in particular, to methods, devices and systems for human-computer interaction.
  • Human-computer interaction (HMI): the systems here can be machines of all kinds, or computerized systems and software.
  • Input can be achieved through touch operations on a touch screen. However, due to cost and technical reasons, realizing input on the touch screen through touch operations often makes the handwriting difficult to control, resulting in input difficulty and low input efficiency.
  • The human-computer interaction method, device, and system provided by the present application help to output and display the voice content of one or more users at the corresponding output positions on the touch screen by means of multi-modal human-computer interaction, improving the efficiency of touch-screen input and the user experience.
  • In a first aspect, the present application provides a human-computer interaction method, which is applied to a human-computer interaction system.
  • The method includes: first establishing a correspondence between a first voiceprint and a first output position on the touch screen; then, when a first voice is received and it is judged that the voiceprint of the voice matches the above-mentioned first voiceprint, recognizing the content of the voice, and outputting and displaying the recognized content at the above-mentioned first output position.
  • Through the first aspect of the present application, the voice content input by one or more users can be output and displayed at the output positions indicated by those users on the touch screen, which avoids the problem that the voice content of different users is mixed together on the touch screen when multiple people input by voice at the same time; the efficiency and experience of touch-screen input can therefore be improved.
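  • As an illustration of the flow described in the first aspect, the following Python sketch binds voiceprints to output positions and routes recognized speech accordingly. It is a minimal sketch only: VoiceprintDispatcher, cosine_similarity, the 0.8 threshold, and the injected extract_voiceprint / recognize_speech / display_at callables are illustrative assumptions, not interfaces defined by this application.

```python
from typing import Callable, List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Simple similarity measure between two voiceprint feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

class VoiceprintDispatcher:
    """Routes recognized speech to the output position bound to its voiceprint."""

    def __init__(self,
                 extract_voiceprint: Callable,   # audio -> feature vector (assumed helper)
                 recognize_speech: Callable,     # audio -> text (assumed ASR helper)
                 display_at: Callable,           # (position, text) -> None (assumed renderer)
                 threshold: float = 0.8):
        self._extract = extract_voiceprint
        self._recognize = recognize_speech
        self._display = display_at
        self._threshold = threshold
        self._bindings: List[Tuple[Sequence[float], object]] = []

    def bind(self, voiceprint: Sequence[float], output_position) -> None:
        # Establish the correspondence between a voiceprint and an output position.
        self._bindings.append((voiceprint, output_position))

    def on_voice(self, audio) -> bool:
        # Identify the speaker by voiceprint; if a binding exists, recognize and display the content.
        voiceprint = self._extract(audio)
        for stored, position in self._bindings:
            if cosine_similarity(stored, voiceprint) >= self._threshold:
                self._display(position, self._recognize(audio))
                return True
        return False   # no bound voiceprint matched; the voice is ignored
```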
  • The process of establishing the correspondence between the first voiceprint and the first output position on the touch screen may be: receiving a touch operation and a second voice; when judging that the touch operation matches a predetermined first rule and the second voice matches a predetermined second rule, determining the first output position on the touch screen according to the position of the touch operation and extracting the first voiceprint from the second voice; and then establishing the correspondence between the first voiceprint and the above-mentioned first output position.
  • The above process of determining the first output position according to the position of the touch operation may be: determining the starting position and range of the first output position according to the set of contact positions in the position of the touch operation.
  • the shape indicated by the above-mentioned first output position may be a rectangle, a circle, a diamond, or the like.
  • the first output position may include the coordinates of the upper left corner, and the width and height.
  • the first output position may further include its start position and end position, for example, the coordinates of the upper left corner and the lower right corner of the first output position.
  • the system can also output and display the first output position on the touch screen.
  • the first output position is displayed in the form of a frame, or other forms that can be distinguished from the current background of the touch screen.
  • The system can also generate and display a scroll bar on the touch screen in the horizontal or vertical direction of the first output position. In this implementation, it can be ensured that the voice content input by the user is recorded, and the user can conveniently browse any of the recorded voice content at any time, further improving the user experience.
  • The above process of judging that the touch operation matches the predetermined first rule may be: judging that the set of contact positions in the position of the touch operation is consistent with a predetermined position rule; or recognizing the shape formed by the set of contact positions in the position of the touch operation and judging that the shape matches a predetermined shape.
  • the above process of judging that the second voice matches the predetermined second rule may be: identifying the content of the second voice, and judging that the content matches the predetermined content.
  • the system may also release the correspondence between the first voiceprint and the first output position.
  • the correspondence between the first voiceprint and other output positions can be established, so that the user corresponding to the first voiceprint can switch to other output positions to input his voice content.
  • the system may further establish a correspondence between the second voiceprint and the first output position.
  • When a third voice is received and its voiceprint is judged to match the second voiceprint, the content of the third voice can be output to the blank space of the first output position, or can cover the content of the above-mentioned first voice.
  • In this way, a function by which one user supplements or modifies, through voice (the above-mentioned third voice), the voice content of another user (the above-mentioned first voice) can be realized. Therefore, through this implementation, multiple users can cooperate with each other to input content on the touch screen by voice, thereby improving the efficiency and experience of touch-screen input.
  • When outputting the content of the third voice, the system may also select an output format different from the one used when outputting the content of the first voice, so that the display effect is better.
  • The position of the sound source of the second voice can also be calculated, and before the above-mentioned correspondence between the first voiceprint and the first output position is established, it is judged whether the sound source position of the second voice and the position of the touch operation satisfy a preset condition. If they do, the correspondence between the first voiceprint and the first output position is established; otherwise, it is not established.
  • The system can also receive an image of the area in front of the touch screen collected by an image collector and, before establishing the correspondence between the first voiceprint and the first output position, analyze and track the content of the image in real time and determine, according to the results of the image analysis and tracking, whether the user who performs the touch operation and the user who utters the second voice are the same user. Alternatively, sound source localization and image tracking can be combined to determine whether the user who performs the touch operation and the user who utters the second voice are the same user. If it is determined that they are the same user, the correspondence is established; otherwise, it is not established.
  • In a second aspect, the present application provides a human-computer interaction method.
  • the human-computer interaction method is applied to computer equipment.
  • The method includes: first establishing a correspondence between a first voiceprint and a first output position; then, when a first voice is received and it is judged that the voiceprint of the voice matches the first voiceprint, recognizing the content of the voice and outputting the content to the first output position.
  • The above-mentioned process of establishing the correspondence between the first voiceprint and the first output position may be: receiving a contact position and a second voice; when judging that the contact position matches the predetermined first rule and the second voice matches the predetermined second rule, determining the first output position according to the contact position and extracting the first voiceprint from the above-mentioned second voice; and then establishing the correspondence between the first voiceprint and the above-mentioned first output position.
  • The above-mentioned process of determining the first output position according to the contact position may be: determining the starting position and range of the first output position according to the set of contact positions in the contact position.
  • the shape indicated by the above-mentioned first output position may be a rectangle, a circle, a diamond, or the like.
  • the first output position may include the coordinates of the upper left corner, and the width and height.
  • the first output position may further include its start position and end position, for example, the coordinates of the upper left corner and the lower right corner of the first output position.
  • the computer device may also output the first output position; for example, display the first output position in a frame or other forms that can be distinguished from the current background.
  • The above process of judging that the contact position matches the predetermined first rule may be: judging whether the set of contact positions in the contact position is consistent with a predetermined position rule; or identifying the shape formed by the set of contact positions in the contact position and judging whether the shape matches a predetermined shape.
  • The above-mentioned process of judging that the second voice matches the predetermined second rule may be: recognizing the content of the second voice, and judging whether the content matches the predetermined content.
  • the computer device may also release the correspondence between the first voiceprint and the first output position.
  • the computer device may further establish a correspondence between the second voiceprint and the first output position.
  • The content of the third voice may be output to the blank space of the first output position, or may cover the content of the above-mentioned first voice.
  • The position of the sound source of the second voice can also be calculated, and before the above-mentioned correspondence between the first voiceprint and the first output position is established, it is judged whether the sound source position of the second voice and the contact position satisfy a preset condition.
  • The computer device can also receive an image of the area in front of the touch screen collected by an image collector and, before establishing the correspondence between the first voiceprint and the first output position, analyze and track the content of the above-mentioned image in real time, and determine according to the results of the image analysis and tracking that the user who performs the touch operation and the user who utters the second voice are the same user. Alternatively, sound source localization and image tracking can be combined to determine that the user who performs the touch operation and the user who utters the second voice are the same user.
  • the present application provides a human-computer interaction system, and the human-computer interaction system can be used to execute any one of the methods provided in the first aspect or the second aspect.
  • the human-computer interaction system may include a touch screen and a processor.
  • the touch screen is used for receiving a touch operation and sending the position of the touch operation to the processor.
  • the processor executes any one of the human-computer interaction methods provided in the second aspect.
  • For the relevant content and the description of the beneficial effects of any possible technical solution of the processor, reference may be made to the technical solution provided by the second aspect or its corresponding possible designs, which will not be repeated here.
  • The above-mentioned human-computer interaction system may also include a voice collector for collecting the first voice, the second voice and the third voice, and sending the first voice, the second voice and the third voice to the processor.
  • the above-mentioned human-computer interaction system may further include an image collector for collecting an image near the front of the touch screen and sending it to the processor.
  • the present application provides a computer device, which can be used to execute any of the methods provided in the second aspect above.
  • the computer device may specifically be a processor or a device including a processor.
  • the device may be divided into functional modules according to any of the methods provided in the second aspect.
  • the computer device includes a speech processing unit and an integrated processing unit.
  • the voice processing unit is configured to receive the first voice, and recognize the content of the voice when judging that the voiceprint of the first voice matches the first voiceprint.
  • the integrated processing unit is used for establishing the corresponding relationship between the first voiceprint and the first output position; and is also used for outputting the content of the first voice to the first output position.
  • The computer device further includes a touch point position processing unit for receiving a touch point position, the touch point position being generated by a touch operation, and for determining, when it is judged that the touch point position matches a predetermined first rule, the first output position according to the touch point position.
  • the voice processing unit is further configured to receive the second voice, and extract the first voiceprint from the voice when it is judged that the voice matches the predetermined second rule.
  • the contact position processing unit when determining the first output position according to the contact position, is specifically configured to: determine the starting position and range of the first output position according to the set of contact positions in the contact position.
  • the integrated processing unit is further configured to output the first output position.
  • the integrated processing unit is further configured to release the correspondence between the first voiceprint and the first output position.
  • the voice processing unit is further configured to receive the third voice, and recognize the content of the third voice when judging that the voiceprint of the third voice matches the second voiceprint.
  • The integrated processing unit is also used to establish the correspondence between the second voiceprint and the first output position, and to output the content of the third voice to the blank space of the first output position or to cover the content of the above-mentioned first voice; it is also configured to select an output format different from the one used when outputting the content of the first voice.
  • the voice processing unit is further configured to calculate the position of the sound source of the second voice.
  • the integrated processing unit is further configured to determine whether the position of the sound source of the second voice and the position of the touch point satisfy a preset condition before establishing the corresponding relationship between the first voiceprint and the first output position.
  • the above computer device may further include an image processing unit for receiving images.
  • The image processing unit is further configured to analyze and track the image in real time before the integrated processing unit establishes the correspondence between the first voiceprint and the first output position, and to judge, according to the results of the analysis and tracking, whether the user who performs the touch operation and the user who uttered the first voice are the same user.
  • the computer device includes: a memory and one or more processors; the memory and the processor are coupled.
  • The above-mentioned memory is used to store computer program code.
  • The computer program code includes computer instructions which, when executed by the computer device, cause the computer device to perform the human-computer interaction method described in the second aspect and any possible design manner thereof.
  • The present application further provides a computer-readable storage medium comprising computer instructions which, when executed on the human-computer interaction system, enable the human-computer interaction system to implement the human-computer interaction method described in the first aspect or the second aspect and any possible design manner thereof.
  • The present application further provides a computer program product which, when run on a human-computer interaction system, enables the human-computer interaction system to implement the human-computer interaction method described in the first aspect or the second aspect and any possible design manner thereof.
  • FIG. 1 is a hardware structural diagram of a human-computer interaction system provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram 1 of a human-computer interaction system provided by an embodiment of the present application.
  • FIG. 3 is a second schematic structural diagram of a human-computer interaction system provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a human-computer interaction method provided by an embodiment of the present application.
  • FIGS. 5A-5C are schematic diagrams of determining an output position according to a touch operation according to an embodiment of the present application.
  • FIG. 6A and FIG. 6B are schematic diagrams of a method for calculating the location of a user who utters a first voice according to an embodiment of the present application.
  • FIG. 6C is a schematic diagram of a method for judging a correspondence between speech and facial features provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a computer program product provided by an embodiment of the present application.
  • Words such as “exemplary” or “for example” are used to represent examples, illustrations or explanations. Any embodiment or design described in the embodiments of the present application as “exemplary” or “for example” should not be construed as preferred or more advantageous than other embodiments or designs. Rather, the use of words such as “exemplary” or “for example” is intended to present the related concepts in a specific manner.
  • The terms “first” and “second” are used for description purposes only, and cannot be understood as indicating or implying relative importance or implying the number of indicated technical features.
  • a feature defined as “first” or “second” may expressly or implicitly include one or more of that feature.
  • plural means two or more.
  • Embodiments of the present application provide a human-computer interaction method, device, and system.
  • In the method, the human-computer interaction system obtains a touch operation and a first voice; when it is determined that the touch operation conforms to a predetermined first rule and the first voice conforms to a predetermined second rule, the correspondence between the voiceprint of the first voice and an output position is established, where the above-mentioned output position refers to the output position on the touch screen determined according to the touch operation. Then a second voice is acquired, and it is determined whether the voiceprint of the second voice matches the voiceprint of the first voice; if so, the text content corresponding to the second voice is output and displayed at that output position on the touch screen.
  • In the above process, before the correspondence between the voiceprint of the first voice and the output position is established, it can further be judged whether the position of the sound source of the first voice and the position of the touch operation satisfy a preset condition; if so, the above-mentioned correspondence is established.
  • the above judgment is to further confirm whether the user who performs the touch operation and the user who makes the first voice are the same user, so that the accuracy of establishing the corresponding relationship can be improved, and finally the robustness of the system can be improved.
  • A camera can also be used to collect an image of the area in front of the touch screen, and whether the user who performs the touch operation and the user who uttered the first voice are the same user can be judged through real-time analysis and tracking of the collected image. This process can also improve the accuracy of establishing the correspondence, thereby improving the robustness of the system.
  • In this way, one or more users can input to the touch screen by voice, which improves input efficiency; and the voice content input by each user can be output and displayed on the touch screen at the output position indicated by that user, which avoids the problem of the voice content of different users being mixed up on the touch screen, so the input experience is also improved.
  • the above-mentioned human-computer interaction method can be implemented by an application program installed on the device, such as a human-computer interaction application program.
  • the above application may be an embedded application installed in the device (ie, a system application of the device), or may be a downloadable application.
  • an embedded application is an application provided as part of the implementation of a device (such as a mobile phone).
  • A downloadable application is an application that can provide its own internet protocol multimedia subsystem (IMS) connection; it may be pre-installed in the device, or may be a third-party application downloaded by the user and installed on the device.
  • FIG. 1 is a hardware structure of a human-computer interaction system provided by an embodiment of the present application.
  • the human-computer interaction system 10 includes a processor 11 , a memory 12 , a touch screen 13 and a voice collector 14 .
  • the human-computer interaction system 10 may further include an image collector 15 .
  • The processor 11 is the control center of the human-computer interaction system 10, and can be a general-purpose central processing unit (CPU) or another general-purpose processor, such as a graphics processing unit (GPU). The general-purpose processor may be a microprocessor or any conventional processor. As an example, the processor 11 may include one or more CPUs, such as CPU 0 and CPU 1 shown in FIG. 1. Optionally, the processor 11 may further include one or more GPUs, such as GPU 0 and GPU 1 shown in FIG. 1.
  • The memory 12 can be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a magnetic disk or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 12 can be independent of the processor 11 , can also be connected to the processor 11 through a bus, or can be integrated with the processor 11 .
  • Memory 12 is used to store data, instructions or program codes. When the processor 11 calls and executes the instructions or program codes stored in the memory 12, the human-computer interaction method provided by the embodiments of the present application can be implemented.
  • the touch screen 13 may specifically include a touch panel 131 and a display screen 132 .
  • The touch panel 131 may be implemented as any of various types, such as a resistive, capacitive, infrared, or surface acoustic wave touch panel.
  • The touch panel 131 is used to collect touch events made by the user on or near it (such as operations performed by the user on or near the touch panel with a finger, a stylus, or any other suitable object) and to send the collected touch information to other devices (e.g., the processor 11).
  • A touch event made by the user near the touch panel may be called a floating touch; a floating touch means that, in order to select, move or drag an object (such as an icon), the user does not need to directly touch the touch panel, but only needs to be near the device in order to perform the desired function.
  • the display screen 132 may be configured in the form of a liquid crystal display screen, an organic light emitting diode, or the like.
  • The touch panel 131 can be overlaid on the display screen 132. When the touch panel 131 detects a touch event on or near it, it transmits the event to the processor 11 to determine the type of the touch event, and the processor 11 then provides corresponding visual output on the display screen 132 according to the type of the touch event.
  • the display screen 132 is used to display information input by the user or information provided to the user.
  • The voice collector 14, also called a “microphone” or “mic”, can be a single microphone or, alternatively, a microphone array.
  • the voice collector 14 is used to receive the voice signal, convert the voice signal into an electrical signal, and then send it to other devices (such as the processor 11) for processing.
  • When the voice collector 14 is a microphone array, it can also be used to locate the position of the sound source.
  • The image collector 15, also called a “camera”, may be an imaging device such as a CCD or CMOS sensor.
  • the image collector 15 is used for collecting images, and sending the collected image data to other devices (eg, the processor 11 ) for processing.
  • the above-mentioned processor 11, memory 12, touch screen 13, voice collector 14 and image collector 15 can be integrated on one device.
  • In this case, the human-computer interaction system 10 can be an electronic whiteboard, a smartphone, a notebook computer with a touch screen, a computer with a touch screen, a tablet, a netbook, an in-vehicle device, or other terminal equipment. Exemplarily, if it is an electronic whiteboard, as shown in FIG. 2, the above-mentioned human-computer interaction application program can run in the electronic whiteboard 20.
  • the human-computer interaction system 10 may further include a touch pen 21 , and the touch pen 21 is used to input touch operations on the touch screen 13 of the electronic whiteboard 20 .
  • the above-mentioned processor 11, memory 12, touch screen 13, voice collector 14 and image collector 15 can also be integrated on different devices respectively.
  • the above-mentioned human-computer interaction system 10 can include multiple devices, to execute the human-computer interaction method provided by the embodiments of the present application.
  • the human-computer interaction system 10 shown in FIG. 3 may include: an electronic whiteboard 20 , a computer 32 and a projector 33 .
  • the human-computer interaction system 10 may further include a touch pen 21 , and the touch pen 21 is used to input touch operations on the touch screen 13 of the electronic whiteboard 20 .
  • the processor 11 may be the processor of the computer 32 .
  • the memory 12 may be the memory of the computer 32 .
  • the above-mentioned human-computer interaction application program can be executed in the computer 32 .
  • the touch screen 13 may be the touch screen of the electronic whiteboard 20 .
  • the voice collector 14 may be integrated in the electronic whiteboard 20 .
  • the voice collector 14 may also be integrated in the computer 32 , the projector 33 or the touch pen 21 , which is not limited in the implementation of this application.
  • the image collector 15 may be integrated in the electronic whiteboard 20 .
  • When the voice collector 14 is a microphone array, its integration position and the touch screen need to satisfy a certain relationship: for example, it can be integrated on the upper edge or the lower edge of the touch screen, parallel to the horizontal direction of the touch screen, with its midpoint coinciding with the horizontal midpoint of the touch screen (as shown in FIG. 2 or FIG. 3).
  • Alternatively, its deployment position may be on the left or right side of the touch screen, parallel to the vertical direction of the touch screen, with its midpoint coinciding with the midpoint of the touch screen in the vertical direction.
  • the structure shown in FIG. 1 does not constitute a limitation of the human-computer interaction system 10.
  • The human-computer interaction system 10 may include more or fewer components than those shown, or combine some components, or have a different arrangement of components; the above description of the human-computer interaction system 10 is also only an exemplary illustration and does not constitute a limitation on this embodiment.
  • FIG. 4 shows a schematic flowchart of a human-computer interaction method provided by an embodiment of the present application.
  • In the following, a conference room scenario is taken as an example, in which an electronic whiteboard is deployed in the conference room.
  • The human-computer interaction method includes, but is not limited to, the following steps:
  • the touch screen receives the touch operation, and determines the position information of the touch operation.
  • When user A performs a touch operation on the touch screen with a finger or a touch pen, the touch screen will sense the operation and obtain the position information of the touch operation.
  • the position information of the touch operation may be the position of the touch point of the user A's finger or the touch pen on the touch screen when the user A performs the touch operation, for example, may be the coordinates of the touch point generated by the touch operation of the user A on the touch screen.
  • The position information of the touch operation can be the position of a single touch point on the touch screen, for example, the position of the touch point generated by user A touching the screen with a single finger; the position information of the touch operation can also be the positions of multiple touch points on the touch screen, for example, the two contact positions generated when user A simultaneously touches the screen with the index fingers of the left hand and the right hand.
  • user A draws a line on the touch screen with a finger or a touch pen.
  • In this case, the touch screen obtains the position information of the touch operation many times in succession.
  • the position information obtained each time includes a single touch point position, which represents the touch point positions generated at different times during the sliding process of user A drawing a line.
  • the touch screen sends the position information of the touch operation to the processor.
  • The touch screen may send the positions of the touch points generated by the detected touch operation to the processor periodically, on a triggered basis, or in real time.
  • a touchscreen can send a frame of data to the processor every cycle.
  • the frame data includes the position of the contact point detected in the period; it can be understood that when the contact point is not detected in the period, the frame data sent will not include the position of the contact point.
  • In other words, the touch screen may send the positions of the single or multiple contacts detected in each period to the processor once per period; in order to maintain sensitivity and real-time response to touch operations, the period is usually very short (for example, on the millisecond level). Therefore, during the continuous process in which user A completes one complete touch operation, the touch screen usually sends the position information of the touch operation to the processor multiple times.
  • When the touch screen sends the position information of the touch operation to the processor, it also carries the touch state information corresponding to each contact position in the position information, such as the “pressed”, “raised” or “moving” state.
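  • For illustration only, one way to model the per-period touch report assumed above is sketched below; the TouchFrame name, its fields, and the example values are assumptions of this sketch, not a data format specified by the application.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TouchFrame:
    """One reporting period of touch data sent by the touch screen to the processor."""
    timestamp_ms: int
    contacts: List[Tuple[float, float]] = field(default_factory=list)  # (x, y) screen coordinates
    states: List[str] = field(default_factory=list)                    # "pressed" / "moving" / "raised"

# Example frame carrying two simultaneous contacts, both pressed.
frame = TouchFrame(timestamp_ms=1000,
                   contacts=[(120.0, 80.0), (640.0, 80.0)],
                   states=["pressed", "pressed"])
```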
  • the processor receives the position information of the touch operation, and determines whether the touch operation matches the predetermined first rule.
  • the processor receives the position information of the touch operation in real time.
  • The processor usually receives the position information of the touch operation multiple times during the process in which user A completes one complete touch operation. Therefore, when the processor receives the position information of a touch operation, it will first determine whether a certain user (here, user A is taken as an example) has completed the current touch operation. Specifically, the processor may determine whether the touch operation is completed by tracking the touch operation. For example, the processor starts timing each time the position information of the touch operation is received; if the position information of the touch operation is not received again within a predetermined time interval, it is determined that the current touch operation is completed, and from that moment on, position information of subsequently received touch operations is treated as the position information of the next touch operation.
  • If the position information of the touch operation is received again within the predetermined time interval, it is considered that the current touch operation has not been completed, and the newly received position information is taken as part of the position information of the current touch operation. Timing then restarts from the moment that position information is received again, and the judgment continues according to the above process, until, after some position information of the touch operation has been received, no further position information is received within the predetermined time interval, at which point the touch operation is considered to have ended.
  • the predetermined time interval is set according to an empirical value or an actual situation, which is not limited in this application.
  • the processor may also determine whether the touch operation ends according to the touch state information carried in the position information of the touch operation. For example, when the touch state is "pressed” or "moved”, it is determined that the touch operation is not completed; when the touch state is "raised”, it is determined that the touch operation has been completed.
  • the above-mentioned method for the processor to determine whether the touch operation is completed is only an exemplary method. According to the different driving implementation manners of different touch screen manufacturers, the method by which the processor determines whether the touch operation is completed may also be different, which is not limited in this embodiment of the present application.
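  • A minimal sketch of the two completion-detection strategies mentioned above (timeout on the one hand, a lifted/“raised” state on the other) is shown below; the class name and the 0.3 s idle interval are assumed values, and real touch-screen drivers may report completion differently.

```python
import time

class TouchTracker:
    """Accumulates contact positions of one touch operation and decides when it has ended."""

    def __init__(self, idle_interval_s: float = 0.3):
        self.idle_interval_s = idle_interval_s   # predetermined time interval (assumed value)
        self.current_points = []
        self.last_update = None

    def on_frame(self, contacts, states):
        # Collect the contact positions belonging to the ongoing operation.
        self.current_points.extend(contacts)
        self.last_update = time.monotonic()
        # Strategy 2: the frame carries a "raised" state, so the operation has finished.
        if "raised" in states:
            return self._finish()
        return None

    def poll(self):
        # Strategy 1: no new position information within the predetermined time interval.
        if (self.last_update is not None
                and time.monotonic() - self.last_update > self.idle_interval_s):
            return self._finish()
        return None

    def _finish(self):
        points = self.current_points
        self.current_points, self.last_update = [], None
        return points   # the complete set of contact positions of this touch operation
```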
  • the processor determines whether the touch operation matches the predetermined first rule.
  • the first rule is any pre-defined or configured rule about touch operation, and the rule is used to determine whether user A wants to start voice input.
  • For example, the predetermined first rule can be "any point on the touch screen"; it can also be "two points on the touch screen, where the line connecting the two points is parallel to the horizontal direction of the touch screen and the distance between the two points is not less than a preset value M"; it can also be "a touch track that is parallel to the horizontal direction of the touch screen and whose length is not less than a preset value".
  • The processor can determine whether the set of one or more touch point positions generated by the current touch operation, as received, satisfies the conditions defined by the above first rule; if so, the touch operation can be considered to match the predetermined first rule.
  • The predetermined first rule can also be an image containing a certain shape.
  • In this case, the processor can use image matching technology to compare the shape formed by the position information of the touch operation received during the current touch operation with the predetermined shape; if the comparison result is consistent, the touch operation can be considered to match the predetermined first rule.
  • the content of the predetermined first rule and the manner of judging whether the touch operation matches the predetermined first rule may be other contents and manners than the above examples, which are not limited in this application.
  • If the processor determines that the touch operation matches the predetermined first rule, it preliminarily determines that user A wants to start voice input and continues to perform the subsequent steps of the method; on the contrary, if the processor determines that the touch operation of user A does not match the predetermined first rule, it considers that user A is performing some other touch operation, and the process ends.
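  • As an example of such a check, the sketch below implements the "two contacts roughly on a horizontal line, at least a preset distance M apart" rule quoted above; the tolerance values are assumptions made for illustration, not values taken from the application.

```python
from typing import Sequence, Tuple

def matches_two_point_rule(points: Sequence[Tuple[float, float]],
                           min_distance_px: float = 200.0,       # preset value M (assumed)
                           max_vertical_skew_px: float = 20.0) -> bool:
    """True if exactly two contacts lie roughly on a horizontal line, at least M apart."""
    if len(points) != 2:
        return False
    (x1, y1), (x2, y2) = points
    roughly_horizontal = abs(y1 - y2) <= max_vertical_skew_px   # line ~parallel to the screen's horizontal axis
    far_enough = abs(x1 - x2) >= min_distance_px                # spacing not less than the preset value M
    return roughly_horizontal and far_enough

# Two index-finger contacts 520 px apart on (almost) the same horizontal line:
assert matches_two_point_rule([(120.0, 80.0), (640.0, 86.0)])
```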
  • the processor determines the output position on the touch screen according to the position information of the touch operation.
  • When the processor determines that the touch operation conforms to the predetermined first rule, it can determine that user A wants to start voice input, and can also determine, from the position information of the touch operation, the output position on the touch screen indicated by user A.
  • the output position is used to indicate the starting position and range of the voice content output on the touch screen, that is, the position and size of the output box displaying the voice content.
  • the following takes the case where the shape of the output frame is a rectangle as an example to describe the content included in the output position and the manner of determination.
  • the output position includes the start position of the output box, and the width and height of the output box. If the output sequence from left to right and top to bottom is taken as an example, the starting position of the above-mentioned output box may be the coordinates of the upper left corner of the output box.
  • the above output position can be determined in the following manner: take the upper left corner of the touch screen as the origin, one side with the origin as the endpoint as the horizontal axis, and the other side with the origin as the endpoint as the vertical axis.
  • the processor may use the contact A or B as the first contact, and the contact C or D as the second contact. Then take the first contact as the starting position of the output box, take the horizontal coordinate difference between the second contact and the first contact as the width of the output box, and take the vertical coordinate difference between the second contact and the first contact as The height of the output box.
  • the contact point A When the contact point A is used as the first contact point and the contact point C is the second contact point, the starting position of the output frame, as well as its width and height are shown in FIG. 5A .
  • the contact point A When the contact point A is used as the first contact point and the contact point D is the second contact point, the starting position of the output frame, as well as its width and height are shown in FIG. 5B .
  • the contact B is used as the first contact, and the contact C or D is used as the second contact, the same can be obtained.
  • The determination method of the output position may also be: with the upper left corner of the touch screen as the origin, taking from the set of contacts included in the received position information of the touch operation the point formed by the smallest horizontal coordinate and the smallest vertical coordinate as the first contact (this point may not be an actual contact), and the point formed by the largest horizontal coordinate and the largest vertical coordinate as the second contact (this point may also not be an actual contact). Then, in a manner similar to that described above, the starting position of the output box, as well as its width and height, can be determined. For example, in the set of contacts shown in FIGS. 5A-5C:
  • a point (xa, yb) formed by the horizontal coordinate xa of the contact point A and the vertical coordinate yb of the contact point B is taken as the first contact point.
  • a point (xd, yc) formed by the horizontal coordinate xd of the contact point D and the vertical coordinate yc of the contact point C is taken as the second contact point.
  • the starting position of the output box, and its width and height are shown in FIG. 5C .
  • the output position includes a start position and an end position of the output box. If the output sequence from left to right and top to bottom is still taken as an example, the start and end positions of the output box are the coordinates of the upper left corner and the lower right corner of the output box, respectively.
  • the processor may determine the coordinates of the upper left corner and the lower right corner of the output frame from the set of contacts included in the received position information of the touch operation. For example, the coordinates of the first contact point and the second contact point determined in the first implementation manner can be used as the coordinates of the upper left corner and the lower right corner of the output box, respectively, to determine the starting position and end position. The specific process is not repeated here.
  • the system may define the height of the output box to be a preset value. In this implementation, this step only needs to determine the starting position and width of the output box. Alternatively, the system may also define that the height and width of the output box are both preset values. In this implementation, only the starting position of the output box needs to be determined in this step.
  • The content contained in the above-mentioned output position and the determination method are only exemplary descriptions; in some implementations, the system may also define the shape of the above-mentioned output box for displaying the voice content to be a shape other than a rectangle, such as a diamond or a circle.
  • the content included in the output location and the determination manner may be other implementation manners than the foregoing first and second implementation manners, which are not limited in this embodiment of the present application.
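  • The smallest/largest-coordinate determination manner above can be summarized in a few lines of Python; this is a sketch under the stated left-to-right, top-to-bottom convention, with the coordinate origin at the upper left corner of the touch screen, and the example contact values are invented for illustration.

```python
from typing import Sequence, Tuple

def output_position_from_contacts(contacts: Sequence[Tuple[float, float]]) -> dict:
    """Start position = (min x, min y); width/height from the (max x, max y) point."""
    xs = [x for x, _ in contacts]
    ys = [y for _, y in contacts]
    x0, y0 = min(xs), min(ys)   # first contact (may not be an actual contact)
    x1, y1 = max(xs), max(ys)   # second contact (may not be an actual contact)
    return {"x": x0, "y": y0, "width": x1 - x0, "height": y1 - y0}

# Four contacts roughly corresponding to A, B, C, D in FIGS. 5A-5C (values assumed):
print(output_position_from_contacts([(100, 120), (110, 90), (400, 300), (420, 260)]))
# -> {'x': 100, 'y': 90, 'width': 320, 'height': 210}
```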
  • the processor may also call a rendering instruction to generate the output position according to the determined output position (ie, the position and size of the output frame), so that the output position can be displayed on the display screen.
  • The above output position can be generated and output in the form of a frame, and the border of the output box can be a black dashed frame as shown in FIGS. 5A-5C, a dashed frame of another color, or a solid frame in black or another color.
  • the processor may display the output location by changing the background color of the output location, or in any manner that distinguishes the output location from the background currently displayed on the display screen. This embodiment of the present application does not limit this.
  • the voice collector collects and sends the first voice to the processor.
  • the voice collector collects the first voice signal in real time, and sends the collected first voice to the processor.
  • the above-mentioned first voice may be sent by user A in the conference site, or may be sent by other users in the conference site.
  • the processor receives the first voice, and determines whether the first voice matches the predetermined second rule.
  • the second rule is used to determine whether the user who made the first voice, such as user A, wants to start voice input.
  • the predetermined second rule may be a text of a specific content, such as "start voice input”.
  • The method for judging whether the first voice matches the predetermined second rule may be: first converting the received first voice signal into text by using voice recognition technology, and then using text matching to judge whether the content of the received first voice is consistent with the above-mentioned specific content. If they are consistent, it is determined that the first voice matches the predetermined second rule; otherwise, it can be determined that they do not match.
  • the text matching method used by the processor can be not only literal matching, but also semantic level matching. For example, if the content of the first voice received by the processor is "start voice input", the processor determines that the content of the voice matches the above-mentioned specific content "start voice input” at the semantic level, and thus judges that the first voice matches the predetermined content. The second rule matches.
  • the content of the above-mentioned predetermined second rule and the manner of judging whether the first voice matches the predetermined second rule may be other contents and manners than the above-mentioned examples, which are not limited in this embodiment of the present application.
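  • As a simple illustration of the literal-matching variant described above, the recognized text can be normalized and compared against the predetermined content; the activation phrase and the normalization steps here are assumptions, and the semantic-level matching also mentioned above would require an additional similarity model that is omitted from this sketch.

```python
def matches_activation_phrase(recognized_text: str,
                              activation_phrases=("start voice input",)) -> bool:
    """Literal match of ASR output against a predetermined phrase (second rule)."""
    normalized = recognized_text.strip().lower().rstrip(".!?")
    return any(normalized == phrase.lower() for phrase in activation_phrases)

# recognized_text would come from applying speech recognition to the first voice.
assert matches_activation_phrase("Start voice input.")
assert not matches_activation_phrase("please erase the whiteboard")
```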
  • If the processor determines that the first voice matches the predetermined second rule and, in combination with the above step S406, the processor has already received a touch operation that meets the first rule, it considers that the first voice was uttered by user A, that is, that the touch operation received in step S406 and the first voice received in this step were respectively performed and uttered by the same user; it is thereby confirmed that user A wants to start voice input, and the subsequent steps of the method continue to be executed.
  • If the processor determines that the first voice does not match the predetermined second rule, it considers that the first voice was not uttered by user A, that is, that the touch operation received in step S406 and the first voice received in this step were not performed and uttered by the same user; for example, the first voice may be a random voice uttered by another user in the venue.
  • the processor confirms that it is not user A who wants to start the voice input, and the subsequent steps are not executed.
  • the processor establishes a correspondence between the voiceprint of the first voice and the output position.
  • The processor performs denoising and other processing on the first voice signal, extracts the voiceprint features of the first voice, such as acoustic or linguistic features, from the processed signal, and thereby establishes the voiceprint of the first voice, that is, the voiceprint of user A.
  • The specific process of establishing the correspondence may include: using the voiceprint feature data of user A as the key and the output position on the touch screen determined in S405 (assuming that the output position is (x1, y1, w1, h1)) as the value, and storing them in the voiceprint library shown in Table 1 in the form of a dictionary or a hash table, so that in subsequent steps the corresponding output position on the touch screen can be determined from the voiceprint of user A.
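  • A Table 1-style entry can be sketched as a plain key/value store; note that real voiceprint feature vectors are not exactly reproducible across utterances, so a deployed system would match by similarity (as in the earlier dispatcher sketch) rather than by exact key lookup. The feature values and positions below are invented for illustration.

```python
# Voiceprint library: voiceprint feature data as the key, output position as the value.
voiceprint_library = {}

user_a_features = (0.12, -0.87, 0.44)        # assumed voiceprint feature data for user A
output_position = (100, 90, 320, 210)        # (x1, y1, w1, h1) determined in S405 (assumed values)
voiceprint_library[user_a_features] = output_position

# In later steps (e.g. S420/S424), after extracting the voiceprint of an incoming voice and
# finding the matching entry, the bound output position is retrieved:
print(voiceprint_library[user_a_features])   # -> (100, 90, 320, 210)
```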
  • the voice collector collects the second voice, and sends the second voice to the processor.
  • the voice collection device will collect the voices of different people. Therefore, the second voice may be sent by user A or may be sent by other users.
  • the processor receives the second voice, and determines whether the voiceprint of the second voice matches the voiceprint of the first voice.
  • Since the second voice may be uttered by any user, and in some scenarios user A does not want to be disturbed by the voices of other users while outputting voice content to the output position, the processor, when it receives the second voice, will determine whether the second voice was uttered by user A.
  • The processor may apply to the second voice a processing method similar to that applied to the first voice in S414: perform denoising and other processing on the second voice and then extract its voiceprint features; and then use voiceprint recognition technology to determine whether the voiceprint of the second voice matches the voiceprint of user A in the voiceprint library shown in Table 1, thereby determining whether the second voice was uttered by user A.
  • If the result of the judgment is yes, step S424 is executed. If the result of the judgment is no, indicating that the voice is not user A's, the voice signal can be discarded, and the process returns to step S418 to re-execute the process of collecting and judging the voice signal.
  • the processor recognizes the content of the second voice, and outputs and displays the content to the output position on the touch screen.
  • The processor can convert the second voice signal into text content by using voice recognition technology, and then output and display the text content at the output position on the touch screen, where the output position refers to the output position corresponding to the voiceprint of the first voice according to the correspondence established in S416.
  • The process by which the processor outputs the content of the second voice may be: according to the output position information and the text content, calling a rendering instruction to generate a text window containing the content of the second voice, then fusing the generated text window with other windows (such as the system background window) according to the output position information to generate a pixel array, and then outputting the pixel array to a predetermined storage area (such as the system's frame buffer).
  • When a display signal (for example, a vertical synchronization signal) arrives, the content of the predetermined storage area can be displayed on the touch screen.
  • The embodiment of the present application can also continue to collect a third voice and apply to it the same processing and judgment process as for the second voice; in this way, after user A successfully starts the voice input function, the content input by user A through voice is continuously output to the output box indicated by user A (the output position indicated by the above-mentioned touch operation).
  • A scroll bar in the horizontal or vertical direction of the output box can be generated as the situation requires, to ensure that the voice content input by user A is recorded, and all of the recorded content can be displayed by means of the scroll bar.
  • the processor may output the most recently input content in the output location by default.
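A minimal sketch of such an output box is given below: every recognised line is kept, the newest lines are shown by default, and older lines stay reachable through a scroll offset; the class and parameter names are assumptions.

```python
class OutputBox:
    """Keeps every recognised line and shows the most recent ones by default."""

    def __init__(self, visible_lines=5):
        self.visible_lines = visible_lines  # how many lines the box can display
        self.lines = []                     # full transcript, nothing is discarded
        self.scroll_offset = 0              # 0 means "show the newest content"

    def append(self, text):
        self.lines.append(text)
        self.scroll_offset = 0              # new input snaps the view to the latest lines

    def scroll(self, lines_back):
        max_offset = max(0, len(self.lines) - self.visible_lines)
        self.scroll_offset = min(max(0, lines_back), max_offset)

    def visible(self):
        end = len(self.lines) - self.scroll_offset
        start = max(0, end - self.visible_lines)
        return self.lines[start:end]

box = OutputBox(visible_lines=2)
for i in range(5):
    box.append(f"recognised sentence {i}")
print(box.visible())   # the two newest lines, shown by default
box.scroll(3)
print(box.visible())   # older lines, reached through the scroll bar
```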
  • in practical applications, the voice collector can be an array microphone. Since an array microphone can locate the position of a sound source, in another implementation the following step S415 may be further included between the above-mentioned steps S414 and S416.
  • the processor determines whether the position of the sound source of the first voice and the position information of the touch operation satisfy a preset condition.
  • the user who performs the touch operation is usually located near the position of the touch operation; therefore, by judging whether the position of the sound source of the first voice and the position information of the touch operation satisfy the preset condition, the processor can verify whether the first voice was uttered by user A (that is, it further verifies whether the user who performed the touch operation received in step S406 and the user who uttered the first voice received in step S414 are the same user).
  • the processor receives the first voice collected by the array microphone, calculates the position of the sound source to which the first voice belongs according to the first voice, and obtains the position of the sound source relative to the array microphone.
  • as described above in the introduction of the human-computer interaction system (refer to FIG. 2 or FIG. 3), the array microphone is usually integrated on the upper or lower edge of the touch screen (that is, the array microphone is parallel to the horizontal direction of the touch screen and its midpoint coincides with the horizontal midpoint of the touch screen); therefore, the position of the sound source relative to the array microphone can also be taken as the position of the sound source relative to the touch screen. This position includes the vertical distance of the sound source relative to the touch screen, and the horizontal distance relative to the midpoint of the touch screen in the horizontal direction.
  • assuming that the array microphone collecting the first voice is an array of n microphones (MC1 to MCn), the process of using the array microphone to calculate the position of the sound source is as follows:
  • MC1 and MC2 are used as the sub-microphone array at the left end of the microphone array, and C1 is the midpoint of that left-end sub-array; similarly, MC3 and MC4 are used as the sub-microphone array at the right end, and C2 is the midpoint of that right-end sub-array.
  • first, using the time difference with which the sound from the source arrives at MC1 and MC2 (that is, the difference between the times at which MC1 and MC2 capture the voice signal) and the distance between MC1 and MC2, the angle α1 between the array microphone and the line connecting the sound source and C1 is calculated; similarly, the angle α2 between the array microphone and the line connecting the sound source and C2 is calculated.
  • further, since the distance between C1 and C2 is known, the vertical distance H of the sound source relative to the array microphone (that is, the vertical distance relative to the touch screen) and the horizontal distance W relative to the midpoint of the array microphone (that is, the horizontal distance relative to the midpoint of the touch screen in the horizontal direction) can be calculated from trigonometric relationships.
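The sketch below solves this triangle for H and W. The coordinate convention (C1 and C2 symmetric about the array midpoint, angles measured from the array axis towards the source) is an assumption; the patent only states that trigonometric relationships are used.

```python
import math

def locate_sound_source(alpha1_deg, alpha2_deg, baseline_m):
    """Triangulate the source from the angles measured at the two sub-array midpoints.

    alpha1_deg: angle at C1 between the array axis (towards C2) and the source.
    alpha2_deg: angle at C2 between the array axis (towards C1) and the source.
    baseline_m: distance between C1 and C2.
    Returns (H, W): vertical distance from the array line and horizontal offset
    from the array midpoint (assumed to lie halfway between C1 and C2).
    """
    a1 = math.radians(alpha1_deg)
    a2 = math.radians(alpha2_deg)
    # Law of sines in the triangle (source, C1, C2).
    r1 = baseline_m * math.sin(a2) / math.sin(a1 + a2)   # distance source-to-C1
    h = r1 * math.sin(a1)                                # vertical distance H
    x_from_c1 = r1 * math.cos(a1)                        # offset along the array from C1
    w = x_from_c1 - baseline_m / 2.0                     # offset from the array midpoint
    return h, w

# Example: equal angles mean the source sits directly in front of the midpoint (W = 0).
print(locate_sound_source(70.0, 70.0, 0.4))
```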
  • the vertical distance of the touch operation's position relative to the touch screen is 0, and its horizontal distance relative to the midpoint of the touch screen in the horizontal direction can be obtained from the horizontal coordinate in the position information of the touch operation; therefore, taking the touch screen as the reference, the vertical distance between the sound source and the position of the touch operation (that is, the vertical distance between the sound source and the touch screen) and the horizontal distance between the sound source and the position of the touch operation are obtained.
  • if the vertical distance between the sound source and the position of the touch operation does not exceed a predetermined range (for example, 0.5 m based on empirical values), and the horizontal distance between the sound source and the position of the touch operation also does not exceed a predetermined range (for example, 0.5 m), the first voice is considered to have been uttered by user A (that is, when the result of this judgment is yes, the user who performs the touch operation and the user who utters the first voice are considered to be the same user); otherwise, the first voice is considered not to have been uttered by user A (that is, when the result of this judgment is no, the user who performs the touch operation and the user who utters the first voice are considered not to be the same user).
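A minimal sketch of this preset-condition check follows; the 0.5 m defaults mirror the empirical values mentioned above, and the parameter names are assumptions.

```python
def same_user_by_position(source_h_m, source_w_m, touch_w_m,
                          max_vertical_m=0.5, max_horizontal_m=0.5):
    """Check the preset condition between the sound source and the touch position.

    source_h_m: vertical distance of the sound source from the touch screen.
    source_w_m: horizontal offset of the sound source from the screen's horizontal midpoint.
    touch_w_m:  horizontal offset of the touch position from the same midpoint
                (its vertical distance from the screen is 0 by definition).
    """
    vertical_ok = source_h_m <= max_vertical_m
    horizontal_ok = abs(source_w_m - touch_w_m) <= max_horizontal_m
    return vertical_ok and horizontal_ok

print(same_user_by_position(0.35, 0.10, -0.05))  # True: within both ranges
print(same_user_by_position(0.90, 0.10, -0.05))  # False: speaker too far from the screen
```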
  • optionally, judging whether the user who performs the touch operation and the user who utters the first voice are the same user can also be implemented by collecting and tracking images of the area in front of the touch screen.
  • assume that the camera is deployed on the touch screen, for example at the center of its upper edge (such as the position shown in FIG. 2 or FIG. 3); it can then capture images of the area in front of the touch screen and send them to the processor in real time. The processor analyzes and tracks the images sent by the camera in real time, for example it analyzes and tracks body movements and lip-movement information in the images, so as to determine whether the user who performs the touch operation and the user who utters the first voice are the same user.
  • in addition, sound source localization of the first voice and image tracking can be combined to make this judgment; assuming the camera is deployed at the midpoint of the array microphone, the processor can identify the facial features of the user who performs the touch operation through a machine learning algorithm, for example the facial feature Face-A of user A, and bind Face-A to the position indicated by user A's touch operation, such as (x1, y1, w1, h1). The processor then calculates the angle a1-A of user A's facial feature Face-A relative to the camera.
  • the angle a2-A of the sound source of the first voice relative to the midpoint of the array microphone (that is, the camera) can be obtained through the sound source position calculation described above. Then, by comparing a1-A with a2-A, if the difference between the two is within a predetermined range, the first voice and Face-A are considered to belong to the same user, that is, the user who performed the touch operation and the user who uttered the first voice are the same user, so the voiceprint of the first voice can be associated with (x1, y1, w1, h1).
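A minimal sketch of this angle comparison is shown below; the tolerance value and function names are illustrative assumptions.

```python
def voice_belongs_to_face(face_angle_deg, source_angle_deg, tolerance_deg=10.0):
    """Associate the first voice with a tracked face when the two angles agree.

    face_angle_deg:   angle a1-A of the tracked face relative to the camera.
    source_angle_deg: angle a2-A of the voice's sound source relative to the
                      array-microphone midpoint (assumed to coincide with the camera).
    tolerance_deg:    illustrative tolerance, not a value taken from the patent.
    """
    return abs(face_angle_deg - source_angle_deg) <= tolerance_deg

# If the angles agree, the voiceprint of the first voice is bound to the output
# position already associated with Face-A, e.g. (x1, y1, w1, h1).
if voice_belongs_to_face(42.0, 38.5):
    voiceprint_library = {"voiceprint-feature-A": (120, 80, 400, 200)}
    print(voiceprint_library)
```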
  • when another user starts voice input at the same time as user A, the above method can avoid establishing a correspondence between user A's voiceprint and the output position indicated by that other user, thus improving the robustness of the system.
  • it should also be noted that the detection and judgment of user A's touch operation in steps S402-S406, and the collection and judgment of user A's voice signal in steps S410-S414, are two conditions for determining whether user A wants to start voice input. These two conditions are judged in no particular order; that is, steps S402-S406 and steps S410-S414 have no fixed execution order.
  • the detection and judgment of touch operations can be performed first, and the collection and judgment of voice signals can also be performed first.
  • as described above, in this embodiment of the present application, the terminal device determines from user A's touch operation and voice input that user A wants to start voice input, and associates user A's voiceprint with the text output position indicated by user A. In this way, when multiple people in the venue are speaking (including user A), voiceprint matching ensures that only user A's voice content, rather than that of other users in the venue, is displayed at the text output position indicated by user A (and not elsewhere on the touch screen).
  • the first possible implementation may be: user A can instruct the system to release the correspondence between user A's voiceprint and the output position (x1, y1, w1, h1) by uttering a voice that complies with a third rule (for example, a voice whose content matches "close voice input").
  • specifically, when the processor detects a voice that conforms to the third rule, it extracts the voiceprint of that voice, compares it one by one with the voiceprints in the voiceprint library saved by the system, and removes the matching voiceprint and its corresponding output position from the voiceprint library.
  • then, user A instructs the system to start voice input again through a process similar to the one described above (performing a touch operation that conforms to the first rule and uttering a voice that conforms to the second rule), and a correspondence between user A's voiceprint and the newly designated output position (x3, y3, w3, h3) is established.
  • the second possible implementation manner may be: User A directly instructs the system to restart the voice input through the above-mentioned similar process, and establishes the corresponding relationship between its voiceprint and the newly designated output position (x3, y3, w3, h3).
  • in this implementation, when the processor receives a voice that conforms to the second rule and determines that the voiceprint of that voice already exists in the voiceprint table saved by the system, it directly updates the output position corresponding to the voiceprint, so that the output position corresponding to user A's voiceprint is updated to (x3, y3, w3, h3).
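A minimal sketch of this re-binding step follows, reusing the dictionary-style voiceprint library shown earlier; the coordinates are illustrative.

```python
def start_voice_input(voiceprint_library, voiceprint_id, new_output_position):
    """Second implementation sketch: if the voiceprint already exists, its output
    position is simply overwritten with the newly designated box; otherwise a
    new entry is added."""
    voiceprint_library[voiceprint_id] = new_output_position
    return voiceprint_library

library = {"voiceprint-feature-A": (120, 80, 400, 200)}     # old box (x1, y1, w1, h1)
start_voice_input(library, "voiceprint-feature-A", (700, 300, 400, 200))
print(library)  # user A's voiceprint now maps to the new box (x3, y3, w3, h3)
```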
  • the present application can also simultaneously display the voice output of user B to the output position indicated by user B on the touch screen.
  • the processor will add and save the correspondence between the voiceprint of user B and the output position indicated by user B on the basis of Table 2 above.
  • the voiceprint library saved by the system is as shown in Table 3.
  • it should be noted that when user A and user B start voice input at the same time, step S415 needs to be used to further verify whether the user who performs the touch operation and the user who utters the voice are the same user, so as to avoid establishing a correspondence between user A's voiceprint and the output position indicated by user B, thereby improving the robustness of the system.
  • in this case, in S422 the processor may compare the voiceprint of the second voice one by one with the voiceprints in the voiceprint library shown in Table 3. If it matches one of the voiceprints, the judgment result of S422 is considered to be yes. Conversely, if the voiceprint of the second voice does not match any voiceprint in the voiceprint library saved by the system, the judgment result of S422 is considered to be no, and the second voice is not output.
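A minimal sketch of this one-by-one comparison is shown below: the voice is routed to the output position of the best-matching stored voiceprint, or discarded when nothing matches. The feature vectors, threshold, and similarity measure are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def route_voice(features, voiceprint_library, threshold=0.8):
    """Compare a collected voice against every stored voiceprint (Table 3 style).

    Returns the output position bound to the best match above the threshold,
    or None when no stored voiceprint matches (the voice is then not output).
    """
    best_position, best_score = None, threshold
    for stored_features, position in voiceprint_library.items():
        score = cosine(features, list(stored_features))
        if score >= best_score:
            best_position, best_score = position, score
    return best_position

library = {
    (0.9, 0.1, 0.3): (120, 80, 400, 200),   # user A's voiceprint -> user A's box
    (0.2, 0.8, 0.4): (600, 80, 400, 200),   # user B's voiceprint -> user B's box
}
print(route_voice([0.88, 0.12, 0.31], library))  # user A's box
print(route_voice([0.10, 0.10, 0.90], library))  # None: the voice is discarded
```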
  • therefore, this embodiment of the present application identifies the identity of the user corresponding to the second voice through voiceprint matching, which avoids the problem that, when multiple people input through voice at the same time, the voice content of different users is confused and displayed incorrectly on the touch screen because the identity of the speaker cannot be distinguished.
  • in addition, in practical applications, user A may already have released the binding with a specific output position, while user C may want to modify or supplement the content at an output position previously designated by user A, for example the content at output position (x1, y1, w1, h1).
  • in this case, the system establishes the correspondence between user C's voiceprint and the output position (x1, y1, w1, h1) in the same way as described above, and appends user C's voice content after the existing content in the output box (x1, y1, w1, h1).
  • the system establishes the correspondence between the voiceprint of user C and the output position (x1', y1', w1', h1') based on the same implementation, and outputs the voice content of user C in the output box (x1', y1', w1', h1'); where (x1', y1', w1', h1') is the output box superimposed on (x1, y1, w1, h1). Therefore, in the first possible scenario, the content of the voiceprint library saved by the system may be as shown in Table 4A-1 or Table 4A-2.
  • User Voiceprint (key) Output position (value)
  • A Voiceprint feature A (x3, y3, w3, h3)
  • B Voiceprint feature B (x2, y2, w2, h2)
  • C Voiceprint feature C (x1', y1', w1', h1')
  • similarly, there may be a scenario in which user C modifies or supplements the content at user A's current output position, for example (x3, y3, w3, h3). The system can establish a correspondence between user C's voiceprint and the output position (x3, y3, w3, h3); since the voiceprints of user A and user C then correspond to the same output position, user A and user C can both input voice content into (x3, y3, w3, h3), and the system outputs their content to (x3, y3, w3, h3) in sequence according to the order in which the voices of user A and user C were collected (a short sketch of this ordering follows this discussion).
  • alternatively, the system can also allow user C to supplement or modify the content input by user A by establishing a correspondence between user C's voiceprint and an output position (x3', y3', w3', h3') superimposed on (x3, y3, w3, h3).
  • This process is similar to the above implementation description in the first possible scenario, and thus will not be repeated here. Therefore, in the second possible scenario, the content of the voiceprint library saved by the system may be as shown in Table 4B-1 or Table 4B-2.
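The following is a minimal sketch of the time-ordered output into a shared box mentioned above; the timestamps, user labels, and texts are illustrative.

```python
def merge_by_capture_time(segments):
    """Order recognised segments from different users by their capture timestamps.

    Each segment is (capture_time_seconds, user, text). The merged list is what
    gets written, segment after segment, into the shared output box
    (x3, y3, w3, h3)."""
    return [f"{user}: {text}" for _, user, text in sorted(segments)]

segments = [
    (12.4, "A", "first point from user A"),
    (13.1, "C", "addition from user C"),
    (15.0, "A", "second point from user A"),
]
print(merge_by_capture_time(segments))
```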
  • the system can differentiate and display the voice content input by user A and user C.
  • specifically, the content appended by user C, or the superimposed output content, can be displayed on the touch screen in a format different from that of user A's voice content, for example in a different color, font, or other format from the one used when displaying user A's voice text. The present application does not limit the manner in which the system uses different formats to display the voice content of different users.
  • the embodiment of the present application can also allow multiple users to input content on the touch screen through voice in a cooperative manner, which improves input efficiency and experience.
  • the human-computer interaction system may be divided according to the foregoing method examples.
  • the human-computer interaction system shown in FIG. 1 can be used to perform a human-computer interaction method, for example, to perform the method shown in FIG. 4 .
  • the human-computer interaction system includes: a touch screen 13 , a processor 11 , and a memory 12 . in,
  • the touch screen 13 is configured to receive the touch operation and send the position information of the touch operation to the processor 11 .
  • the processor 11 is configured to establish a correspondence between the first voiceprint and the first output position on the touch screen 13 .
  • the processor 11 is further configured to receive the first voice, and when judging that the voiceprint of the voice matches the first voiceprint, identify the content of the voice, and output and display the content of the voice to the first voiceprint. at the output location.
  • Memory 12 is used to store data, instructions or program codes.
  • the processor 11 calls and executes the instructions or program codes stored in the memory 12, the human-computer interaction method provided by the embodiments of the present application can be implemented.
  • in the foregoing process, when establishing the correspondence between the first voiceprint and the first output position on the touch screen 13, the processor 11 is specifically configured to: receive the position information of the touch operation and receive the second voice; and, when the touch operation matches the predetermined first rule and the second voice matches the predetermined second rule, determine the first output position on the touch screen 13 according to the position of the touch operation, extract the first voiceprint from the second voice, and then establish the correspondence between the first voiceprint and the first output position on the touch screen 13. For example, with reference to FIG. 4, the touch screen 13 can be used to execute S402 and S404, and the processor 11 can be used to execute S406, S408, S414, S416, S422, S424, and S426.
  • when determining the first output position on the touch screen 13 according to the position of the touch operation, the processor 11 is specifically configured to determine the starting position and range of the first output position on the touch screen 13 according to the set of contact positions among the positions of the touch operation.
  • the processor 11 is further configured to output and display the first output position on the touch screen 13 .
  • the processor 11 is further configured to release the correspondence between the first voiceprint and the first output position on the touch screen 13 .
  • the processor 11 is further configured to establish a correspondence between the second voiceprint and the first output position on the touch screen 13 .
  • in this case, when receiving a third voice and judging that the voiceprint of the third voice matches the second voiceprint, the processor 11 recognizes the content of the third voice, and outputs and displays the content in the blank space of the first output position of the touch screen 13, or covering the content of the first voice.
  • optionally, in this case, the processor 11 is further configured to select, when outputting the content of the third voice, an output format different from the one used when outputting the content of the first voice.
  • optionally, when the second voice received by the processor 11 is collected by an array microphone, the processor 11 is further configured to calculate the position of the sound source of the second voice, and, before establishing the correspondence between the first voiceprint and the first output position described above, it is further configured to determine whether the sound source position of the second voice and the position of the touch operation satisfy a preset condition.
  • the processor 11 may be configured to perform S415.
  • optionally, the processor 11 is further configured to receive images of the area in front of the touch screen 13, and, after judging that the touch operation matches the predetermined first rule and the first voice matches the predetermined second rule, it is further configured to analyze and track the images in real time and to determine, according to the results of the image analysis and tracking, whether the user who performs the touch operation and the user who utters the first voice are the same user.
  • the touch screen 13 is further configured to display the text content, the output position or the scroll bar according to the instruction of the processor 11 .
  • the voice collector 14 can be used to collect voice, and send the collected voice to the processor 11 .
  • the voice collector 14 may be used to perform S410 , S412 , S418 and S420 .
  • the image collector 15 can be used to collect an image near the front of the touch screen 13 and send the image to the processor 11 in real time.
  • the above-mentioned processor or a computer device including the processor may be divided into functional modules according to the above-mentioned method examples.
  • for example, each functional module may be obtained through division according to a corresponding function, or two or more functions may be integrated into one processing module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules. It should be noted that, the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
  • FIG. 7 is a schematic structural diagram of a processor or a computer device according to an embodiment of the present application.
  • the processor or computer device is used to execute the above-mentioned human-computer interaction method, for example, to execute the method shown in FIG. 4 .
  • the processor or computer device may include a speech processing unit 702 and an integrated processing unit 704 .
  • the voice processing unit 702 is configured to receive the first voice, and to recognize the content of the first voice when judging that the voiceprint of the first voice matches the first voiceprint.
  • the integrated processing unit 704 is configured to establish the correspondence between the first voiceprint and the first output position, and is further configured to output the content of the first voice to the first output position.
  • the above-mentioned processor or computing device further includes a touch point position processing unit 701 for receiving a touch point position generated by a touch operation.
  • the contact position processing unit 701 is further configured to determine the first output position according to the contact position when it is judged that the contact position matches the predetermined first rule.
  • the voice processing unit 702 is further configured to receive the second voice, and to extract the first voiceprint from the second voice when it is judged that the second voice matches the predetermined second rule.
  • the touch point position processing unit 701 can be used to execute S406 and S408, the voice processing unit 702 can be used to execute S414, S422 and S424, and the integrated processing unit 704 can be used to execute S416 and S426.
  • in the foregoing process, when determining the first output position according to the contact positions, the contact position processing unit 701 is specifically configured to determine the starting position and range of the first output position according to the set of contact positions.
  • optionally, the integrated processing unit 704 is further configured to output the first output position.
  • the integrated processing unit 704 is further configured to release the correspondence between the first voiceprint and the first output position.
  • the voice processing unit 702 is further configured to receive a third voice, and when judging that the voiceprint of the third voice matches the second voiceprint, identify the content of the third voice.
  • the integrated processing unit 704 is further configured to establish a correspondence between the second voiceprint and the first output position, and to output the content of the third voice to the first output position, for example, to the blank space of the first output position or covering the content of the above-mentioned first voice.
  • optionally, when the second voice received by the voice processing unit 702 is collected by an array microphone, the voice processing unit 702 is further configured to calculate the position of the sound source of the second voice; in this case, the integrated processing unit 704 is further configured to determine, before establishing the correspondence between the identifier of the first voiceprint and the first output position, whether the position of the sound source of the second voice and the contact position satisfy a preset condition.
  • the integrated processing unit 704 may be used to perform S415.
  • optionally, the processor or the computer device may further include an image processing unit 703 configured to receive images and, before the integrated processing unit 704 establishes the correspondence between the identifier of the first voiceprint and the first output position, to analyze and track the images in real time, and then determine, according to the analysis and tracking results, whether the user who performs the touch operation and the user who utters the first voice are the same user.
  • as an example, with reference to FIG. 1, the functions implemented by the touch point position processing unit 701, the voice processing unit 702, the integrated processing unit 704, and the image processing unit 703 in the computer device are the same as those of the processor 11 in FIG. 1.
  • Another embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium.
  • when the instructions are run on a human-computer interaction system or a computer device, the human-computer interaction system or the computer device performs each step performed by the human-computer interaction system or the computer device in the method flows shown in the foregoing method embodiments.
  • the disclosed methods may be implemented as computer program instructions encoded in a machine-readable format on a computer-readable storage medium or on other non-transitory media or articles of manufacture.
  • FIG. 8 schematically shows a conceptual partial view of a computer program product provided by an embodiment of the present application, where the computer program product includes a computer program for executing a computer process on a computing device.
  • the computer program product is provided using signal bearing medium 90 .
  • the signal bearing medium 90 may include one or more program instructions that, when executed by one or more processors, may provide the functions, or portions thereof, described above with respect to FIG. 4 .
  • thus, for example, one or more features of S402 to S426 in FIG. 4 may be undertaken by one or more instructions associated with the signal bearing medium 90.
  • the program instructions in FIG. 8 also describe example instructions.
  • in some examples, the signal bearing medium 90 may include a computer readable medium 91 such as, but not limited to, a hard disk drive, a compact disc (CD), a digital video disc (DVD), a digital tape, a memory, a read-only memory (ROM), or a random access memory (RAM).
  • the signal bearing medium 90 may include a computer recordable medium 92 such as, but not limited to, memory, read/write (R/W) CDs, R/W DVDs, and the like.
  • signal bearing medium 90 may include communication medium 93 such as, but not limited to, digital and/or analog communication media (eg, fiber optic cables, waveguides, wired communication links, wireless communication links, etc.).
  • Signal bearing medium 90 may be conveyed by a wireless form of communication medium 93 (eg, a wireless communication medium conforming to the IEEE 802.11 standard or other transmission protocol).
  • the one or more program instructions may be, for example, computer-executable instructions or logic-implemented instructions.
  • in some examples, a human-computer interaction system or computer device such as that described with respect to FIG. 4 may be configured to provide various operations, functions, or actions in response to one or more program instructions conveyed through the computer readable medium 91, the computer recordable medium 92, and/or the communication medium 93.
  • the foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when a software program is used for implementation, the embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • when the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner.
  • the computer-readable storage medium may be any usable medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more usable media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), or the like.

Abstract

人机交互方法、装置和系统,涉及人机交互技术领域,有助于实现将一个或多个用户的语音内容输出并显示到触摸屏上对应的输出位置,提升触摸屏输入的效率和用户的体验。该人机交互方法包括:建立第一声纹与触摸屏上的第一输出位置的对应关系;接收第一语音,在判断所述第一语音的声纹与所述第一声纹匹配时,识别该语音的内容,并将此内容输出并显示到所述第一输出位置。

Description

人机交互方法、装置和系统
本申请要求于2020年9月17日提交的申请号为202010983742.6、发明名称为“人机交互方法、装置和系统”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人机交互技术领域,尤其涉及人机交互方法、装置和系统。
背景技术
人机交互(human computer interaction,HCI;或者,human machine interaction,HMI)是一门研究系统与用户之间的交互关系的学问。这里的系统可以是各种各样的机器,也可以是计算机化的系统和软件。在涉及触摸屏输入的人机交互中,可以通过触摸操作来实现输入。但由于成本和技术的原因,通过触摸操作来实现触摸屏上输入的方式,经常会出现手写连笔和笔迹难控制的情况,导致输入困难和输入效率低下。
发明内容
本申请提供的人机交互方法、装置和系统,通过多模态(modality)人机交互的方式,有助于实现将一个或多个用户的语音内容输出并显示到触摸屏上对应的输出位置,提升触摸屏输入的效率和用户的体验。
为达上述目的,本申请提供如下技术方案:
第一方面,本申请提供了一种人机交互方法,该人机交互方法应用于人机交互系统。该方法包括:首先建立第一声纹与触摸屏上的第一输出位置的对应关系;然后在接收到第一语音、并判断该语音的声纹与上述第一声纹匹配时,识别该语音的内容,并将识别的内容输出并显示上述第一输出位置。
通过本申请的第一方面,可以实现将一个或多个用户输入的语音内容输出并显示到触摸屏上用户指示的输出位置上,避免多人同时通过语音输入时,将不同用户的语音内容混淆显示在触摸屏上的问题,因此可以提升触摸屏输入的效率和体验。
结合第一方面,在一种可能的实现中,上述建立第一声纹与触摸屏上的第一输出位置的对应关系的过程,可以为:接收触摸操作以及第二语音,并且在判断该触摸操作与预定的第一规则匹配、且该第二语音与预定的第二规则匹配时,根据所述触摸操作的位置确定触摸屏上的第一输出位置,并从上述第二语音中提取第一声纹;然后建立该第一声纹与上述第一输出位置的对应关系。
结合第一方面,在一种可能的实现方式中,上述根据所述触摸操作的位置确定第一输出位置的过程,可以为:根据所述触摸操作的位置中的触点位置集合确定所述第一输出位置的起始位置和范围。
可选地,上述第一输出位置所指示的形状可以为矩形,圆形或菱形等。当上述第一输出位置所指示的形状为矩形时,所述第一输出位置可以包括左上角的坐标、以及宽度和高度。或者,所述第一输出位置还可以包括其起始位置和结束位置,例如第一输出位置的左上角坐标和右下角坐标。
可选地,系统还可以输出并在触摸屏上显示该第一输出位置。例如以边框、或者其他可 以与触摸屏当前背景相区分的形式显示所述第一输出位置。此外,当上述第一语音的内容超过了第一输出位置所能输出的上限时,系统还可以在第一输出位置的水平或处置方向生成、并在触摸屏上显示该滚动条。这种实现方式下,可以保证用户输入的语音内容都被记录下来,并且便于用户随时翻看任何被记录的语音内容,因而可以进一步提升用户的体验。
结合第一方面,在一种可能的实现方式中,上述判断触摸操作与预定的第一规则匹配的过程,可以为:判断所述触摸操作的位置中的触点位置集合与预定的位置规则一致。或者,识别所述触摸操作的位置中的触点位置集合所构成的形状,判断所述形状与预定的形状匹配。
结合第一方面,在一种可能的实现方式中,上述判断第二语音与预定的第二规则匹配的过程,可以为:识别该第二语音的内容,并判断此内容与预定的内容匹配。
可选地,系统还可以解除所述第一声纹与所述第一输出位置的对应关系。这样,可以建立所述第一声纹与其他输出位置的对应关系,以便于第一声纹对应的用户可以切换到其他输出位置输入其语音内容。
可选地,系统还可以建立第二声纹与所述第一输出位置的对应关系。并且,在接收到第三语音,并且判断该语音的声纹与所述第二声纹匹配时,可以将该语音的内容输出到所述第一输出位置的空白处,或者覆盖上述第一语音的内容。这样,可以实现某个用户通过语音(上述第三语音)对其他用户的语音内容(上述第一语音)进行补充或修改的功能。因此,通过这种实现方式可以让多个用户相互协作通过语音在触摸屏上输入内容,因而可以提升触摸屏输入的效率和体验。此外,可选地,由于上述第三语音和第一语音是由不同的用户发发出的,因此,系统还可以在输出第三语音的内容时,选择不同于输出上述第一语音的内容时的输出格式,使得显示的效果更好。
可选地,当上述第二语音是由阵列麦克风采集时,还可以计算该第二语音的声源的位置,并在上述建立所述第一声纹与所述第一输出位置的对应关系之前,判断所述第二语音的声源位置与所述触摸操作的位置是否满足预设的条件。如果满足,才建立所述第一声纹与所述第一输出位置的对应关系;否则,则不建立所述对应关系。
可选地,系统还可以接收图像采集器采集的触摸屏前方的图像,并在上述建立所述第一声纹与所述第一输出位置的对应关系之前,实时分析和跟踪上述图像的内容,并根据图像分析和跟踪的结果判断执行所述触摸操作的用户与发出所述第二语音的用户是否为同一个用户。或者,通过联合声源定位和图像跟踪来判断执行触摸操作的用户与发出所述第二语音的用户是否为同一个用户。如果判断是同一个用户,才建立所述对应关系;否则,则不建立所述对应关系。
通过上述两种方式,当有多个用户同时启动语音输入时,可以避免将一个用户的声纹与另一个用户所指示的输出位置绑定起来,因而可以提升系统的鲁棒性。
第二方面,本申请提供了一种人机交互方法,该人机交互方法应用于计算机设备,该方法包括:首先建立第一声纹与第一输出位置的对应关系;然后在接收到第一语音、并判断该语音的声纹与上述第一声纹匹配时,识别该语音的内容,并将此内容输出到上述第一输出位置。
结合第二方面,在一种可能的实现中,上述建立第一声纹与第一输出位置的对应关系的过程,可以为:接收触点位置以及第二语音,并且在判断该触点位置与预定的第一规则匹配、且该第二语音与预定的第二规则匹配时,根据所述触点位置确定第一输出位置,并从上述第二语音中提取第一声纹;然后建立该第一声纹与上述第一输出位置的对应关系。
结合第二方面,在一种可能的实现方式中,上述根据所述触点位置确定第一输出位置的过程,可以为:根据触点位置中的触点位置集合确定所述第一输出位置的起始位置和范围。
可选地,上述第一输出位置所指示的形状可以为矩形,圆形或菱形等。当上述第一输出位置所指示的形状为矩形时,所述第一输出位置可以包括左上角的坐标、以及宽度和高度。或者,所述第一输出位置还可以包括其起始位置和结束位置,例如第一输出位置的左上角坐标和右下角坐标。
可选地,所述计算机设备还可以输出该第一输出位置;例如以边框、或者其他可以与当前背景相区分的形式显示所述第一输出位置。
结合第二方面,在一种可能的实现方式中,上述判断触点位置与预定的第一规则匹配的过程,可以为:判断所述触点位置中的触点位置集合是否与预定的位置规则一致。或者,识别所述触点位置中的触点位置集合所构成的形状,判断所述形状是否与预定的形状匹配。
结合第二方面,在一种可能的实现方式中,上述判断第二语音与预定的第二规则匹配的过程,可以为:识别该第二语音的内容,并判断此内容是否与预定的内容匹配。
可选地,所述计算机设备还可以解除所述第一声纹与所述第一输出位置的对应关系。
可选地,所述计算机设备还可以建立第二声纹与所述第一输出位置的对应关系。并且,在接收到第三语音,并且第三语音的声纹与所述第二声纹匹配时,可以将该第三语音的内容输出到所述第一输出位置的空白处,或者覆盖上述第一语音的内容。
可选地,当上述第二语音是由阵列麦克风采集时,还可以计算该第二语音的声源的位置,并在上述建立所述第一声纹与所述第一输出位置的对应关系之前,判断所述第二语音的声源位置与所述触点位置是否满足预设的条件。
可选地,所述计算机设备还可以接收图像采集器采集的触摸屏前方的图像,并在上述建立所述第一声纹与所述第一输出位置的对应关系之前,实时分析和跟踪该上述图像的内容,并根据图像分析和跟踪的结果判断执行所述触摸操作的用户与发出所述第二语音的用户为同一个用户。或者,通过联合声源定位和图像跟踪来判断执行触摸操作的用户与发出所述第二语音的用户为同一个用户。
第二方面及其任一种可能的设计提供的技术方案的相关内容的解释和有益效果的描述均可以参考上述第一方面或其相应的可能的设计提供的技术方案,此处不再赘述。
第三方面,本申请提供一种人机交互系统,该人机交互系统可以用于执行上述第一方面或第二方面提供的任一种方法。该人机交互系统可以包括触摸屏和处理器。
触摸屏,用于接收触摸操作,并将所述触摸操作的位置发送给所述处理器。
处理器,执行上述第二方面提供的任一种人机交互方法。所述处理器的任一种可能实现的技术方案的相关内容的解释和有益效果的描述均可以参考上述第二方面或其相应的可能的设计提供的技术方案,此处不再赘述。
可选的,上述人机交互系统还可以包括语音采集器,用于采集第一语音,第二语音以及第三语音,并将第一语音、第二语音以及第三语音发送给所述处理器。
可选地,上述人机交互系统还可以包括图像采集器,用于采集触摸屏前方附近的图像,并将其发送给所述处理器。第三方面中,处理器和触摸屏执行的可能的技术方案和有益效果的描述均可以参考上述第一方面或第二方面或其相应的可能的设计提供的技术方案,此处不再赘述。
第四方面,本申请提供了一种计算机设备,该计算机设备可以用于执行上述第二方面提供的任一种方法,该情况下,该计算机设备具体可以是处理器或包含处理器的设备。
在一种可能的设计中,可以根据上述第二方面提供的任一种方法,对该装置进行功能模块的划分。在这种实现方式下,该计算机设备包括语音处理单元以及综合处理单元。
语音处理单元用于接收第一语音,并在判断第一语音的声纹与第一声纹匹配时,识别该语音的内容。
综合处理单元用于建立所述第一声纹与第一输出位置的对应关系;还用于将上述第一语音的内容输出到所述第一输出位置上。
所述计算机设备还包括触点位置处理单元,用于接收触点位置,该触点位置由触摸操作产生;还用于在判断触点位置与预定的第一规则匹配时,根据所述触点位置确定第一输出位置。
所述语音处理单元还用于接收第二语音,并在判断该语音与预定的第二规则匹配时,从该语音中提取第一声纹。
在上述过程中,触点位置处理单元在根据触点位置确定第一输出位置时,具体用于:根据所述触点位置中的触点位置集合确定第一输出位置的起始位置和范围。可选地,综合处理单元还用于输出第一输出位置。
可选地,综合处理单元还用于解除上述第一声纹与第一输出位置的对应关系。
可选地,语音处理单元还用于接收第三语音,并在判断第三语音的声纹与第二声纹匹配时,识别第三语音的内容。综合处理单元还用于建立所述第二声纹与所述第一输出位置的对应关系;并用于将所述第三语音的内容输出到所述第一输出位置的空白处,或覆盖上述第一语音的内容。此时,因此,综合处理单元在输出第三语音时,还用于选择不同于输出上述第一语音的内容时的输出格式。
可选地,当语音处理单元接收的所述第二语音是由阵列麦克风采集的时,所述语音处理单元还用于计算第二语音的声源的位置。此时,综合处理单元还用于在上述建立第一声纹与第一输出位置的对应关系之前,判断第二语音的声源的位置与所述触点位置是否满足预设的条件。
可选地,上述计算机设备还可以包括图像处理单元,用于接收图像。图像处理单元还用于在综合处理单元建立上述第一声纹与第一输出位置的对应关系之前,实时分析和跟踪该图像,然后根据分析和跟踪的结果判断执行所述触摸操作的用户与发出所述第一语音的用户是否为同一个用户。
在另一种可能的设计中,该计算机设备包括:存储器和一个或多个处理器;存储器和处理器耦合。上述存储器用于存储计算机程序代码,该计算机程序代码包括计算机指令,当该计算机指令被计算机设备执行时,使得计算机设备执行如第二方面及其任一种可能的设计方式所述的人机交互方法。
第五方面,本申请提供一种计算机可读存储介质,该计算机可读存储介质包括计算机指令,当该计算机指令在人机交互系统上运行时,使得人机交互系统实现如第一方面或第二方面提供的任一种可能的设计方式所述的人机交互方法。
第六方面,本申请提供一种计算机程序产品,当该计算机程序产品在人机交互系统上运行时,使得人机交互系统实现如第一方面或第二方面提供的任一种可能的设计方式所述的人机交互方法。
本申请中第二方面到第六方面及其各种实现方式的具体描述,可以参考第一方面及其各 种实现方式中的详细描述;并且,第二方面到第六方面及其各种实现方式的有益效果,可以参考第一方面及其各种实现方式中的有益效果分析,此处不再赘述。
在本申请中,上述人机交互系统的名字对设备或功能模块本身不构成限定,在实际实现中,这些设备或功能模块可以以其他名称出现。只要各个设备或功能模块的功能和本申请类似,属于本申请权利要求及其等同技术的范围之内。
本申请的这些方面或其他方面在以下的描述中会更加简明易懂。
附图说明
图1为本申请实施例提供的人机交互系统的一种硬件结构图。
图2为本申请实施例提供的人机交互系统的一种结构示意图一。
图3为本申请实施例提供的人机交互系统的一种结构示意图二。
图4为本申请实施例提供的人机交互方法的流程示意图。
图5A-图5C为本申请实施例提供的根据触摸操作确定输出位置的示意图。
图6A和图6B为本申请实施例提供的计算发出第一语音的用户的位置的方法示意图。
图6C为本申请实施例提供的判断语音和人脸特征的对应关系的方法示意图。
图7为本申请实施例提供的计算机设备的结构示意图。
图8为本申请实施例提供的计算机程序产品的结构示意图。
具体实施方式
在本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。
在本申请的实施例中,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本申请的描述中,除非另有说明,“多个”的含义是两个或两个以上。
本申请实施例提供一种人机交互方法、装置和系统,人机交互系统通过获取触摸操作和第一语音,当判断所述触摸操作符合预定的第一规则、以及所述第一语音符合预定的第二规则时,建立第一语音的声纹与输出位置的对应关系;其中,上述输出位置是指根据所述触摸操作确定的触摸屏上的输出位置;然后获取第二语音,并判断该第二语音的声纹与第一语音的声纹是否匹配,如果匹配,则将第二语音对应的文本内容输出并显示到触摸屏上的所述输出位置。
可选地,在上述过程中,在建立上述第一语音的声纹与输出位置的对应关系之前,还可以进一步判断第一语音的声源的位置与触摸操作的位置是否满足预设的条件,如果判断为是,才建立上述对应关系。上述的判断是为了进一步确认上述执行触摸操作的用户和发出第一语音的用户是否为同一个用户,因而可以提升建立所述对应关系的准确性,最终可以提升系统的鲁棒性。
同样可选地,在建立上述第一语音的声纹与输出位置的对应关系之前,还可以利用摄像头采集触摸屏前方附近的图像,并根据对采集的图像实时分析和跟踪来判断执行触摸操作的用户和发出第一语音的用户是否为同一个用户,该过程同样可以提升建立所述对应关系的准确性, 进而提升系统的鲁棒性。
通过本申请实施例,可以让一个或多个用户以语音的方式实现触摸屏的输入,提升了输入的效率;并且能够让各个用户输入的语音内容输出并显示到触摸屏上该用户指示的输出位置处,避免将不同用户的语音内容混淆显示在触摸屏上的问题,因此也可以提升输入的体验。
上述人机交互方法可以通过安装在设备上的应用程序实现,例如人机交互应用程序。
上述应用程序可以是安装在设备中的嵌入式应用程序(即设备的系统应用),也可以是可下载应用程序。其中,嵌入式应用程序是作为设备(如手机)实现的一部分提供的应用程序。可下载应用程序是一个可以提供自己的因特网协议多媒体子系统(internet protocol multimedia subsystem,IMS)连接的应用程序,该可下载应用程序是可以预先安装在设备中的应用或可以由用户下载并安装在设备中的第三方应用。
图1,为本申请实施例提供的人机交互系统的硬件结构。如图1所示,人机交互系统10包括处理器11、存储器12、触摸屏13以及语音采集器14。可选地,人机交互系统10还可以包括图像采集器15。
处理器11是人机交互系统10的控制中心,可以是一个通用中央处理单元(central processing unit,CPU),也可以是其他通用处理器等,例如图形处理器(Graphics processing unit,GPU)。其中,通用处理器可以是微处理器或者是任何常规的处理器等。作为一个示例,处理器11可以包括一个或多个CPU,例如图1中所示的CPU 0和CPU 1。可选地,处理器11还可以包括一个或多个GPU,例如图1所示的GPU0和GPU1。
存储器12,可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。存储器12可以独立于处理器11,也可以通过总线与处理器11相连接,还可以和处理器11集成在一起。存储器12用于存储数据、指令或者程序代码。处理器11调用并执行存储器12中存储的指令或程序代码时,能够实现本申请实施例提供的人机交互方法。
触摸屏13,具体可以包括触控板131和显示屏132。
其中,触控板131可以采用电阻式、电容式、红外线以及表面声波等多种类型来实现触控板。触控板131用于采集用户在其上或附近的触摸事件(比如用户使用手指、触控笔等任何适合的物体在触控板上或在触控板附近的操作),并将采集到的触摸信息发送给其他器件(例如处理器11)。其中,用户在触控板附近的触摸事件可以称之为悬浮触控;悬浮触控可以是指,用户无需为了选择、移动或拖动目标(例如图标等)而直接接触触控板,而只需用户位于设备附近以便执行所想要的功能。
显示屏132可以采用液晶显示屏、有机发光二极管等形式来配置显示屏132。触控板131可以覆盖在显示屏132之上,当触控板131检测到在其上或附近的触摸事件后,传送给处理器11以确定触摸事件的类型,处理器11可以根据触摸事件的类型在显示屏132上提供相应的视觉输出。显示屏132用于显示由用户输入的信息或提供给用户的信息。
语音采集器14也称“话筒”,或“传声器”等,可以是单麦克风;或者可选地,也可以是麦克风阵列。语音采集器14用于接收语音信号,并将语音信号转换为电信号后发送给其他 器件(例如处理器11)处理。当语音采集器14为麦克风阵列时,还用于定位声源的位置。
图像采集器15可以是CCD、CMOS等成像设备,也称“摄像头”。图像采集器15用于采集图像,并将采集的图像数据发送给其他器件(例如处理器11)处理。
上述处理器11、存储器12、触摸屏13、语音采集器14以及图像采集器15可以集成在一个设备上,在这种实现下,人机交互系统10可以是电子白板、智能手机、带触摸屏的笔记本电脑、带触摸屏的计算机、平板、上网本、车载等终端设备。示例性的,如果是电子白板,参考图2所示,上述人机交互应用程序可以在电子白板20内运行。可选的,该人机交互系统10还可以包括触摸笔21,触摸笔21用于在电子白板20的触摸屏13上输入触摸操作。
此外,上述处理器11、存储器12、触摸屏13、语音采集器14以及图像采集器15也可以分别集成在不同的设备上,在这种实现下,上述人机交互系统10可以包括多个设备,以执行本申请实施例提供的人机交互方法。示例性的,如图3所示的人机交互系统10可以包括:电子白板20、计算机32和投影机33。可选的,人机交互系统10还可以包括触摸笔21,触摸笔21用于在电子白板20的触摸屏13上输入触摸操作。其中,处理器11可以是计算机32的处理器。存储器12可以是计算机32的存储器。这时,上述人机交互应用程序可以在计算机32内运行。另外,触摸屏13可以是电子白板20的触摸屏。语音采集器14可以集成在电子白板20中。或者,语音采集器14也可以集成在计算机32、投影机33或者触摸笔21中,本申请实施对此不作限定。图像采集器15可以集成在电子白板20中。
需要说明的是,如果上述语音采集器14是麦克风阵列时,其集成位置与触摸屏需要满足一定关系;具体地,其集成的位置与可以在触摸屏的上方或者下方、与触摸屏的水平方向平行,且其中点与触摸屏的水平方向的中点重合,例如可以将其集成在触摸屏的上边沿或下边沿上(如图2或图3所示的部署方式)。此外,则其部署的位置也可以在触摸屏的左边或者右边、与触摸屏的垂直方向平行,且其中点与触摸屏的垂直方向的中点重合。
图1中示出的结构并不构成对该人机交互系统10的限定,除图1所示部件之外,该人机交互系统10可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置;上述对人机交互系统10的描述也仅为示例性说明,并不构成对本实施例的限定。
下面结合附图对本申请实施例提供的人机交互方法进行描述。
请参考图4,图4示出了本申请实施例提供的人机交互方法的流程示意图,本实施例中以在会场中(该会议室里部署有电子白板),多个用户讨论时,有的用户想要以语音方式在电子白板上记录一些内容的场景为例,该人机交互方法包括以下但不限于以下步骤:
S402、触摸屏接收触摸操作,确定触摸操作的位置信息。
用户A通过手指或触摸笔在触摸屏上进行了触摸操作,则触摸屏会感应到该操作,并得到该触摸操作的位置信息。触摸操作的位置信息可以是用户A在执行触摸操作时手指或触摸笔在触摸屏上的触点的位置,例如,可以是用户A的触摸操作产生的触点在触摸屏中的坐标。
触摸操作的位置信息可以是单个触点在触摸屏的位置,例如,用户A用单个手指触摸屏幕所产生的触点位置;触摸操作的位置信息还可以是多个触点在触摸屏的位置,例如,用户A同时用左手和右手的食指触摸屏幕所产生的两个触点位置。
再例如,用户A用手指或触摸笔在触摸屏上画一条线,在该触摸操作过程中,触摸屏连续多次得到该触摸操作的位置信息。其中,每次得到的位置信息中包含单个触点位置,代表用户A在画一条线的滑动过程中,不同时刻所产生的触点位置。
S404、触摸屏向处理器发送触摸操作的位置信息。
具体的,触摸屏可以周期性或触发性或实时性地向处理器发送检测到的触摸操作所产生的触点位置。例如,触摸屏可以在每个周期向处理器发送一帧数据。该帧数据包括该周期内检测到的触点位置;可以理解的是,当在该周期内没有检测到触点时,发送的该帧数据中不会包含触点的位置。
由于触摸屏可能会按一定的时间周期向处理器发送检测周期内单个或者多个触点的位置;为了保持对触摸操作的响应灵敏度和实时性,上述时间周期通常会很短(例如ms级),因而,在用户A完成一个完整触摸操作的持续过程中,触摸屏通常会多次向处理器发送触摸操作的位置信息。
可选地,在有些实现下,触摸屏在向处理器发送触摸操作的位置信息时,还会携带所述位置信息中每个触点位置对应触摸状态信息,例如是“按下”、“抬起”或者“移动”等状态。
S406、处理器接收触摸操作的位置信息,并判断触摸操作与预定的第一规则是否匹配。
处理器实时接收触摸操作的位置信息。
处理器通常会在用户A完成一个完整触摸操作的过程中,多次收到触摸操作的位置信息。因此,处理器在接收到触摸操作的位置信息时,会先判断某个用户(此处以用户A为例)本次触摸操作是否完成。具体地,处理器可以通过踪触摸操作来判断触摸操作是否完成。例如,处理器在每次接收到触摸操作的位置信息时开始计时,如果在预定的时间间隔内,没有再次接收到触摸操作的位置信息,则判断当前的触摸操作操作完成;并从判断当前触摸操作结束的时刻起,将后续收到的触摸操作的位置信息作为下一个的触摸操作的位置信息。如果在预定的时间间隔内,再次接收到了触摸操作的位置信息,即认为当前触摸操作还未完成,并将该再次接收到的位置信息作为当前触摸操作的位置信息。然后从所述再次接收到了触摸操作的位置信息的时刻开始计时,按照上述的过程继续判断;直到在接收到某次触摸操作的位置信息后,在预定的时间间隔内没有再次接收到触摸操作的位置信息,则认为触摸操作结束。其中,预定时间间隔根据经验值或者实际情况设置,本申请中不限定。处理器还可以根据上述触摸操作的位置信息中携带的触摸状态信息判断触摸操作是否结束。例如当触摸状态为“按下”或者“移动”时,则判断触摸操作未完成;当触摸状态为“抬起”时,则判断触摸操作已完成。
上述处理器判断触摸操作是否完成的方法只是示例性的方法。根据不同触摸屏厂家的驱动实现方式的不同,处理器判断触摸操作是否完成的方法也可能不同,本申请实施例对此不作限定。
当处理器判断本次触摸操作完成时,然后判断该触摸操作与预定的第一规则是否匹配。其中,第一规则是预先定义或者配置的关于触摸操作的任意规则,该规则用以判断用户A是否想启动语音输入。示例性地,预定的第一规则可以为“触摸屏上的任意一个点”;还可以为“触摸屏上的两个点,且两个点的连线与触摸屏的水平方向平行、两个点距离不低于预设的值M”;也可以为“一条触摸轨迹,且该触摸轨迹与触摸屏水平方向平行,且长度不低于预设的值”。处理器可以判断接收到的本次触摸操作所产生的一个或多个触点位置集合是否满足上述第一规则定义的条件,如果满足,则可认为该触摸操作与预定的第一规则匹配。
此外,预定的第一规则还可以是包含某种形状的图像,在这种实现方式下,处理器可以利用图像匹配的技术,将本次触摸操作过程中接收到的触摸操作的位置信息所构成的形状与预定的形状进行比对,如果比对的结果一致,则可以认为触摸操作与预定的第一规则匹配。
预定的第一规则的内容、以及判断触摸操作与预定的第一规则是否匹配的方式可以是除 上述例子外的其他内容和方式,本申请不做限定。
在本步骤中,如果处理器判断触摸操作与预定的第一规则匹配,则初步判断用户A想要启动语音输入,继续执行本方法中的后续步骤;反之,如果处理器判断用户A的触摸操作与预定的第一规则不匹配,则认为用户A是其他的触摸操作,流程结束。
S408、处理器根据触摸操作的位置信息确定触摸屏上的输出位置。
如S406中描述,处理器在判断触摸操作符合预定的第一规则时,可以判断用户A想要启动语音输入,并且还可以从该触摸操作的位置信息中确定用户A指示的触摸屏上输出位置。其中,该输出位置用于指示语音内容输出在触摸屏上的起始位置和范围,即显示语音内容的输出框的位置和大小。下面以输出框的形状为矩形的情况为例,说明所述输出位置包含的内容和确定的方式。
在第一种实现方式下,所述输出位置包含输出框的起始位置、以及输出框的宽度和高度。如果以从左到右、从上到下的输出顺序为例,则上述输出框的起始位置可以是输出框左上角的坐标。上述输出位置的确定方式可以为:以触摸屏的左上角的点为原点,以原点为端点的一边为横轴,以原点为端点的另一边为纵轴。从接收到的触摸操作的位置信息所包含的触点集合中,选出最小水平坐标对应的触点或者最小垂直坐标对应的触点为第一触点,以及选出最大水平坐标对应的触点或者最大垂直坐标对应的触点为第二触点。例如图5A-图5B所示的触点集合中,处理器可以将触点A或B作为第一触点,将触点C或者D作为第二触点。然后将第一触点作为输出框的起始位置,并将第二触点与第一触点的水平坐标差作为输出框的宽度、将第二触点与第一触点的垂直坐标差作为输出框的高度。当以触点A为第一触点、且触点C为第二触点时,输出框的起始位置、以及其宽度和高度如图5A所示。当以触点A为第一触点、且触点D为第二触点时,输出框的起始位置、以及其宽度和高度如图5B所示。当以触点B为第一触点,以触点C或D作为第二触点的情况同理可得。
上述输出位置的确定方式也可以为:以触摸屏的左上角的点为原点,从接收到的触摸操作的位置信息所包含的触点集合中,将最小的水平坐标和最小的垂直坐标构成的点(该点可能不是实际的触点)作为第一触点,将最大的水平坐标和最大的垂直坐标构成的点作为第二触点(该点也可能不是实际的触点)。然后用上述类似的方式,可确定输出框的起始位置、及其宽度和高度。例如在图5A-图5C所示的触点集合中,取触点A的水平坐标xa和触点B的垂直坐标yb构成的点(xa,yb)作为第一触点。取触点D的水平坐标xd和触点C的垂直坐标yc构成的点(xd,yc)作为第二触点。此时,输出框的起始位置、以及其宽度和高度如图5C所示。
在第二种实现方式下,所述输出位置包含输出框的起始位置,及结束位置。如果仍以从左到右、从上到下的输出顺序为例,则输出框的起始和结束位置分别为输出框左上角和右下角的坐标。类似上述第一种实现方式中的介绍,处理器可以从接收到的触摸操作的位置信息所包含的触点集合,确定输出框的左上角和右下角的坐标。例如,可以分别将第一种实现方式中确定的所述第一触点和第二触点的坐标分别作为输出框的左上角和右下角的坐标,即可确定该输出框的起始位置和结束位置。具体过程不再赘述。
可选地,系统可以定义输出框的高度为预设的值,在这种实现方式下,本步骤只需要确定输出框的起始位置,以及宽度。可替代地,系统也可以定义输出框的高度和宽度都为预设的值,在这种实现方式下,本步骤只需要确定输出框的起始位置。
上述输出位置包含的内容和确定方式仅为示例性的描述;在有些实现方式下,系统也可以定义上述显示语音内容的输出框的形状为除矩形之外的其他形状,如菱形、圆形等。相应地,输出位置中包含的内容以及确定方式可能是上述第一种和第二种实现方式之外的其他实现方式,本申请实施例对此不做限定。
可选的,处理器还可以根据确定出的输出位置(即输出框的位置和大小),调用渲染指令生成该输出位置,以便于在显示屏上可以显示该输出位置。例如,可以以边框的形式生成并输出上述输出位置,输出框的边框可以是如图5A-图5C所示的黑色虚线框,也可以是其他颜色的虚线框,还可以是黑色实线框或其他颜色的实线框。处理器可以通过改变输出位置背景颜色的方式、或者通过任意能够将输出位置与显示屏上当前显示的背景区分开的方式显示输出位置。本申请实施例对此不作限定。
S410-S412、语音采集器采集、并向处理器发送第一语音。
具体地,语音采集器实时采集第一语音信号,并将采集到的第一语音发送给处理器。
可以理解的是,由于语音采集器所采集的语音信号来自于会场中,因而,上述第一语音可以是由会场中的用户A发出的,也可以由会场中的其他用户发出的。
S414、处理器接收第一语音,并判断第一语音与预定的第二规则是否匹配。
在本实施例中,所述第二规则是用以判断发出第一语音的用户,例如用户A,是否想启动语音输入。
预定的第二规则,可以是特定内容的文本,如“启动语音输入”。
判断第一语音与预定的第二规则是否匹配的方法,可以是首先利用语音识别的技术将接收到的第一语音信号转换为文本,然后采用文本匹配的方法判断接收到的第一语音的内容是否与上述特定内容一致。如果一致,则判断为第一语音与预定的第二规则匹配;反之,则可以判断为不匹配。需要理解的是,处理器所采用文本匹配方式,可以不仅是字面上匹配,还可以是语义层面的匹配。例如,如果处理器接收的第一语音的内容为“开始语音输入”,处理器确定该语音的内容与上述特定内容“启动语音输入”在语义层面是匹配的,因而判断该第一语音与预定的第二规则匹配。
上述预定的第二规则的内容、以及判断第一语音与预定的第二规则是否匹配的方式可以是除上述例子外的其他内容和方式,本申请实施例不作限定。
在本步骤中,如果处理器判断第一语音与预定的第二规则匹配,并且结合在上述步骤S406中,处理器已经接收到符合第一规则的触摸操作,则认为该第一语音是由用户A发出的,即认为步骤S406中接收到的触摸操作和本步骤中接收到到的第一语音分别是由同一个用户执行和发出的,并确认用户A要启动语音输入,则继续执行本方法后续的步骤。如果处理器判断第一语音与预定的第二规则不匹配,则认为该第一语音不是由用户A发出的,即认为步骤S406中接收到的触摸操作和本步骤中接收到的第一语音并不是由同一个用户执行和发出的,例如可能是由会场中的其他用户随意发出的语音。此时处理器确认不是用户A想要启动语音输入,则不会执行后续的步骤。
S416、处理器建立第一语音的声纹与所述输出位置的对应关系。
具体地,处理器对第一语音信号进行去噪等处理,从处理后的信号中提取出第一语音的声纹特征,例如如声学或语言特征,并建立第一语音的声纹,即用户A的声纹特征,与S408 中确定的所述输出位置,即用户A的触摸操作所指示的输出位置之间的对应关系。建立对应关系的具体过程可以包括:将用户A的声纹特征数据作为key、以S405中确定的触摸屏上的输出位置(假设该输出位置为(x1,y1,w1,h1))作为value,按照字典或哈希表的方式保存如表1所示的声纹库,以便于在后续步骤中可以通过用户A的声纹确定其对应的触摸屏的输出位置。
表1
用户 声纹(key) 输出位置(value)
A 声纹特征A (x1,y1,w1,h1)
S418-S420、语音采集器采集第二语音、并向处理器发送第二语音。
在会议进行的过程中,语音采集设备会采集到不同人的语音,因此,第二语音可能是由用户A发出的,也可能是由其他用户发出的。
S422、处理器接收第二语音,并判断第二语音的声纹与第一语音的声纹是否匹配。
如上所述,由于第二语音可能是由任意用户发出的,在某些场景下,对于用户A在输出语音到输出位置的过程中,并不希望被别的用户的语音所干扰。因此,处理器在接收到第二语音时,会判断第二语音是否是由用户A发出。
具体地,处理器可以对第二语音采取类似S414中对第一语音的处理方式,对第二语音进行去噪等处理后、再提取其声纹特征;然后采用声纹识别的技术判断第二语音的声纹与表1所示的声纹库中用户A的声纹是否匹配,从而可以判断第二语音是否是由用户A发出的。
如果本步骤的判断结果为是,说明是用户A的语音,则执行步骤S424。如果判断的结果为否,说明不是用户A的语音,则可以丢弃该语音信号,并且回到步骤S418,重新执行语音信号的采集和判断的过程。
S424-S426、处理器识别第二语音的内容,并将该内容输出并显示到触摸屏上所述输出位置上。
处理器可以采用语音识别的技术将第二语音信号转换为文本内容,然后将该文本内容输出、显示到触摸屏上所述输出位置上;其中,所述输出位置是指S416中建立的与第一语音的声纹相对应的输出位置。
处理器输出第二语音内容的过程可以为:根据输出位置信息和文本内容,调用渲染指令生成包含所述第二语音内容的文本窗口,再根据所述输出位置信息将生成的所述文本窗口与其它的窗口(例如系统背景窗)融合生成像素阵列,然后将该像素阵列输出到预定的存储区(例如系统的帧缓冲区)。这样,在系统发出显示信号(例如垂直同步信号)时,所述预定存储区的内容就可以显示在触摸屏上了。上述处理器输出第二语音内容的方式仅为示例性描述。根据处理器所运行的操作系统的实际显示机制的不同,处理器输出所述第二语音内容的方式也可能不同,本申请实施例对此不作限定。
在执行完本步骤之后,本申请实施例还可以继续采集第三语音,并且对第三语音采取与第二语音一致的处理和判断过程;这样,在用户A成功启动了语音输入功能后,可以持续地通过语音输入向用户A指示的输出框(通过上述触摸操作所指示的输出位置信息)中输出内容。
可以理解的是,如果用户A通过实际输入的语音内容超过了输出位置信息所指示的输出 框的范围,则可以根据情况生成输出框的水平或垂直方向的滚动条,以保证用户A输入的语音内容都被记录下来,并且可以通过滚动条展示记录的所有内容。在这种情况下,处理器可以在所述输出位置默认输出最近输入的内容。当检测到用户在操作所述滚动条时,根据操作的滚动条的位置,输出相应部分的语音文本内容,以便于用户可以翻看自己想看的内容。
此外,实际应用中,语音采集器可以采用阵列麦克风。由于阵列麦克风可以定位声源位置,在另外一种实现中,在上述的步骤S414和S416之间,还可以包括下述步骤415。
S415、处理器判断第一语音的声源的位置与触摸操作的位置信息是否满足预设的条件。
执行触摸操作的用户的位置通常在触摸操作的位置附近,因而处理器可以通过判断第一语音所属的声源的位置与触摸操作的位置信息是否满足预设的条件,来验证发出第一语音的是否为用户A(即进一步验证步骤S406中接收的执行触摸操作的用户和步骤S414中接收的发出第一语音的用户是否为同一个用户)。
具体地,处理器接收阵列麦克风采集的第一语音,根据第一语音计算第一语音所属的声源位置,得到声源相对于阵列麦克风的位置。如上述在人机交互系统(参考图2或图3)的介绍中所描述,通常阵列麦克风集成在触摸屏的上边沿或下边沿上(即阵列麦克风的集成位置满足与触摸屏的水平方向平行,且其中点与触摸屏的水平方向的中点重合),因此,也可以将所述声源相对于阵列麦克风的位置作为所述声源相对于触摸屏的位置;该位置包括声源相对于触摸屏的垂直距离,以及相对于触摸屏水平方向的中点的水平距离。
假设采集第一语音的阵列麦克风为包含n个麦克风(MC1~MCn)的阵列麦克风,利用该阵列麦克风计算声源的位置的过程如下:
参考图6A,将MC1和MC2作为麦克风阵列的左端的子麦克阵列,C1是该左端的子麦克阵列的中点;同理,将MC3和MC4作为右端的子麦克阵列,C2是该右端的子麦克阵列的中点。
首先利用声源到达MC1和MC2的时间差(即MC1和MC2采集语音信号的时间差)及MC1与MC2之间的距离,计算出声源和C1的连线与阵列麦克风的夹角为α1;同理可以计算出声源和C2的连线与阵列麦克风的夹角为α2。
进一步地,参考图6B,由于C1和C2之间的距离是已知的,因而可以根据三角函数关系计算出声源相对于阵列麦克风的垂直距离(即相对于触摸屏的垂直距离)H,以及相对于阵列麦克风的中点的水平距离(即相对于触摸屏的水平方向的中点的水平距离)W。
可以理解的是,触摸操作的位置相对于触摸屏的垂直距离为0,相对于触摸屏水平方向的中点的水平距离可以根据触摸操作的位置信息中的水平坐标获得;因此,可以以触摸屏作为参考,得到声源与触摸操作的位置的垂直距离(即声源相对于触摸屏的垂直距离),以及声源与触摸操作的位置的水平距离。进而判断如果声源与触摸操作的位置的垂直距离不超过预定的范围(例如,基于经验值设定为0.5m),且声源与触摸操作的位置的水平距离也不超过预定的范围(例如,基于经验值设定为0.5m),则认为发出第一语音的是用户A(即本步骤判断的结果为是时,认为执行触摸操作的用户和发出第一语音的用户为同一个用户);否则,则认为发出第一语音的不是用户A(即本步骤判断的结果为否时,认为执行触摸操作的用户和发出第一语音的用户不是同一个用户)。
可选地,上述判断执行触摸操作的用户和发出第一语音的用户不是同一个用户的方式还可以通过对触摸屏前的图像的采集和跟踪来实现。假设在本申请实施例中,摄像头部署在触摸屏上,例如位于触摸屏上边沿的中心(例如图2或图3所示的位置),可以采集到触摸屏前 方附近的图像并实时发送给处理器;处理器对摄像头采集发送的图像实时分析和跟踪,例如对图像中的人的肢体动作和唇动信息进行分析和跟踪,从而判断执行触摸操作的用户与发出第一语音的用户是否为同一个用户。
此外,还可以通过联合第一语音的声源的位置计算和图像跟踪,来判断执行触摸操作的用户与发出第一语音的用户是否为同一个用户。在这种情况下,假设摄像头部署在阵列麦克风的中点上,如图6C所示。此时,处理器可以通过机器学习算法识别执行触摸操作的用户的人脸特征,例如用户A的人脸特征Face-A,并将所述Face-A与用户A的触摸操作指示的位置,例如(x1,y1,w1,h1)绑定。然后计算用户A的人脸特征即Face-A相对于摄像头的夹角a1-A。通过上述声源位置计算可以得到第一语音的声源相对于阵列麦克风中点(即摄像头)的夹角a2-A。然后,通过比较上述a1-A和a2-A,如果二者相差在预定的范围内,则认为第一语音与Face-A是属于同一个用户的特征,因此可以判断执行上述触摸操作的用户与发出第一语音的用户是为同一个用户,因而可以将第一语音的声纹与(x1,y1,w1,h1)对应起来。
当会场上有另一个用户与用户A同时启动语音输入时,通过上述方法,可以避免将用户A的声纹与所述另一个用户所指示的输出位置建立对应关系,因而可以提升系统的鲁棒性。
还需要说明的是,上述步骤S402-S406中对用户A触摸操作的检测和判断,以及步骤S410-S414中对用户A的语音信号的采集和判断,是判断用户A是否要启动语音输入的两个条件,两个条件的判断没有先后顺序,即步骤S402-S406与步骤S410-S414的执行没有现有顺序,可以先执行触摸操作的检测和判断,也可以先执行语音信号的采集和判断。
上述描述可知,本申请实施例中,终端设备根据用户A的触摸操作和语音输入判断用户A想要启动语音输入时,并将用户A的声纹与用户A指示的文本输出位置对应起来。这样,当会场中有多个人发言时(包括用户A),可以通过声纹匹配,实现只将用户A的语音内容、而不是会场中其他用户的语音内容显示到上述用户A指示的文本输出位置上(而不是触摸屏上的其他位置上)。
实际应用中,可能存在如果用户A在指定的输出位置上启动了语音输入后,想要更换语音输入的输出位置的场景,例如如果用户A想要将输出位置从(x1,y1,w1,h1)更换到(x3,y3,w3,h3)时,本申请可通过类似的过程完成用户A的声纹与(x3,y3,w3,h3)的绑定,从而实现将用户A的语音内容输出到(x3,y3,w3,h3)指示的输出框内。在这种场景下,系统保存的声纹库会从上述表1示意的内容更新到如下所示的表2所示意的内容。具体地,系统可通过如下两种可能的方式来实现该场景:
第一种可能的实现方式可以为:用户A可以通过发出符合第三规则的语音(例如该语音内容与“关闭语音输入”匹配)来指示系统解除用户A的声纹与输出位置(x1,y1,w1,h1)的对应关系。具体地,当处理器检测符合第三规则的语音时,提取该语音的声纹,并将该声纹与系统保存的声纹库中声纹逐一比对,然后将能与该声纹匹配上的声纹和其对应的输出位置从所述声纹库中删除。然后,用户A再通过上述类似的过程(执行符合第一规则的触摸操作和发出符合第二规则的语音)来指示系统再次启动语音输入,建立其声纹新指定的输出位置(x3,y3,w3,h3)的对应关系。
第二种可能的实现方式可以为:用户A直接通过上述类似的过程来指示系统重新启动语音输入,建立其声纹与新指定的输出位置(x3,y3,w3,h3)的对应关系。在这种实现方式下,当处理器接收到符合第二规则的语音时,判断该语音的声纹已经在系统保存的声纹表中 存在,则直接更新该声纹对应的输出位置,从而将用户A的声纹对应的输出位置更新为(x3,y3,w3,h3)。
表2
用户 声纹(key) 输出位置(value)
A 声纹特征A (x3,y3,w3,h3)
可以理解的是,如上所述,会场上通常会有多人讨论,因此,不仅用户A在触摸屏上操作,在用户A启动和执行语音输入的过程中,其他用户,比如用户B也可以在同一触摸屏上操作。基于同样的实现,本申请也可以实现同时将用户B的语音输出显示到触摸屏上用户B所指示的输出位置上。具体地,处理器在会在上述表2的基础上,增加保存用户B的声纹与用户B指示的输出位置的对应关系。此时,系统保存的声纹库为如表3所示的内容。
表3
用户 声纹(key) 输出位置(value)
A 声纹特征A (x3,y3,w3,h3)
B 声纹特征B (x2,y2,w2,h2)
需要说明的是,当用户A和用户B同时启动语音输入时,需要采用上述步骤S415来进一步验证执行触摸操作的用户与发出语音的用户是否为同一个用户,从而可以避免将用户A的声纹与用户B所指示的输出位置建立对应关系,因而可以提升系统的鲁棒性。在这种情况下,则在S422中,处理器可以将第二语音的声纹与表3所示意的声纹库中的声纹逐一比对。如果能与其中一个声纹匹配上,则认为S422的判断结果为是。反之,如果第二语音不能与系统保存的声纹库中的声纹中的任何一个匹配上,则认为S422的判断结果为否,则不输出第二语音。
因此,本申请实施例通过声纹匹配识别第二语音对应的用户身份,可以避免在多人同时通过语音输入时,因无法区分发出语音的用户的身份而将不同用户的语音内容混淆显示在触摸屏上的问题。
此外,实际应用中,用户A已经解除了跟特定输出位置的绑定关系,而用户C可能想要对用户A历史指定的输出位置上的内容,例如输出位置(x1,y1,w1,h1)中的内容,进行修改或补充。此时,系统基于同样的实现建立用户C的声纹与输出位置(x1,y1,w1,h1)的对应关系,并将用户C的语音内容追加输出在输出框(x1,y1,w1,h1)已有的内容之后。或者系统基于同样的实现建立用户C的声纹与输出位置(x1’,y1’,w1’,h1’)的对应关系,并将用户C的语音内容输出在输出框(x1’,y1’,w1’,h1’)内;其中,(x1’,y1’,w1’,h1’)是叠加在(x1,y1,w1,h1)之上的输出框。因此,在第一种可能的场景下,系统保存的声纹库的内容可以如表4A-1或表4A-2所示。
表4A-1
用户 声纹(key) 输出位置(value)
A 声纹特征A (x3,y3,w3,h3)
B 声纹特征B (x2,y2,w2,h2)
C 声纹特征C (x1,y1,w1,h1)
表4A-2
用户 声纹(key) 输出位置(value)
A 声纹特征A (x3,y3,w3,h3)
B 声纹特征B (x2,y2,w2,h2)
C 声纹特征C (x1’,y1’,w1’,h1’)
此外,可能还存在用户C对用户A当前输出位置上,例如对(x3,y3,w3,h3)的内容进行修改或补充的场景。同理,系统可以建立用户C与输出位置(x3,y3,w3,h3)的对应关系,由于用户A和用户C的声纹与同一个输出位置对应,因此用户A和用户C可以同时在(x3,y3,w3,h3)中输入语音内容,此时,系统可以按照用户A和用户C的语音的采集时间的先后关系,依次输出到(x3,y3,w3,h3)上。此外,系统还可以通过建立用户C与叠加在(x3,y3,w3,h3)之上的输出位置(x3’,y3’,w3’,h3’)的对应关系来实现用户C对用户A输入的内容进行补充或修改的场景。该过程类似上述在第一种可能的场景中的实现描述,因此不再赘述。因此,在第二种可能的场景下,系统保存的声纹库的内容可以如表4B-1或表4B-2所示。
表4B-1
用户 声纹(key) 输出位置(value)
A 声纹特征A (x3,y3,w3,h3)
B 声纹特征B (x2,y2,w2,h2)
C 声纹特征C (x3,y3,w3,h3)
表4B-2
用户 声纹(key) 输出位置(value)
A 声纹特征A (x3,y3,w3,h3)
B 声纹特征B (x2,y2,w2,h2)
C 声纹特征C (x3’,y3’,w3’,h3’)
可选地,在上述两种可能的场景中,系统可以区分显示用户A和用户C输入的语音内容。具体地,可以将用户C补充追加的内容或者叠加输出的内容以不同于用户A的语音内容的格式显示在触摸屏上,例如采用与显示用户A的语音文本时不同的颜色、字体或者其他格式等显示用户C的语音内容,本申请对系统采用不同的格式显示不同用户的语音内容的方式不作限定。
可见,本申请实施例,还可以让多个用户以互相协作的方式实现通过语音在触摸屏上输入内容,提升了输入的效率和体验。
上述描述的会场中的各种实施场景并不构成本本申请实施例的限定,可以理解的是,本申请实施例还可以应用在其他的通过语音实现触摸屏输入的场景,例如在教育场景中,老师、学生之间以语音的方式在交互式电子黑板上独立或者协作输入内容,或者在家庭场景中,例如家庭成员之间以语音的方式在平板、具有触摸功能的电脑显示屏或者电子白板上独立或者协作输入内容等。
本申请实施例可以根据上述方法示例对人机交互系统进行划分。参考图1,图1示出的人机交互系统可以用于执行人机交互方法,例如用于执行图4所示的方法。所述的人机交互系统包括:触摸屏13和处理器11,存储器12。其中,
触摸屏13,用于接收触摸操作,并将该触摸操作的位置信息发送给处理器11。
处理器11,用于建立第一声纹与触摸屏13上的第一输出位置的对应关系。
处理器11,还用于接收第一语音、并在判断该语音的声纹与所述第一声纹匹配时,识别该语音的内容,并将该语音的内容输出并显示到所述第一输出位置处。
存储器12用于存储数据、指令或者程序代码。处理器11调用并执行存储器12中存储的指令或程序代码时,能够实现本申请实施例提供的人机交互方法。
在上述过程中,处理器11在上述建立第一声纹与触摸屏13上的第一输出位置的对应关系时,具体用于:接收触摸操作的位置信息,以及接收第二语音;并在判断所述触摸操作与预定的第一规则匹配、且所述第二语音与预定的第二规则匹配时,根据所述触摸操作的位置确定触摸屏13上的第一输出位置,并从所述第二语音中提取第一声纹,然后建立所述第一声纹与触摸屏13上的第一输出位置的对应关系。例如,结合图4,触摸屏13可以用于执行S402和S404,处理器11可以用于执行S406、S408,S414,S416,S422,S424以及S426。在上述过程中,处理器11在根据所述触摸操作的位置确定触摸屏13上的第一输出位置时,具体用于:根据所述触摸操作的位置中的触点位置集合确定触摸屏13上的第一输出位置的起始位置和范围。可选地,处理器11还用于输出并在触摸屏13上显示所述第一输出位置。
可选地,处理器11还用于解除第一声纹与触摸屏13上的第一输出位置的对应关系。
可选地,处理器11还用于建立第二声纹与触摸屏13上的第一输出位置的对应关系。此时,处理器11在接收到第三语音、并判断所述第三语音的声纹与所述第二声纹匹配时,识别所述第三语音的内容,并将该内容输出并显示到所述触摸屏13的所述第一输出位置的空白处或者覆盖上述第一语音的内容。可选地,在这种情况下,由于处理器11还用于在输出第三语音的内容时,选择不同于输出上述第一语音的内容时的输出格式。
可选地,当处理器11接收的所述第二语音是由阵列麦克风采集时,还用于计算第二语音的声源的位置,并在上述建立所述第一声纹与所述第一输出位置的对应关系之前,还用于判断所述第二语音的声源位置与所述触摸操作的位置是否满足预设的条件。例如,结合图4,处理器11可以用于执行S415。
可选地,处理器11还用于接收触摸屏13前方附近的图像,并在上述判断所述触摸操作与预定的第一规则匹配、且所述第一语音与预定的第二规则匹配之后,还用于实时分析和跟踪该图像,并根据图像分析和跟踪的结果判断执行所述触摸操作的用户与发出所述第一语音的用户是否为同一个用户。
可选的,触摸屏13,还用于根据处理器11的指令显示文本内容、输出位置或者滚动条。
可选的,语音采集器14可以用于采集语音,并将所采集的语音发送给处理器11。例如结合图4,语音采集器14可以用于执行S410,S412,S418以及S420。
可选地,图像采集器15可以用于采集触摸屏13前方附近的图像,并将该图像实时发送给处理器11。
关于上述可选方式的具体描述参见前述的方法实施例,此处不再赘述。此外,上述提供的任一种人机交互系统的解释以及有益效果的描述均可参考上述对应的方法实施例,不再赘述。
另外,本申请实施例可以根据上述方法示例对上述处理器,或者包含处理器的计算机设备进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块既可以采用硬件的形式实现,也可以采 用软件功能模块的形式实现。需要说明的是,本申请实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
如图7所示,为本申请实施例提供的一种处理器或计算机设备的结构示意图。处理器或计算机设备用于执行上述人机交互方法,例如用于执行图4所示的方法。其中,处理器或计算机设备可以包括语音处理单元702以及综合处理单元704。
语音处理单元702用于接收第一语音,并在判断第一语音的声纹与第一声纹匹配时,识别第二语音的内容。
综合处理单元704建立所述第一声纹与第一输出位置的对应关系;还用于将所述第一语音的内容输出到所述第一输出位置。
上述处理器或计算设备还包括触点位置处理单元701,用于接收触点位置,该触点位置由触摸操作产生。触点位置处理单元701还用于在判断触点位置与预定的第一规则匹配时,根据所述触点位置确定第一输出位置。
语音处理单元702还用于接收第二语音,并在判断第二语音与预定的第二规则匹配时,从第一语音中提取第一声纹。
结合图4,触点位置处理单元701可以用于执行S406和S408,语音处理单元702可以用于执行S414、S422和S424,综合处理单元704可以用于执行S416和S426。
在上述过程中,触点位置处理单元701在根据触点位置确定第一输出位置时,具体用于:根据所述触点位置中的触点位置集合确定第一输出位置的起始位置和范围。可选地,综合处理单元705还用于输出所述第一输出位置。
可选地,综合处理单元704还用于解除第一声纹与第一输出位置的对应关系。
可选地,所述语音处理单元702还用于接收第三语音,在判断所述第三语音的声纹与第二声纹匹配时,识别所述第三语音的内容。综合处理单元704还用于建立所述第二声纹与第一输出位置的对应关系,并将所述第三语音的内容输出到所述第一输出位置,例如输出到所述第一输出位置的空白处或者覆盖上述第一语音的内容。
可选地,当语音处理单元702接收的所述第二语音是由阵列麦克风采集的时,还用于计算第二语音的声源的位置。此时,综合处理单元704还用于在上述建立第一声纹的标识与第一输出位置的对应关系之前,判断第二语音的声源的位置与所述触点位置是否满足预设的条件。例如结合图4,综合处理单元704可以用于执行S415。
可选地,本处理器或计算机设备还可以包括图像处理单元703,用于接收图像,并在所述综合处理单元704建立第一声纹的标识与第一输出位置的对应关系之前,实时分析和跟踪该图像,然后根据分析和跟踪的结果判断执行所述触摸操作的用户与发出所述第一语音的用户是否为同一个用户。
上述主要从方法的角度对本申请实施例提供的方案进行了介绍。为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,本申请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
作为一个示例,结合图1,计算机设备中的触点位置处理单元701、语音处理单元7022、 综合处理单元704以及图像处理单元703实现的功能与图1中的处理器11的功能相同。
关于上述可选方式的具体描述参见前述的方法实施例,此处不再赘述。此外,上述提供的任一种处理器或计算机设备的解释以及有益效果的描述均可参考上述对应的方法实施例,不再赘述。
本申请另一实施例还提供一种计算机可读存储介质,该计算机可读存储介质中存储有指令,当指令在人机交互系统或者计算机设备上运行时,该人机交互系统或者计算机设备执行上述方法实施例所示的方法流程中人机交互系统或者计算机设备执行的各个步骤。
在一些实施例中,所公开的方法可以实施为以机器可读格式被编码在计算机可读存储介质上的或者被编码在其它非瞬时性介质或者制品上的计算机程序指令。
图8示意性地示出本申请实施例提供的计算机程序产品的概念性局部视图,所述计算机程序产品包括用于在计算设备上执行计算机进程的计算机程序。
在一个实施例中,计算机程序产品是使用信号承载介质90来提供的。所述信号承载介质90可以包括一个或多个程序指令,其当被一个或多个处理器运行时可以提供以上针对图4描述的功能或者部分功能。因此,例如,参考图4中S402~S426的一个或多个特征可以由与信号承载介质90相关联的一个或多个指令来承担。此外,图8中的程序指令也描述示例指令。
在一些示例中,信号承载介质90可以包含计算机可读介质91,诸如但不限于,硬盘驱动器、紧密盘(CD)、数字视频光盘(DVD)、数字磁带、存储器、只读存储记忆体(read-only memory,ROM)或随机存储记忆体(random access memory,RAM)等等。
在一些实施方式中,信号承载介质90可以包含计算机可记录介质92,诸如但不限于,存储器、读/写(R/W)CD、R/W DVD、等等。
在一些实施方式中,信号承载介质90可以包含通信介质93,诸如但不限于,数字和/或模拟通信介质(例如,光纤电缆、波导、有线通信链路、无线通信链路、等等)。
信号承载介质90可以由无线形式的通信介质93(例如,遵守IEEE 802.11标准或者其它传输协议的无线通信介质)来传达。一个或多个程序指令可以是,例如,计算机可执行指令或者逻辑实施指令。
在一些示例中,诸如针对图4描述的人机交互系统好或者计算机设备可以被配置为,响应于通过计算机可读介质91、计算机可记录介质92、和/或通信介质93中的一个或多个程序指令,提供各种操作、功能、或者动作。
应该理解,这里描述的布置仅仅是用于示例的目的。因而,本领域技术人员将理解,其它布置和其它元素(例如,机器、接口、功能、顺序、和功能组等等)能够被取而代之地使用,并且一些元素可以根据所期望的结果而一并省略。另外,所描述的元素中的许多是可以被实现为离散的或者分布式的组件的、或者以任何适当的组合和位置来结合其它组件实施的功能实体。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件程序实现时,可以全部或部分地以计算机程序产品的形式来实现。该计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行计算机执行指令时,全部或部分地产生按照本申请实施例的流程或功能。计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,计算机指令可以从一个网站站点、计算机、服务器或者数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line, DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可以用介质集成的服务器、数据中心等数据存储设备。可用介质可以是磁性介质(例如,软盘、硬盘、磁带),光介质(例如,DVD)、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
以上所述,仅为本申请的具体实施方式。熟悉本技术领域的技术人员根据本申请提供的具体实施方式,可想到变化或替换,都应涵盖在本申请的保护范围之内。

Claims (33)

  1. 一种人机交互的方法,其特征在于,应用于人机交互系统,所述方法包括:
    建立第一声纹与触摸屏上的第一输出位置的对应关系;
    接收第一语音,在判断所述第一语音的声纹与所述第一声纹匹配时,识别所述第一语音的内容,输出并显示所述第一语音的内容到所述第一输出位置。
  2. 如权利要求1所述的方法,其特征在于,所述建立第一声纹与触摸屏上的第一输出位置的对应关系,包括:
    接收触摸操作以及第二语音;
    判断所述触摸操作与预定的第一规则匹配,且所述第二语音与预定的第二规则匹配时;
    根据所述触摸操作的位置确定所述触摸屏上的第一输出位置;
    从第二语音中提取第一声纹;
    建立所述第一声纹与所述触摸屏上的第一输出位置的对应关系。
  3. 如权利要求2所述的方法,其特征在于,所述根据所述触摸操作的位置确定所述触摸屏上的第一输出位置,包括:
    根据所述触摸操作的位置中的触点位置集合确定所述触摸屏上的第一输出位置的起始位置和范围。
  4. 如权利要求1-3任一所述的方法,其特征在于,包括:以与所述触摸屏上当前背景相区分的方式输出并在所述触摸屏上显示所述第一输出位置所指示的区域。
  5. 如权利要求2所述的方法,其特征在于,在所述建立所述第一声纹与所述触摸屏上的第一输出位置的对应关系之前,还包括:计算所述第二语音的声源的位置,判断所述位置与所述触摸操作的位置满足预设的条件。
  6. 如权利要求1-5任一所述的方法,其特征在于,还包括:解除所述第一声纹与所述触摸屏上的第一输出位置的对应关系。
  7. 如权利要求1-6任一所述的方法,其特征在于,还包括:
    建立第二声纹与触摸屏上的所述第一输出位置的对应关系;
    接收第三语音,在判断所述第三语音的声纹与所述第二声纹匹配时,识别所述第三语音的内容,输出并显示所述第三语音的内容到所述第一输出位置的空白处。
  8. 如权利要求1-6任一所述的方法,其特征在于,还包括:
    建立第二声纹与触摸屏上的所述第一输出位置的对应关系;
    接收第三语音,在判断所述第三语音的声纹与所述第二声纹匹配时,识别所述第三语音的内容,输出并显示所述第三语音的内容到所述第一输出位置,并覆盖所述第一语音的内容。
  9. 如权利要求7或8所述的方法,其特征在于,所述输出并显示所述第三语音的内容到所述触摸屏上所述第一输出位置,包括:以不同于所述输出并显示所述第一语音的内容的格式 输出并显示所述第三语音的内容。
  10. 一种人机交互的方法,其特征在于,应用于计算机设备,所述方法包括:
    建立第一声纹与第一输出位置的对应关系;
    接收第一语音,在判断所述第一语音的声纹与所述第一声纹匹配时,识别所述第一语音的内容,输出所述第一语音的内容到所述第一输出位置。
  11. 如权利要求10所述的方法,其特征在于,所述建立第一声纹与第一输出位置的对应关系,包括:
    接收触点位置以及第二语音,所述触点位置由触摸操作产生;
    在判断所述触点位置与预定的第一规则匹配,且所述第二语音与预定的第二规则匹配时,根据所述触点位置确定所述第一输出位置,从第二语音中提取第一声纹,建立所述第一声纹与所述第一输出位置的对应关系。
  12. 如权利要求11所述的方法,其特征在于,所述根据所述触点位置确定所述第一输出位置,包括:根据所述触点位置中的触点位置集合确定所述第一输出位置的起始位置和范围。
  13. 如权利要求10-12所述的方法,其特征在于,所述方法包括:以与当前背景相区分的方式输出所述第一输出位置所指示的区域。
  14. 如权利要求12所述的方法,其特征在于,在所述建立所述第一声纹与所述第一输出位置的对应关系之前,还包括:计算所述第二语音的声源的位置,判断所述位置与所述触点位置满足预设的条件。
  15. 如权利要求10-13任一所述的方法,其特征在于,还包括:解除所述第一声纹与所述第一输出位置的对应关系。
  16. 如权利要求10-13任一所述的方法,其特征在于,还包括:
    建立第二声纹与所述第一输出位置的对应关系;
    接收第三语音,在判断所述第三语音的声纹与所述第二声纹匹配时,识别所述第三语音的内容,输出所述第三语音的内容到所述第一输出位置的空白处。
  17. 如权利要求10-13任一所述的方法,其特征在于,还包括:
    建立第二声纹与所述第一输出位置的对应关系;
    接收第三语音,在判断所述第三语音的声纹与所述第二声纹匹配时,识别所述第三语音的内容,输出所述第三语音的内容到所述第一输出位置,并覆盖所述第一语音的内容。
  18. 如权利要求16或17所述的方法,其特征在于,所述输出所述第三语音的内容到所述第一输出位置,包括:以不同于所述输出所述第一语音的内容的格式输出所述第三语音的内容。
  19. 一种人机交互系统,其特征在于,所述人机交互系统包括触摸屏和处理器;
    所述触摸屏,用于接收触摸操作;
    所述处理器,用于建立第一声纹与所述触摸屏上的第一输出位置的对应关系;还用于接收第一语音,并在判断所述第一语音的声纹与所述第一声纹匹配时,识别所述第一语音的内容,输出并显示所述第一语音的内容到所述第一输出位置。
  20. 如权利要求19的系统,其特征在于,所述处理器在所述建立第一声纹与所述触摸屏上的第一输出位置的对应关系时,具体用于:
    接收第二语音;
    在判断所述触摸操作与预定的第一规则匹配、且所述第二语音与预定的第二规则匹配时,根据所述触摸操作的位置确定所述触摸屏上的第一输出位置,从所述第二语音中提取第一声纹,建立所述第一声纹与触摸屏上的第一输出位置的对应关系。
  21. 如权利要求20所述的系统,其特征在于,所述处理器在根据所述触摸操作的位置确定所述触摸屏上的第一输出位置时,具体用于:根据所述触摸操作的位置中的触点位置集合确定所述触摸屏上的第一输出位置的起始位置和范围。
  22. 如权利要求19-21任一所述的系统,其特征在于:所述处理器还用于以与当前触摸屏背景相区分的方式输出并在所述触摸屏上显示所述第一输出位置所指示的区域。
  23. 如权利要求18-22任一所述的系统,其特征在于,所述处理器在所述建立所述第一声纹与触摸屏上的第一输出位置的对应关系之前,还用于:计算所述第二语音的声源的位置,并判断所述位置与所述触摸操作的位置是否满足预设的条件。
  24. 如权利要求18-23任一所述的系统,其特征在于,所述处理器还用于:解除所述第一声纹与所述触摸屏上的第一输出位置的对应关系。
  25. 如权利要求18-24任一所述的系统,其特征在于,所述处理器还用于:
    建立第二声纹与触摸屏上的所述第一输出位置的对应关系;
    接收第三语音,在判断所述第三语音的声纹与所述第二声纹匹配时,识别所述第三语音的内容,输出并显示所述第三语音的内容到所述第一输出位置的空白处。
  26. 如权利要求18-24任一所述的系统,其特征在于,所述处理器还用于:
    建立第二声纹与触摸屏上的所述第一输出位置的对应关系;
    接收第三语音,在判断所述第三语音的声纹与所述第二声纹匹配时,识别所述第三语音的内容,输出并显示所述第三语音的内容到所述第一输出位置,并覆盖所述第一语音的内容。
  27. 如权利要求25或26所述的系统,其特征在于,所述处理器在输出并显示所述第三语音的内容时,具体用于:以不同于所述输出并显示所述第一语音的内容时的格式输出并显示所述第三语音的内容。
  28. 一种计算机设备,其特征在于,所述计算机设备包括:语音处理单元和综合处理单元;其中,
    所述语音处理单元,用于接收第一语音,在判断所述第一语音的声纹与第一声纹匹配时,识别所述第一语音的内容;
    所述综合处理单元,用于建立所述第一声纹与第一输出位置的对应关系,还用于将所述第一语音的内容输出到所述第一输出位置。
  29. 如权利要求28所述的计算机设备,其特征在于,还包括触点处理单元;
    所述触点处理单元用于接收触点位置,在判断所述触点位置中的触点位置集合与预定的第一规则匹配时,根据所述触点位置确定所述第一输出位置;
    所述语音处理单元,还用于接收第二语音,在判断第二语音与预定的第一规则匹配时,从所述第二语音中提取第一声纹。
  30. 如权利要求29所述的计算机设备,其特征在于,所述触点处理单元在根据所述触点位置确定所述第一输出位置时,具体用于:根据所述触点位置中的触点位置集合确定所述第一输出位置的起始位置和范围。
  31. 如权利要求28所述的计算机设备,其特征在于,所述语音处理单元还用于计算所述第二语音的声源的位置;
    所述综合处理单元还用于在建立所述第一声纹与所述第一输出位置的对应关系之前,判断所述第二语音的声源的位置与所述触点位置是否满足预设的条件。
  32. 如权利要求28-31任一所述的计算机设备,其特征在于,所述综合处理单元还用于:解除所述第一声纹与所述第一输出位置的对应关系。
  33. A computer device, comprising a memory and one or more processors, the memory being coupled to the processors, wherein
    the memory is configured to store computer program code, the computer program code comprising computer instructions which, when executed by the computer device, cause the computer device to perform the human-computer interaction method according to any one of claims 1 to 9.
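Claims 28 to 32 describe the same behaviour as a decomposition into cooperating units. The sketch below maps them onto three small Python classes: a voice processing unit (voiceprint matching plus recognition), a touch point processing unit (rule check plus output-position derivation, reusing region_from_touch_points from the earlier sketch), and an integrated processing unit (holding the correspondence and writing recognized content to the position). The "at least min_points contacts" rule, the 0.8 similarity threshold and the display(text, position) callable are assumptions made purely for illustration.

```python
class VoiceProcessingUnit:
    """Extracts the first voiceprint from the second (enrollment) voice and, for a
    later first voice, recognizes its content only when the voiceprints match."""
    def __init__(self, asr, embed, similarity, threshold: float = 0.8):
        self.asr, self.embed, self.similarity, self.threshold = asr, embed, similarity, threshold
        self.first_voiceprint = None

    def enroll(self, second_voice) -> None:
        self.first_voiceprint = self.embed(second_voice)

    def recognize_if_match(self, first_voice):
        if self.first_voiceprint is None:
            return None
        if self.similarity(self.embed(first_voice), self.first_voiceprint) >= self.threshold:
            return self.asr(first_voice)
        return None


class TouchPointProcessingUnit:
    """Checks the touch point set against the first rule (illustrated here as
    'at least min_points contacts') and derives the first output position."""
    def __init__(self, min_points: int = 4):
        self.min_points = min_points

    def output_position(self, touch_points):
        if len(touch_points) < self.min_points:
            return None
        return region_from_touch_points(touch_points)   # bounding-box sketch from above


class IntegratedProcessingUnit:
    """Holds the voiceprint/output-position correspondence and writes recognized
    content to that position via an assumed display(text, position) callable."""
    def __init__(self, display):
        self.display = display
        self.position = None

    def bind(self, position) -> None:
        self.position = position

    def unbind(self) -> None:
        self.position = None            # release the correspondence (claim 32)

    def output(self, text) -> None:
        if text is not None and self.position is not None:
            self.display(text, self.position)
```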
PCT/CN2021/118906 2020-09-17 2021-09-17 Human-computer interaction method, apparatus and system WO2022057870A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21868692.1A EP4209864A4 (en) 2020-09-17 2021-09-17 HUMAN-COMPUTER INTERACTION METHOD, APPARATUS AND SYSTEM
US18/185,203 US20230224181A1 (en) 2020-09-17 2023-03-16 Human-Computer Interaction Method and System, and Apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010983742.6 2020-09-17
CN202010983742.6A CN114281182A (zh) 2020-09-17 2020-09-17 Human-computer interaction method, apparatus and system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/185,203 Continuation US20230224181A1 (en) 2020-09-17 2023-03-16 Human-Computer Interaction Method and System, and Apparatus

Publications (1)

Publication Number Publication Date
WO2022057870A1 true WO2022057870A1 (zh) 2022-03-24

Family

ID=80776496

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/118906 WO2022057870A1 (zh) 2020-09-17 2021-09-17 Human-computer interaction method, apparatus and system

Country Status (4)

Country Link
US (1) US20230224181A1 (zh)
EP (1) EP4209864A4 (zh)
CN (1) CN114281182A (zh)
WO (1) WO2022057870A1 (zh)

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8024185B2 (en) * 2007-10-10 2011-09-20 International Business Machines Corporation Vocal command directives to compose dynamic display text
CN101441539B (zh) * 2008-12-30 2013-06-12 Huawei Device Co., Ltd. Electronic whiteboard system, input apparatus, processing apparatus and processing method
WO2011099086A1 (ja) * 2010-02-15 2011-08-18 Toshiba Corporation Conference support apparatus
KR102003255B1 (ko) * 2012-06-29 2019-07-24 Samsung Electronics Co., Ltd. Method and apparatus for processing multiple inputs
CN104866745B (zh) * 2014-02-21 2018-12-14 Lenovo (Beijing) Co., Ltd. Electronic device and information processing method
CN107193391A (zh) * 2017-04-25 2017-09-22 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and apparatus for displaying text information on a screen
CN109309751B (zh) * 2017-07-28 2021-08-06 Tencent Technology (Shenzhen) Co., Ltd. Voice recording method, electronic device and storage medium
JP7023743B2 (ja) * 2018-02-28 2022-02-22 Sharp Corporation Information processing apparatus, information processing method, and program
US11152006B2 (en) * 2018-05-07 2021-10-19 Microsoft Technology Licensing, Llc Voice identification enrollment
CN110069608B (zh) * 2018-07-24 2022-05-27 Baidu Online Network Technology (Beijing) Co., Ltd. Voice interaction method, apparatus, device and computer storage medium
KR20200043902A (ko) * 2018-10-18 2020-04-28 Samsung Electronics Co., Ltd. Electronic device and control method of electronic device
CN109669662A (zh) * 2018-12-21 2019-04-23 Huizhou TCL Mobile Communication Co., Ltd. Voice input method, apparatus, storage medium and mobile terminal
CN110364156A (zh) * 2019-08-09 2019-10-22 Guangzhou Guoyin Intelligent Technology Co., Ltd. Voice interaction method, system, terminal and readable storage medium
CN110767226B (zh) * 2019-10-30 2022-08-16 Shanxi Jiansheng Technology Co., Ltd. High-accuracy sound source localization method and apparatus, voice recognition method and system, storage device and terminal
CN111259360B (zh) * 2020-02-14 2022-03-18 Gree Electric Appliances, Inc. of Zhuhai Touch screen state control method and apparatus for terminal device, and terminal device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231567A * 2007-01-24 2008-07-30 Beijing Samsung Telecommunication Technology Research Co., Ltd. Human-computer interaction method and system based on handwriting recognition, and device for running the system
CN101187990A * 2007-12-14 2008-05-28 South China University of Technology Conversational robot system
CN106776836A * 2016-11-25 2017-05-31 Nubia Technology Co., Ltd. Multimedia data processing apparatus and method
WO2019100738A1 * 2017-11-24 2019-05-31 iFLYTEK Co., Ltd. Multi-person-participation human-computer interaction method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4209864A4

Also Published As

Publication number Publication date
CN114281182A (zh) 2022-04-05
EP4209864A4 (en) 2024-03-06
EP4209864A1 (en) 2023-07-12
US20230224181A1 (en) 2023-07-13

Similar Documents

Publication Publication Date Title
US20240036815A1 (en) Portable terminal device and information processing system
US9547439B2 (en) Dynamically-positioned character string suggestions for gesture typing
EP2680110B1 (en) Method and apparatus for processing multiple inputs
KR102129374B1 (ko) Method for providing user interface, machine-readable storage medium, and portable terminal
US9324305B2 (en) Method of synthesizing images photographed by portable terminal, machine-readable storage medium, and portable terminal
US9965039B2 (en) Device and method for displaying user interface of virtual input device based on motion recognition
CN109725724B (zh) Gesture control method and apparatus for a device with a screen
CN104036476A (zh) Method for providing augmented reality and portable terminal
KR20130007956A (ko) Method and apparatus for controlling content using graphic object
US20140362002A1 (en) Display control device, display control method, and computer program product
US20150355740A1 (en) Touch panel system
US10761721B2 (en) Systems and methods for interactive image caricaturing by an electronic device
US11513655B2 (en) Simplified user interface generation
TW201512968A (zh) Apparatus and method for generating events by speech recognition
US9323367B2 (en) Automatic annotation de-emphasis
US20160034027A1 (en) Optical tracking of a user-guided object for mobile platform user input
KR20150043272A (ko) Voice control method for a video display device
WO2022057870A1 (zh) Human-computer interaction method, apparatus and system
JP2022547667A (ja) Human-computer interaction method, apparatus, and system
CN109491732A (zh) Virtual control display method and apparatus, and vehicle-mounted display screen
CN114168007A (zh) Electronic device, interaction method thereof, and readable medium
KR102278213B1 (ko) Portable device and screen control method of portable device
US11659077B2 (en) Mobile terminal and method for controlling the same
US11237671B1 (en) Temporal filter touch detection
CN114241471B (zh) Video text recognition method and apparatus, electronic device, and readable storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21868692; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2021868692; Country of ref document: EP; Effective date: 20230406)