CN110569806A - Man-machine interaction system - Google Patents
Man-machine interaction system
- Publication number
- CN110569806A (application CN201910856202.9A)
- Authority
- CN
- China
- Prior art keywords
- information
- voice
- image
- output
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/20—Scenes; Scene-specific elements in augmented reality scenes
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Acoustics & Sound (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Human Computer Interaction (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The invention belongs to the technical field of computers, and particularly relates to a human-computer interaction system. The human-computer interaction system comprises: an input module for acquiring image information and voice information of a user; a processing module, connected with the input module, comprising an image processing unit for processing the image information according to a preset image processing model and generating first output information, and a voice processing unit for processing the voice information according to a preset voice processing model and generating second output information; and an output module, connected with the processing module, for presenting the first output information and the second output information to the user. The human-computer interaction system provided by the invention can acquire and process the image information and voice information of the user and output reply information, thereby solving the problems that traditional human-computer interaction systems are inconvenient to operate and can hardly meet user requirements.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a human-computer interaction system.
Background
A robot is a machine that executes work automatically: it can accept human commands, run pre-programmed routines, and act according to principles formulated with artificial intelligence technology. The task of a robot is to assist or replace human work, and robots are widely used in manufacturing, construction, and other high-risk work.
When a robot, especially an intelligent robot, is used, the human-computer interaction system is the bridge between the robot and the human: the human inputs control commands to the robot through the human-computer interaction system, and the robot outputs corresponding feedback through the human-computer interaction system according to the commands input by the human.
Traditional human-computer interaction systems rely on key input, which is inconvenient to operate and cannot meet users' growing demand for convenience.
Disclosure of Invention
The embodiment of the invention aims to provide a human-computer interaction system, so as to solve the problems that a traditional human-computer interaction system depends on key input, is inconvenient to operate, and cannot meet users' growing demand for convenience.
The embodiment of the invention is realized in such a way that a human-computer interaction system comprises:
The input module is used for acquiring image information and voice information of a user;
The processing module is connected with the input module and comprises an image processing unit and a voice processing unit, the image processing unit is used for processing the image information according to a preset image processing model and generating first output information, and the voice processing unit is used for processing the voice information according to a preset voice processing model and generating second output information;
And the output module is connected with the processing module and used for displaying the first output information and the second output information to a user.
According to the human-computer interaction system provided by the embodiment of the invention, the image information and voice information of the user can be acquired through the input module, processed by the processing module to generate corresponding reply information, and the generated reply information can be displayed to the user through the output module. Through this arrangement, the human-computer interaction system provided by the embodiment of the invention overcomes the operational inconvenience of key input in traditional human-computer interaction systems and can meet users' growing demand for operational convenience.
Drawings
FIG. 1 is a block diagram of a human-computer interaction system according to an embodiment of the present invention;
FIG. 2 is a block diagram of a human-computer interaction system according to another embodiment of the present invention;
FIG. 3 is a block diagram of a human-computer interaction system according to another embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating an image processing flow of an image processing unit according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating an image processing flow of an image processing unit according to another embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a speech processing flow of a speech processing unit according to an embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating a speech processing flow of a speech processing unit according to another embodiment of the present invention;
FIG. 8 is a block diagram of a human-computer interaction system including a database according to an embodiment of the present invention;
FIG. 9 is a block diagram of a human-computer interaction system including a scene recognition module according to an embodiment of the present invention;
FIG. 10 is a block diagram of a human-computer interaction system according to still another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
It will be understood that the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms unless otherwise specified. These terms are only used to distinguish one element from another. For example, a first xx script may be referred to as a second xx script, and similarly, a second xx script may be referred to as a first xx script, without departing from the scope of the present application.
FIG. 1 shows a schematic structural diagram of a human-computer interaction system provided in an embodiment of the present invention, which may specifically include:
an input module 101, configured to acquire image information and voice information of a user;
the processing module 102 is connected to the input module 101, and includes an image processing unit 1021 and a voice processing unit 1022, where the image processing unit 1021 is configured to process the image information according to a preset image processing model and generate first output information, and the voice processing unit 1022 is configured to process the voice information according to a preset voice processing model and generate second output information;
And the output module 103 is connected with the processing module 102, and is used for displaying the first output information and the second output information to a user.
In the embodiment of the present invention, the input module 101 is configured to acquire image information or voice information of a user. It should be understood that acquisition in the embodiment of the present invention may be automatic or triggered by a user operation. For example, for image information, a camera may be combined with a sensing device so that image acquisition is triggered automatically when a user is detected in the image acquisition area; alternatively, an instruction input unit may be provided, and the camera is triggered to capture an image when an image input instruction from the user is received. For voice information, likewise, a voice acquisition device may be combined with a sensing device so that voice acquisition is triggered automatically when a user is detected in the voice acquisition area; alternatively, an instruction input unit may be provided, and the voice acquisition device is triggered to capture voice when a voice input instruction from the user is received. In the embodiment of the present invention, the image information includes, but is not limited to, a face image, a pupil image, a fingerprint image, a certificate image, and the like of the user; the instruction input unit may be a key, a touch screen, or another device capable of determining whether a user has performed an operation, which is not specifically limited in the embodiment of the present invention.
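As an illustration of the two acquisition modes described above, the following Python sketch shows sensor-triggered and instruction-triggered capture side by side; the Sensor and Camera classes and their methods are hypothetical stand-ins, not components named in this disclosure:

```python
# Hypothetical sketch of the two acquisition modes described above:
# automatic (sensor-triggered) capture and instruction-triggered capture.
class Sensor:
    def user_in_range(self) -> bool:
        """Return True when a user is detected in the acquisition area."""
        raise NotImplementedError

class Camera:
    def capture(self) -> bytes:
        """Capture one frame and return it as raw image bytes."""
        raise NotImplementedError

def acquire_image(sensor: Sensor, camera: Camera,
                  instruction_received: bool = False):
    # Mode 1: automatic acquisition when the sensing device detects a user
    # inside the image acquisition area.
    if sensor.user_in_range():
        return camera.capture()
    # Mode 2: acquisition triggered by an explicit image input instruction
    # (e.g. a key press or a touch on the instruction input unit).
    if instruction_received:
        return camera.capture()
    return None  # no user detected and no instruction received
```

The voice path would be analogous, with a voice acquisition device in place of the camera.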
In the embodiment of the present invention, the processing module 102 includes an image processing unit 1021 and a voice processing unit 1022. The image processing unit 1021 is used for processing the acquired image information and generating first output information; in the embodiment of the invention, image processing comprises at least two processes, recognition of the image information and generation of a corresponding output according to the recognition result. The voice processing unit 1022 is configured to process the acquired voice information and generate second output information; likewise, voice processing comprises at least two processes, recognition of the voice information and generation of a corresponding output according to the recognition result. It should be understood that, in the nomenclature of the embodiment of the present invention, the first output information and the second output information differ only in whether they derive from the image processing result or the voice processing result; both can be one or more of image, voice, indicator light, and other types of information. Furthermore, for specific applications, the first output information and the second output information may also be control instructions for controlling the movement of a movable component of the system, such as a walking control instruction of a robot; the embodiment of the present invention is not limited thereto.
In this embodiment, the output module 103 may be a display screen, a voice output device, an indicator light, or the like. Further, depending on the specific application object, the output module 103 may also take the form of various movable components; for example, a robot may respond to a user's question about orientation by moving a mechanical arm, or lead the user along a route by controlling a movable component. These are specific application forms of the output module 103, and the present invention is not limited thereto.
The human-computer interaction system provided by the embodiment of the invention can acquire the image information and voice information of the user through the input module 101, process them through the processing module 102 to generate corresponding reply information, and display the generated reply information to the user through the output module 103. Through this arrangement, the system overcomes the operational inconvenience of key input in traditional human-computer interaction systems and can meet users' growing demand for operational convenience.
In one embodiment of the present invention, as shown in FIG. 2, the input module 101 includes an image input unit 1011 and a voice input unit 1012;
The image input unit 1011 is used for acquiring image information of a user;
the voice input unit 1012 is used to acquire voice information of a user.
In the embodiment of the present invention, the image input unit 1011 may take the specific form of a camera, a scanning device, or the like, and the voice input unit 1012 may take the specific form of a microphone, a headset with a microphone, or the like. The number and specific form of the image input units 1011 and voice input units 1012 are not limited in the embodiment of the present invention and may be set as needed.
The human-computer interaction system provided by the embodiment of the invention can acquire both the image information and the voice information of the user through the input module 101, broadening the user's input modes; it can adapt to different user needs and is convenient and practical.
In one embodiment of the present invention, as shown in FIG. 3, the input module 101 further includes an operation unit 1013;
the operation unit 1013 is configured to receive an operation by a user and generate corresponding input information according to the operation by the user;
The processing module 102 further includes an operation information processing unit, where the operation information processing unit is configured to process the operation information according to a preset rule and generate corresponding third output information;
the output module 103 is further configured to display the third output information to a user.
In this embodiment of the present invention, the operation unit 1013 may be a key, a touch screen, or another device capable of determining whether a user has performed an operation, such as a gravity sensing device, a pressure sensing device, or an infrared sensing device. Correspondingly, the operation information may be key information, touch information of the touch screen, or information generated by gravity sensing, pressure sensing, infrared sensing, and the like, which is not specifically limited in this embodiment of the present invention. In the embodiment of the present invention, the processing module 102 further includes an operation information processing unit for processing the operation information input by the user through the operation unit 1013 and generating third output information. It should be understood that the designation "third output information" is only used to distinguish the source of the information; it may take the form of image, voice, indicator light, or other system control information.
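One possible reading of "processing the operation information according to a preset rule" is a simple rule table mapping operation events to third output information. The sketch below is an assumption about one such realization; the event names and replies are invented:

```python
# Hypothetical rule table mapping operation events (key press, touch,
# gravity/pressure/infrared sensing) to third output information.
PRESET_RULES = {
    "key:confirm":   {"type": "voice", "payload": "Confirmed."},
    "touch:menu":    {"type": "image", "payload": "menu_screen.png"},
    "infrared:near": {"type": "voice", "payload": "Welcome! How can I help?"},
}

def process_operation(event: str) -> dict:
    """Return third output information for a user operation event."""
    return PRESET_RULES.get(
        event, {"type": "voice", "payload": "Sorry, unrecognized operation."}
    )

print(process_operation("key:confirm"))  # {'type': 'voice', 'payload': 'Confirmed.'}
```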
By providing the operation unit 1013, the human-computer interaction system of the embodiment of the invention can accept operation input from the user, which suits more complex instructions better than image or voice input.
In an embodiment of the present invention, as shown in FIG. 4, the image processing unit is configured to process the image information according to a preset image processing model and generate first output information, which specifically includes the following steps:
Extracting characteristic information of the image information by using an image recognition algorithm;
processing the characteristic information by using a convolutional neural network model, and performing multi-pattern matching on the processing result to generate reply information corresponding to the image information;
and generating a reply voice as the first output information according to the reply information corresponding to the image information and transmitting the reply voice to the output module.
In the embodiment of the invention, the feature information of the image information is first extracted by an image recognition algorithm. The feature information is the information in the image from which the content the image points to, or the user's real intention, can be inferred; for the same image, the obtained feature information can differ depending on the recognition algorithm. The recognition algorithm may be selected from various existing algorithms; for example, the Speeded-Up Robust Features (SURF) algorithm, the Scale-Invariant Feature Transform (SIFT) algorithm, and the like may be adopted.
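As a concrete, non-limiting example of this feature extraction step, the sketch below uses OpenCV's SIFT implementation (cv2.SIFT_create, available in opencv-python 4.4+); SURF would be analogous but ships only with the opencv-contrib package:

```python
# Feature extraction with SIFT via OpenCV: one possible realization of the
# "extract feature information with an image recognition algorithm" step.
import cv2

def extract_features(image_path: str):
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(image_path)
    sift = cv2.SIFT_create()
    # keypoints carry location/scale/orientation; descriptors are 128-d
    # float vectors later fed to the matching stage.
    keypoints, descriptors = sift.detectAndCompute(image, None)
    return keypoints, descriptors
```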
In the embodiment of the present invention, the extracted feature information may be processed by a convolutional neural network model to determine the keyword corresponding to the feature information; then, according to the determined keyword, the corresponding reply information in the preset database 104 is determined by a multi-pattern matching algorithm and transmitted to the output module 103 as the first output information. It should be understood that the first output information may be output in the form of an image, voice, an indicator light, etc., and may also be output in the form of control instructions of the system, for example, control instructions for a robot arm.
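A minimal sketch of this keyword-to-reply step follows. A production system might use an Aho-Corasick automaton for multi-pattern matching; here a plain dictionary scan stands in for the preset database 104, and all keywords and replies are invented for illustration:

```python
# Hypothetical multi-pattern matching: scan the keywords predicted by the
# CNN against a preset table of (keyword -> reply) entries standing in for
# database 104.
REPLY_DATABASE = {
    "id_card":     "Your identity document has been recognized.",
    "face":        "Face image received. Verifying...",
    "fingerprint": "Fingerprint image received. Matching...",
}

def match_reply(keywords: list) -> str:
    for kw in keywords:
        if kw in REPLY_DATABASE:
            return REPLY_DATABASE[kw]
    return "Sorry, I could not understand the image."

# Example: keywords predicted by the CNN for a certificate photo.
print(match_reply(["id_card"]))  # -> "Your identity document has been recognized."
```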
In the human-computer interaction system provided by the embodiment of the invention, processing the image information with an image recognition algorithm and a convolutional neural network model allows the key information in the image to be extracted more accurately and a corresponding reply to be generated from the processing result, so the system can process image information more accurately and quickly, improving the accuracy of understanding the user's intention.
In an embodiment of the present invention, as shown in FIG. 5, the processing of the feature information by the convolutional neural network model further includes the following steps:
Generating recommendation information corresponding to the image information by using a data recommendation system according to a processing result;
and generating question voice according to the recommendation information corresponding to the image information and transmitting the question voice to the output module.
In the embodiment of the invention, after the image information input by the user is acquired, besides generating corresponding reply information from the image processing result, corresponding recommendation information can also be generated from the processing result by a data recommendation system. In the embodiment of the present invention, the recommendation information is derived from a preset database; common recommendation algorithms include recommendation algorithms based on static information, recommendation algorithms based on features, recommendation algorithms based on content items, deep learning algorithms, and the like, and the specific algorithm is not specifically limited here.
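For concreteness, the sketch below implements one of the algorithm families named above: a content-based recommender that ranks preset items by cosine similarity to a feature vector derived from the processing result. The item names and feature vectors are invented:

```python
# Minimal content-based recommendation sketch: rank preset items by
# cosine similarity between their feature vectors and the query vector
# produced by the processing stage.
import math

ITEM_FEATURES = {
    "travel_guide":  [1.0, 0.2, 0.0],
    "hotel_booking": [0.9, 0.1, 0.3],
    "insurance":     [0.0, 1.0, 0.5],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(query_vec, top_k=2):
    """Return the top_k item names most similar to the query vector."""
    ranked = sorted(ITEM_FEATURES,
                    key=lambda name: cosine(query_vec, ITEM_FEATURES[name]),
                    reverse=True)
    return ranked[:top_k]

print(recommend([1.0, 0.1, 0.1]))  # -> ['travel_guide', 'hotel_booking']
```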
In the embodiment of the present invention, it may be understood that the recommendation information may be presented to the user in specific forms of image information, voice information, indicator light control information, and the like, which is not specifically limited in the embodiment of the present invention.
The human-computer interaction system provided by the embodiment of the invention can process the image information to generate a corresponding reply and can generate related recommendation information according to the processing result, thereby providing more thorough and comprehensive service for the user.
In an embodiment of the present invention, as shown in FIG. 6, the speech processing unit is configured to process the speech information according to a preset speech processing model and generate second output information, which specifically includes the following steps:
converting the voice information into text information;
processing the text information by using a convolutional neural network model, and performing multi-pattern matching on the processing result to generate reply information corresponding to the voice information;
and generating reply voice as the second output information according to the reply information corresponding to the voice information, and transmitting the reply voice to the output module.
In the embodiment of the present invention, the speech information is first converted into text information by a speech recognition algorithm; the text information is then processed by the convolutional neural network model to determine the keyword corresponding to the text information, and, according to the determined keyword, the corresponding reply information in the preset database 104 is determined by a multi-pattern matching algorithm and transmitted to the output module 103 as the second output information. It should be understood that the second output information may be output in the form of an image, voice, an indicator light, etc., and may also be output in the form of control instructions of the system, for example, control instructions for a robot arm. In the embodiment of the present invention, processing the text information with the convolutional neural network model is preceded by extraction of the feature information of the text information; the extraction algorithm belongs to the prior art and is not described here again.
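The convolutional step can be pictured with a skeletal text CNN. The following PyTorch sketch is an assumption about one possible architecture; the vocabulary size, dimensions, and number of intent classes are invented, and the network is untrained:

```python
# Skeletal text CNN for keyword/intent prediction over tokenized text,
# illustrating the "process the text information with a convolutional
# neural network model" step.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 32, kernel_size=3, padding=1)
        self.fc = nn.Linear(32, num_classes)

    def forward(self, token_ids):        # (batch, seq_len)
        x = self.embed(token_ids)        # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)            # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))     # (batch, 32, seq_len)
        x = x.max(dim=2).values          # global max pooling -> (batch, 32)
        return self.fc(x)                # logits over intent classes

model = TextCNN()
tokens = torch.randint(0, 5000, (1, 12))  # one 12-token utterance
logits = model(tokens)                    # shape (1, 4)
```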
In the human-computer interaction system provided by the embodiment of the invention, processing the voice information with speech recognition and a convolutional neural network model allows the key information in the voice to be extracted more accurately and a corresponding reply to be generated from the processing result, so the system can process voice information more accurately and quickly, improving the accuracy of understanding the user's intention.
In an embodiment of the present invention, as shown in FIG. 7, the processing of the text information by the convolutional neural network model further includes the following steps:
Generating recommendation information corresponding to the voice information by using a data recommendation system according to a processing result;
and generating question voice according to the recommendation information corresponding to the voice information and transmitting the question voice to the output module.
In the embodiment of the invention, after the voice information input by the user is acquired, besides generating corresponding reply information from the speech processing result, corresponding recommendation information can also be generated from the processing result by the data recommendation system. In the embodiment of the present invention, the recommendation information is derived from the preset database 104; common recommendation algorithms include recommendation algorithms based on static information, recommendation algorithms based on features, recommendation algorithms based on content items, deep learning algorithms, and the like, and the specific algorithm is not specifically limited in this embodiment of the present invention.
In the embodiment of the present invention, it may be understood that the recommendation information may be presented to the user in specific forms of image information, voice information, indicator light control information, and the like, which is not specifically limited in the embodiment of the present invention.
The human-computer interaction system provided by the embodiment of the invention can process the voice information to generate a corresponding reply and can generate related recommendation information according to the processing result, thereby providing more thorough and comprehensive service for the user.
In an embodiment of the present invention, as shown in FIG. 8, the system further includes a database 104, where the database 104 includes an image database 1041, a corpus database 1042, and a knowledge database 1043;
The image database 1041 stores image data for providing a matching image for the image processing unit;
The corpus database 1042 stores corpus data for providing matching corpus for the voice processing unit;
The knowledge database 1043 stores preset keywords and the correspondence between keywords, as well as the correspondence between items of image information, and is used for providing raw data for the processing module 102 to generate recommendation information.
In the embodiment of the present invention, it should be understood that the data in the image database 1041 and the corpus database 1042 may be stored as images and corpus entries respectively, or as the feature information corresponding to images and corpus entries respectively; the latter involves a smaller data volume and lower storage requirements. When the processing module 102 processes acquired image information or voice information, it matches the acquired information against the data pre-stored in the image database 1041 and the corpus database 1042; this matching may be used to determine whether the acquired information belongs to existing data, or to determine its type. For example, a piece of acquired image information can be judged to be certificate-type data by comparison with the image database 1041, from which it can be inferred that the user's intention may be authentication, a login request, and the like. As another example, after a piece of voice data is converted to text and several keywords are extracted from it, matching against the corpus database 1042 may show that the keywords relate to "direction", from which it can be inferred that the user's intention may be position confirmation, requesting navigation, a location query, and the like.
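Continuing the earlier SIFT example, the comparison against the image database 1041 can be sketched as descriptor matching with Lowe's ratio test. The database labels below are invented, and a real deployment would index the stored descriptors rather than scanning them linearly:

```python
# Hypothetical image-database comparison: match the descriptors of an
# acquired image against pre-stored descriptors (the lighter of the two
# storage options mentioned above) to classify it, e.g. as certificate-type.
import cv2
import numpy as np

def best_database_match(query_desc: np.ndarray, database: dict):
    """Return the label of the stored descriptor set matching the query best."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    best_label, best_good = None, 0
    for label, stored_desc in database.items():
        matches = matcher.knnMatch(query_desc, stored_desc, k=2)
        # Lowe's ratio test keeps only distinctive matches.
        good = sum(1 for pair in matches
                   if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance)
        if good > best_good:
            best_label, best_good = label, good
    return best_label
```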
In this embodiment of the present invention, the knowledge database 1043 is configured to store the correspondence between keywords, and between items of image information. Through these preset correspondences, when image information or voice information is acquired, the category it belongs to and the recommendation information related to it can be determined, and corresponding recommendation information or reply information can be generated accordingly.
The human-computer interaction system provided by the embodiment of the invention further comprises a database 104, which can provide a comparison reference for processing image information and voice information, and can also generate recommendation information using the correspondence between keywords stored in the knowledge database, thereby providing more thorough and comprehensive service for the user.
In an embodiment of the present invention, as shown in FIG. 9, the system further includes a scene recognition module, where the scene recognition module is configured to determine a session scene according to the image information and/or the voice information;
the image processing unit 1021 is specifically configured to process the image information according to the determination result of the scene recognition module and a preset image processing model, and generate the first output information;
the voice processing unit 1022 is specifically configured to process the voice information according to the determination result of the scene recognition module and a preset voice processing model, and generate the second output information.
In the embodiment of the invention, the scene recognition module is used for judging the session scene. Because replies, questions, and the like are directly related to the scene, judging the scene of the conversation improves the pertinence of a reply or question. It should be understood that, in the embodiment of the present invention, the session scene does not refer to the specific place where the conversation occurs, but to the topic of the conversation, such as the specific scenes of navigation, shopping, price, after-sale service, and the like.
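A minimal sketch of such topic-level scene recognition follows, scoring each preset scene against keywords extracted from the image and/or voice information; the scene vocabularies are invented for illustration:

```python
# Hypothetical scene recognition: score each session topic against the
# keywords seen so far and condition later replies on the winner.
SCENE_KEYWORDS = {
    "navigation": {"where", "route", "floor", "exit"},
    "shopping":   {"buy", "brand", "size", "color"},
    "price":      {"cost", "discount", "cheap", "price"},
    "after_sale": {"refund", "repair", "return", "warranty"},
}

def recognize_scene(keywords: set, default: str = "navigation") -> str:
    """Return the session scene whose vocabulary overlaps the keywords most."""
    scores = {scene: len(keywords & vocab)
              for scene, vocab in SCENE_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

print(recognize_scene({"refund", "receipt"}))  # -> 'after_sale'
```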
The human-computer interaction system provided by the embodiment of the invention also comprises a scene identification module, and the intention of the user can be judged more quickly and accurately by identifying the scene of the session, so that corresponding reply can be made and corresponding information can be recommended according to the intention of the user, the pertinence of the reply can be improved, and the user experience can be improved.
In one embodiment, as shown in FIG. 10, the output module 103 includes an image output unit 1031 and a voice output unit 1032;
the image output unit 1031 is configured to output an image included in the first output information and/or the second output information;
The voice output unit 1032 is configured to output a voice included in the first output information and/or the second output information.
In the embodiment of the present invention, the image output unit may be a display screen, an indicator light array, or the like, and the voice output unit may be a speaker, an earphone, or the like. The specific arrangement, form, and number of the image output units 1031 and voice output units 1032 are not particularly limited in the embodiment of the present invention.
The output module 103 of the human-computer interaction system provided by the embodiment of the invention comprises an image output unit and a voice output unit, which can be used to output images and voice respectively; this enables more convenient and direct human-machine communication and suits different user groups.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least a portion of the steps in various embodiments may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; nevertheless, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (10)
1. A human-computer interaction system, characterized in that the system comprises:
The input module is used for acquiring image information and voice information of a user;
The processing module is connected with the input module and comprises an image processing unit and a voice processing unit, the image processing unit is used for processing the image information according to a preset image processing model and generating first output information, and the voice processing unit is used for processing the voice information according to a preset voice processing model and generating second output information;
And the output module is connected with the processing module and used for displaying the first output information and the second output information to a user.
2. The human-computer interaction system of claim 1, wherein the input module comprises an image input unit and a voice input unit;
The image input unit is used for acquiring image information of a user;
The voice input unit is used for acquiring voice information of a user.
3. A human-computer interaction system according to claim 1 or 2, wherein the input module further comprises an operation unit;
the operation unit is used for receiving the operation of a user and generating corresponding input information according to the operation of the user;
The processing module further comprises an operation information processing unit, and the operation information processing unit is used for processing the operation information according to a preset rule and generating corresponding third output information;
the output module is further configured to display the third output information to a user.
4. The human-computer interaction system of claim 1, wherein the image processing unit is configured to process the image information according to a preset image processing model and generate first output information, and specifically includes the following steps:
Extracting characteristic information of the image information by using an image recognition algorithm;
Processing the characteristic information by using a convolutional neural network model, and performing multi-pattern matching on the processing result to generate reply information corresponding to the image information;
And generating a reply voice as the first output information according to the reply information corresponding to the image information and transmitting the reply voice to the output module.
5. The human-computer interaction system of claim 4, wherein the processing of the characteristic information by the convolutional neural network model further comprises the following steps:
generating recommendation information corresponding to the image information by using a data recommendation system according to a processing result;
And generating question voice according to the recommendation information corresponding to the image information and transmitting the question voice to the output module.
6. The human-computer interaction system according to claim 1, wherein the speech processing unit is configured to process the speech information according to a preset speech processing model and generate second output information, and specifically includes the following steps:
Converting the voice information into text information;
Processing the text information by using a convolutional neural network model, and performing multi-pattern matching on the processing result to generate reply information corresponding to the voice information;
and generating reply voice as the second output information according to the reply information corresponding to the voice information and transmitting the reply voice to the output module.
7. The human-computer interaction system of claim 6, wherein the processing of the text information using the convolutional neural network model further comprises the following steps:
Generating recommendation information corresponding to the voice information by using a data recommendation system according to a processing result;
and generating question voice according to the recommendation information corresponding to the voice information and transmitting the question voice to the output module.
8. The human-computer interaction system of claim 1, wherein the system further comprises a database, the database comprising an image database, a corpus database, and a knowledge database;
The image database stores image data and is used for providing a matching image for the image processing unit;
The corpus database stores corpus data and is used for providing matching corpus for the voice processing unit;
The knowledge database stores preset keywords and the correspondence between keywords, as well as the correspondence between items of image information, and is used for providing original data for the processing module to generate the recommendation information.
9. The human-computer interaction system of claim 1, further comprising a scene recognition module for determining a session scene according to the image information and/or the voice information;
the image processing unit is specifically configured to process the image information according to the judgment result of the scene recognition module and a preset image processing model and generate the first output information;
the voice processing unit is specifically configured to process the voice information according to the judgment result of the scene recognition module and a preset voice processing model and generate the second output information.
10. The human-computer interaction system of claim 1, wherein the output module comprises an image output unit and a voice output unit;
the image output unit is used for outputting the image contained in the first output information and/or the second output information;
the voice output unit is used for outputting voice contained in the first output information and/or the second output information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910856202.9A CN110569806A (en) | 2019-09-11 | 2019-09-11 | Man-machine interaction system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910856202.9A CN110569806A (en) | 2019-09-11 | 2019-09-11 | Man-machine interaction system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110569806A true CN110569806A (en) | 2019-12-13 |
Family
ID=68779302
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910856202.9A Pending CN110569806A (en) | 2019-09-11 | 2019-09-11 | Man-machine interaction system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110569806A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114157998A (en) * | 2021-11-29 | 2022-03-08 | 杭州数海掌讯信息科技有限责任公司 | Intelligent information intercommunication system based on 5G message |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104951077A (en) * | 2015-06-24 | 2015-09-30 | 百度在线网络技术(北京)有限公司 | Man-machine interaction method and device based on artificial intelligence and terminal equipment |
CN105913039A (en) * | 2016-04-26 | 2016-08-31 | 北京光年无限科技有限公司 | Visual-and-vocal sense based dialogue data interactive processing method and apparatus |
CN107278302A (en) * | 2017-03-02 | 2017-10-20 | 深圳前海达闼云端智能科技有限公司 | A kind of robot interactive method and interaction robot |
CN108803879A (en) * | 2018-06-19 | 2018-11-13 | 驭势(上海)汽车科技有限公司 | A kind of preprocess method of man-machine interactive system, equipment and storage medium |
CN109146610A (en) * | 2018-07-16 | 2019-01-04 | 众安在线财产保险股份有限公司 | It is a kind of intelligently to insure recommended method, device and intelligence insurance robot device |
CN109421044A (en) * | 2017-08-28 | 2019-03-05 | 富泰华工业(深圳)有限公司 | Intelligent robot |
CN109545232A (en) * | 2019-01-21 | 2019-03-29 | 美的集团武汉制冷设备有限公司 | Information-pushing method, information push-delivery apparatus and interactive voice equipment |
CN109754816A (en) * | 2017-11-01 | 2019-05-14 | 北京搜狗科技发展有限公司 | A kind of method and device of language data process |
CN109831572A (en) * | 2018-12-14 | 2019-05-31 | 深圳壹账通智能科技有限公司 | Chat picture control method, device, computer equipment and storage medium |
CN109981910A (en) * | 2019-02-22 | 2019-07-05 | 中国联合网络通信集团有限公司 | Business recommended method and apparatus |
CN110070865A (en) * | 2019-04-03 | 2019-07-30 | 北京容联易通信息技术有限公司 | A kind of guidance robot with voice and image identification function |
CN110154056A (en) * | 2019-06-17 | 2019-08-23 | 常州摩本智能科技有限公司 | Service robot and its man-machine interaction method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3477519B1 (en) | Identity authentication method, terminal device, and computer-readable storage medium | |
US10395653B2 (en) | Voice dialog device and voice dialog method | |
CN110517685B (en) | Voice recognition method and device, electronic equipment and storage medium | |
CN109472213B (en) | Palm print recognition method and device, computer equipment and storage medium | |
CN104077516A (en) | Biometric authentication method and terminal | |
KR20180054407A (en) | Apparatus for recognizing user emotion and method thereof, and robot system using the same | |
JP2022510479A (en) | Video cutting method, video cutting device, computer equipment and storage medium | |
TWI679584B (en) | Human recognition method based on data fusion | |
US20220059080A1 (en) | Realistic artificial intelligence-based voice assistant system using relationship setting | |
CN110741377A (en) | Face image processing method and device, storage medium and electronic equipment | |
CN110517673B (en) | Speech recognition method, device, computer equipment and storage medium | |
US20190164566A1 (en) | Emotion recognizing system and method, and smart robot using the same | |
CN111179935A (en) | Voice quality inspection method and device | |
US20200210650A1 (en) | Systems and methods for generating a plain english interpretation of a legal clause | |
CN111881740B (en) | Face recognition method, device, electronic equipment and medium | |
CN110780956A (en) | Intelligent remote assistance method and device, computer equipment and storage medium | |
CN115658889A (en) | Dialogue processing method, device, equipment and storage medium | |
CN110569806A (en) | Man-machine interaction system | |
CN113571096B (en) | Speech emotion classification model training method and device, computer equipment and medium | |
CN113873088A (en) | Voice call interaction method and device, computer equipment and storage medium | |
CN117992587A (en) | Man-machine interaction method, device and medium based on large model | |
CN112836115A (en) | Information recommendation method and device, computer equipment and storage medium | |
CN111883112B (en) | Semantic recognition method and device based on multi-mode identification and computer equipment | |
CN113450169A (en) | Method and system for processing vehicle recommendation information, computer equipment and storage medium | |
CN116048329A (en) | Cursor display method and related equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20191213 |