CN110895931A - VR (virtual reality) interaction system and method based on voice recognition - Google Patents
VR (virtual reality) interaction system and method based on voice recognition
- Publication number
- CN110895931A
- Authority
- CN
- China
- Prior art keywords
- module
- voice
- user
- processing
- scene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G10L15/063—Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/1822—Parsing for meaning understanding
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L21/0208—Noise filtering
- G10L2015/223—Execution procedure of a spoken command
Abstract
The invention relates to the field of voice recognition systems and discloses a VR (virtual reality) interaction system based on voice recognition. The system comprises a cloud end and a VR peripheral end: the cloud end comprises a voice recognition module, a semantic recognition module, a scene processing module, a storage module and a communication module; the VR peripheral end comprises a display module, a voice input module, a voice output module and likewise a communication module. The invention also discloses a method for the VR interaction system based on voice recognition, comprising the following steps: constructing the knowledge base and dialog library; opening the cloud end and the VR peripheral end; wearing the VR peripheral; user input; and cloud processing. The method effectively overcomes the poor interactivity and strong sense of detachment of existing VR products, and achieves a more natural interactive experience between the user and virtual scene characters.
Description
Technical Field
The invention relates to the field of voice recognition systems, and in particular to a VR (virtual reality) interaction system and method based on voice recognition.
Background
VR, short for virtual reality, is an important direction of simulation technology. It brings together simulation technology, computer graphics, man-machine interface technology, multimedia technology, sensing technology, network technology and others, and is a challenging interdisciplinary frontier of research. Virtual reality technology mainly covers the simulated environment, perception, natural skills and sensing devices. The simulated environment is a real-time, dynamic, three-dimensional realistic image generated by a computer. Perception means that an ideal VR system should provide every perception a person has: besides the visual perception generated by computer graphics, there are auditory, tactile, force and motion perceptions, and even smell and taste, collectively called multi-perception. Natural skills refer to head rotation, eye movement, gestures and other human actions; the computer processes data matching the participant's actions, responds to the user's input in real time, and feeds the results back to the user's senses. The sensing devices are three-dimensional interaction devices.
Virtual reality was proposed by the American company VPL Research in the 1980s. Its specific connotation is: a technology that comprehensively uses a computer graphics system and various real and control interface devices to provide an immersive sensation in an interactive three-dimensional environment generated on a computer. Virtual reality technology is a computer simulation system that creates and lets users experience a virtual world: the computer generates an interactive three-dimensional dynamic scene together with a simulation of entity behaviors based on multi-source information fusion, immersing the user in the environment.
VR technology has broad prospects in medicine, education, real estate and design. At present, VR interaction mainly relies on motion capture and gesture recognition, and the user experience is poor, so voice interaction has become a strong demand from users. Speech recognition techniques today largely fall into two directions: traditional acoustic models and deep learning models. The traditional technique, the acoustic model, builds a model by extracting the speaker's audio features under certain algorithms. The deep learning model has risen rapidly in recent years; a currently popular variant is the hidden Markov model based on a deep neural network, which learns a discriminative model from data. With continuing algorithmic progress and hardware upgrades, the advantages of deep learning models are increasingly obvious, so a speech recognition model based on deep learning is adopted here. Existing VR products based on speech recognition have poor interactivity and a strong sense of detachment, cannot achieve natural interaction between the user and virtual scene characters, and need improvement.
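The template-matching idea behind the traditional acoustic model mentioned above can be illustrated with a minimal sketch (the framing, the log-energy feature and the stored templates are hypothetical simplifications for illustration, not the patent's method): frame the audio, extract a crude per-frame feature, and pick the stored word template most similar to the input.

```python
import math

def frame_energies(samples, frame_len=160):
    """Per-frame log energy: a crude stand-in for the acoustic features
    (e.g. MFCCs) a real recognizer would extract."""
    feats = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        feats.append(math.log(energy + 1e-10))
    return feats

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match(feats, templates):
    """Template matching: return the stored word whose feature vector is
    most similar to the input features."""
    return max(templates, key=lambda w: cosine(feats, templates[w]))
```

A deep-learning recognizer replaces the hand-built feature/template comparison with a trained network, which is why it scales better as data and hardware grow.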
Disclosure of Invention
The present invention is directed to a VR interaction system and method based on speech recognition, so as to solve the problems in the background art.
In order to achieve the purpose, the invention provides the following technical scheme: a VR (virtual reality) interaction system based on voice recognition comprises a cloud end and a VR peripheral end. The cloud end comprises a voice recognition module, a semantic recognition module, a scene processing module, a storage module and a communication module; the VR peripheral end comprises a display module, a voice input module and a voice output module, and also comprises a communication module;
the voice recognition module performs primary processing of the user's voice: on the basis of the voice input module, it extracts voice features after noise reduction and dereverberation, then generates and validates a voice model through a deep-learning-based algorithm; the voice recognition module uses a plurality of algorithms and processing tools and is connected with the semantic recognition module;
the semantic recognition module performs further semantic processing on the output of the voice recognition module and deduces the user's intention; this stage analyzes the context to improve accuracy, and the semantic recognition module is connected with the scene processing module;
the scene processing module analyzes the recognition result of the semantic recognition module, adjusts the layout transformation of the scene accordingly, and outputs the result through the display module; this requires the module to call the knowledge base in the storage module for the relevant processing, and the scene processing module is connected with the storage module and the display module;
the storage module stores the knowledge base and the dialog library; the scene processing module calls the required entries according to the result of the previous step, the dialog is output through the voice output module, and the knowledge base result through the display module;
the voice input module comprises a plurality of audio input devices and is connected with the voice output module;
the voice output module carries out voice output on the result in the storage module;
the communication module is responsible for communication among the peripheral devices.
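The preprocessing chain the voice recognition module describes (noise reduction before feature extraction) might be sketched as follows. This is a hedged illustration only: a real system would use spectral denoising and dereverberation, whereas here a pre-emphasis filter and a crude energy gate stand in for them, and all names are invented.

```python
def pre_emphasis(samples, alpha=0.97):
    """High-pass pre-emphasis commonly applied before feature extraction."""
    return [samples[0]] + [samples[i] - alpha * samples[i - 1]
                           for i in range(1, len(samples))]

def noise_gate(samples, frame_len=160, threshold=0.01):
    """Keep only frames whose mean energy reaches the threshold: a very
    crude noise-reduction step standing in for real spectral denoising."""
    kept = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        if sum(s * s for s in frame) / frame_len >= threshold:
            kept.extend(frame)
    return kept
```

The gated, emphasized signal would then be framed and fed to the feature extractor and the deep-learning model described above.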
Preferably, the audio input device of the voice input module comprises a microphone.
Preferably, the device of the voice output module comprises an earphone power amplifier.
A method for the VR interaction system based on voice recognition comprises the following steps:
constructing the knowledge base and dialog library: first, store the corresponding dialog database in the storage module;
opening the cloud end and the VR peripheral end: after both are opened, verify that the communication module works normally;
wearing the VR peripheral: the user perceives the virtual scene after putting on the VR peripheral;
user input: the user inputs voice through the audio input peripheral, either following the prompts of the virtual scene or actively;
cloud processing: after processing at the cloud, the user receives response information through the earphone at the VR peripheral end, and at the same time sees the response actions and expressions of the virtual scene on the display device of the display module.
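The five steps above can be sketched end to end as a toy session (the dialog library entries, the intent keys, and the text-based stand-in for speech recognition are illustrative assumptions; the patent's cloud performs real voice and semantic recognition):

```python
def build_dialog_library():
    # Step 1: knowledge base / dialog library; the entries are illustrative
    return {
        "hello": ("Hello, traveler!", "wave"),
        "open door": ("The door creaks open.", "door_opens"),
    }

def handshake(cloud_ready, peripheral_ready):
    # Step 2: both ends opened and the communication module confirmed normal
    return cloud_ready and peripheral_ready

def cloud_process(utterance, library):
    # Steps 4-5: the utterance is "recognized" (here it is already text),
    # its intent is looked up, and the reply (for the earphone) plus the
    # scene action (for the display module) are returned
    return library.get(utterance.lower().strip(),
                       ("Sorry, say that again?", "idle"))

def run_session(utterances):
    library = build_dialog_library()            # step 1
    if not handshake(True, True):               # step 2
        raise RuntimeError("communication module not ready")
    # step 3 (wearing the peripheral) has no software counterpart here
    return [cloud_process(u, library) for u in utterances]
```

Each returned pair corresponds to the dual feedback the method describes: the first element goes to the voice output module, the second to the display module.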
Preferably, the method is applied as follows: in use, the user inputs audio through an input device such as a microphone; the audio is transmitted to the cloud, where the voice recognition module performs voice recognition to obtain the user information preliminarily, and the semantic recognition module then performs semantic recognition so that the cloud understands the user's instruction and deduces the user's intention; the instruction is then processed in the scene processing module according to that intention, and the result is transmitted to the display module. Meanwhile, the knowledge base in the storage module is called and the corresponding result is returned to the VR peripheral end, where the user watches the scene through the display module and listens to the dialog information through the audio device of the voice output module, obtaining dual visual and auditory feedback and stronger immersion.
Compared with the prior art, the invention has the following beneficial effects. The user communicates with the character in the virtual scene through the voice input module on the VR peripheral end, such as a microphone. The voice recognition module first removes interference factors from the surrounding environment by noise reduction, dereverberation and the like, then extracts voice features and performs analysis and modeling through a deep-learning algorithm based on a deep neural network to generate a voice model; it then compares and recognizes the voice information input by the user, parsing out its content and instruction information. On this basis, the semantic recognition module performs NLP word segmentation, keyword analysis and the like, combining the context to deduce the user's probable intention. The scene processing module then performs scene processing according to the result of the semantic recognition module: it calls the knowledge base in the storage module to produce the corresponding scene response, including graphics adjustment and context processing, and feeds back the action transformation of the virtual character as output. The processing results are transmitted to the display module on the VR peripheral end through a data line or other means, and the corresponding voice information and actions are output at the same time, so that dual perception of vision and hearing is achieved and the user's immersion is greatly enhanced.
The method effectively overcomes the poor interactivity and strong sense of detachment of existing VR products, and achieves a more natural interactive experience between the user and virtual scene characters.
Drawings
Fig. 1 is a schematic diagram of a module structure according to the present invention.
In the figure: 1. a cloud end; 2. a VR peripheral end; 3. a voice recognition module; 4. a semantic recognition module; 5. a scene processing module; 6. a storage module; 7. a display module; 8. a voice input module; 9. and a voice output module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "disposed," "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
Referring to fig. 1, the present invention provides a technical solution: a VR interaction system based on voice recognition comprises a cloud end 1 and a VR peripheral end 2. The cloud end 1 comprises a voice recognition module 3, a semantic recognition module 4, a scene processing module 5, a storage module 6 and a communication module; the VR peripheral end 2 comprises a display module 7, a voice input module 8 and a voice output module 9, and also comprises a communication module;
the voice recognition module 3 performs primary processing of the user's voice: on the basis of the voice input module 8, it extracts voice features after noise reduction and dereverberation, then generates and validates a voice model through a deep-learning-based algorithm; the voice recognition module 3 uses a plurality of algorithms and processing tools and is connected with the semantic recognition module 4;
the semantic recognition module 4 performs further semantic processing on the output of the voice recognition module 3 and deduces the user's intention; this stage analyzes the context to improve accuracy, and the semantic recognition module 4 is connected with the scene processing module 5;
the scene processing module 5 analyzes the recognition result of the semantic recognition module 4, adjusts the layout transformation of the scene accordingly, and outputs the result through the display module 7; this requires the module to call the knowledge base in the storage module 6 for the relevant processing, and the scene processing module 5 is connected with the storage module 6 and the display module 7;
the storage module 6 stores the knowledge base and the dialog library; the scene processing module 5 calls the required entries according to the result of the previous step, the dialog is output through the voice output module 9, and the knowledge base result through the display module 7;
the voice input module 8 comprises a plurality of audio input devices, and the voice input module 8 is connected with the voice output module 9;
the voice output module 9 outputs the result in the storage module 6 in voice;
the communication module is responsible for communication among the peripheral devices.
Further, the audio input device of the voice input module 8 includes a microphone, and the device of the voice output module 9 includes an earphone power amplifier.
The principle method based on the system in the embodiment comprises the following steps:
constructing the knowledge base and dialog library: first, store the corresponding dialog database in the storage module 6;
opening the cloud end 1 and the VR peripheral end 2: after both are opened, verify that the communication module works normally;
wearing the VR peripheral: the user perceives the virtual scene after putting on the VR peripheral;
user input: the user inputs voice through the audio input peripheral, either following the prompts of the virtual scene or actively;
cloud 1 processing: after processing at the cloud 1, the user receives the response information through the earphone at the VR peripheral end, and sees the response actions and expressions of the virtual scene on the display device of the display module 7.
Based on the above method steps, the specific use is as follows: the user inputs audio through an input device such as a microphone; the audio is transmitted to the cloud 1, where the voice recognition module 3 performs voice recognition to obtain the user information preliminarily, and the semantic recognition module 4 then performs semantic recognition so that the cloud 1 understands the user's instruction and deduces the user's intention; the instruction is then processed in the scene processing module 5 according to that intention, and the result is transmitted to the display module 7. Meanwhile, the knowledge base in the storage module 6 is called and the corresponding result is returned to the VR peripheral end 2, where the user watches the scene through the display module 7 and listens to the dialog information through the audio devices of the voice input module 8 and voice output module 9, obtaining dual visual and auditory feedback and stronger immersion.
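The context analysis that the semantic recognition module 4 performs (combining recent utterances to deduce intent) might look like the following sketch: keyword matching over a sliding window of recent utterances. The intent and keyword tables are invented for illustration; the patent's actual NLP pipeline is not specified at this level of detail.

```python
class SemanticRecognizer:
    """Toy intent inference over recognized text, using a sliding window
    of recent utterances as context."""

    def __init__(self, intents):
        self.intents = intents   # intent name -> list of trigger keywords
        self.history = []        # recent utterances kept for context

    def infer(self, text):
        self.history.append(text)
        window = " ".join(self.history[-3:])   # last few utterances
        best, best_hits = "unknown", 0
        for intent, keywords in self.intents.items():
            hits = sum(1 for k in keywords if k in window)
            if hits > best_hits:
                best, best_hits = intent, hits
        return best
```

Keeping a window rather than the single latest utterance is what lets an ambiguous follow-up ("yes, that one") resolve against what was said before.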
It should be noted that the invention is not limited to a cloud plus VR peripheral: the scene control system and the intelligent voice system may reside on any device independent of the VR peripheral, of which a cloud is one possibility; the cloud is used as the example here for ease of understanding.
A general-purpose core processor is kept independent of the VR peripheral end: because a large amount of data must be computed, a processor of high computing performance is needed, and at the present stage no processor meets the requirements of an all-in-one machine integrating the peripherals. The invention therefore changes the approach and places the processing in the cloud; this arrangement benefits from better networking performance and is better suited to big-data processing.
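Moving the processing to the cloud means microphone audio must be transmitted over the network in some wire format. A minimal length-prefixed framing could look like the sketch below; the patent does not specify a protocol, so the field layout here is purely an assumption for illustration.

```python
import struct

def pack_audio_chunk(seq, pcm_bytes):
    """Length-prefixed frame for streaming microphone audio to the cloud:
    4-byte big-endian sequence number, 4-byte payload length, raw PCM."""
    return struct.pack(">II", seq, len(pcm_bytes)) + pcm_bytes

def unpack_audio_chunk(frame):
    """Inverse of pack_audio_chunk: recover the sequence number and PCM."""
    seq, length = struct.unpack(">II", frame[:8])
    return seq, frame[8:8 + length]
```

The sequence number lets the cloud detect dropped or reordered chunks before feeding the audio into the recognition pipeline.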
Under this new approach, the user communicates with the character in the virtual scene through the voice input module 8 on the VR peripheral 2, such as a microphone. The voice recognition module 3 first removes interference factors from the surrounding environment by noise reduction, dereverberation and the like, then extracts voice features and performs analysis and modeling through a deep-learning algorithm based on a deep neural network to generate a voice model; it then compares and recognizes the voice information input by the user, parsing out its content and instruction information. On this basis, the semantic recognition module 4 performs NLP word segmentation, keyword analysis and the like, and the cloud 1 combines the context to deduce the user's probable intention. The scene processing module 5 then performs scene processing according to the result of the semantic recognition module 4: it calls the knowledge base in the storage module 6 to produce the corresponding scene response, including graphics adjustment and context processing, and feeds back the action transformation of the virtual character as output. The storage module 6 contains a knowledge graph base, a dialog library and the like; on the basis of the scene processing, the cloud 1 returns a response dialog from the dialog library of the storage module 6 while performing the related scene processing, such as response expressions or actions, in the scene processing module 5 according to the user's instructions. The processing results are transmitted to the display module 7 on the VR peripheral end 2 through a data line or other means, and the corresponding voice information is output at the same time, so that dual perception of vision and hearing is achieved and the user's immersion is greatly enhanced.
Through the above process analysis, the poor interactivity and strong sense of detachment of existing VR products are effectively overcome, and a more natural interactive experience between the user and virtual scene characters is realized.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative examples and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims should not be construed as limiting the claim concerned.
Claims (5)
1. A VR interaction system based on voice recognition, characterized in that: the system comprises a cloud end (1) and a VR peripheral end (2), wherein the cloud end (1) comprises a voice recognition module (3), a semantic recognition module (4), a scene processing module (5), a storage module (6) and a communication module; the VR peripheral end (2) comprises a display module (7), a voice input module (8) and a voice output module (9), and also comprises a communication module;
the voice recognition module (3) performs primary processing of the user's voice: on the basis of the voice input module (8), voice features are extracted after noise reduction and dereverberation, and a voice model is then generated and validated through a deep-learning-based algorithm; a plurality of algorithms and processing tools are used in this processing, and the voice recognition module (3) is connected with the semantic recognition module (4);
the semantic recognition module (4) performs further semantic processing on the output of the voice recognition module (3) and deduces the user's intention; this stage analyzes the context to improve accuracy, and the semantic recognition module (4) is connected with the scene processing module (5);
the scene processing module (5) analyzes the recognition result of the semantic recognition module (4), adjusts the layout transformation of the scene accordingly, and outputs the result through the display module (7); this requires the module to call the knowledge base in the storage module (6) for the relevant processing, and the scene processing module (5) is connected with the storage module (6) and the display module (7);
the storage module (6) stores the knowledge base and the dialog library; the scene processing module (5) calls the required entries according to the result of the previous step, the dialog is output through the voice output module (9), and the knowledge base result through the display module (7);
the voice input module (8) comprises a plurality of audio input devices, and the voice input module (8) is connected with the voice output module (9);
the voice output module (9) outputs the result in the storage module (6) in a voice mode;
the communication module is responsible for communication among the peripheral devices.
2. The VR interaction system of claim 1, wherein: the audio input device of the voice input module (8) comprises a microphone.
3. The VR interaction system of claim 1, wherein: the equipment of the voice input module (8) comprises an earphone power amplifier.
4. A method for the voice recognition based VR interaction system of any of claims 1-3, characterized by comprising the following steps:
constructing the knowledge base and dialog library: first, storing the corresponding dialog library in the storage module (6);
starting the cloud (1) and the VR peripheral end (2): after the cloud (1) and the VR peripheral end (2) are started, verifying that the communication module works normally;
wearing the VR peripheral: after putting on the VR peripheral, the user perceives the virtual scene;
user input: the user inputs voice through the audio input peripheral, either following prompts from the virtual scene or on the user's own initiative;
cloud (1) processing: after processing at the cloud (1), the user receives response information through the earphone at the VR peripheral end, while the response actions and expressions of the virtual scene appear on the display device of the display module (7).
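The five method steps above run as a linear sequence. A minimal sketch follows; the function names, dictionary keys, and sample utterances are hypothetical illustrations, not terms from the patent:

```python
# Hypothetical sketch of the claimed method steps (claim 4). Each step
# is a stub; a real system would talk to the cloud (1) and the VR
# peripheral end (2) over the communication module.

def build_knowledge_base():
    # Step 1: store the dialog library and knowledge base in the
    # storage module (6).
    return {"dialog": {"hi": "Hello!"}, "knowledge": {"hi": "greeting scene"}}

def start_endpoints():
    # Step 2: start the cloud (1) and the VR peripheral end (2),
    # and verify that the communication module works.
    return {"cloud": "up", "peripheral": "up", "link_ok": True}

def run_session(storage, endpoints, user_utterance):
    # Steps 3-5: the user wears the VR peripheral, speaks, and the
    # cloud returns an audio reply plus a scene update.
    assert endpoints["link_ok"], "communication module must be verified first"
    reply = storage["dialog"].get(user_utterance, "")
    scene = storage["knowledge"].get(user_utterance, "")
    return {"audio_reply": reply, "scene_update": scene}

storage = build_knowledge_base()
endpoints = start_endpoints()
result = run_session(storage, endpoints, "hi")
print(result["audio_reply"])   # heard through the earphone
print(result["scene_update"])  # shown by the display module (7)
```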
5. The method of claim 4, wherein the method comprises the following specific application of the method steps: in use, the user inputs audio through an input device such as a microphone; the audio is transmitted to the cloud (1), where the voice recognition module (3) performs voice recognition to obtain preliminary user information, and the semantic recognition module (4) then performs semantic recognition, so that the cloud (1) understands the user's instruction and infers the user's intention; the instruction is then processed in the scene processing module (5) according to that intention, and the result is transmitted to the display module (7); at the same time, the knowledge base in the storage module (6) is called and the corresponding result is returned to the VR peripheral end (2), so that the user watches the response through the display module (7) and listens to the dialog through the audio devices corresponding to the voice input module (8) and the voice output module (9), obtaining dual visual and auditory feedback and therefore greater immersion.
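The dual visual/auditory feedback of claim 5 amounts to fanning one cloud result out to two output channels. A minimal sketch, with all names assumed for illustration:

```python
# Hypothetical fan-out of a single cloud result to the two feedback
# channels described in claim 5: the display module (7) for vision and
# the voice output module (9) for hearing.

def fan_out(cloud_result, channels):
    # Deliver the same response to every registered output channel,
    # returning the delivery order for inspection.
    delivered = []
    for name, sink in channels.items():
        sink(cloud_result)
        delivered.append(name)
    return delivered

visual_log, audio_log = [], []
channels = {
    "display_module_7": visual_log.append,
    "voice_output_module_9": audio_log.append,
}

delivered = fan_out({"text": "Welcome", "action": "wave"}, channels)
print(delivered)  # both channels received the same response
```

Dispatching the same structured response to both channels is what yields the simultaneous visual and auditory feedback the claim describes.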
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910986351.7A CN110895931A (en) | 2019-10-17 | 2019-10-17 | VR (virtual reality) interaction system and method based on voice recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110895931A true CN110895931A (en) | 2020-03-20 |
Family
ID=69786337
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910986351.7A Pending CN110895931A (en) | 2019-10-17 | 2019-10-17 | VR (virtual reality) interaction system and method based on voice recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110895931A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106550156A (en) * | 2017-01-23 | 2017-03-29 | 苏州咖啦魔哆信息技术有限公司 | A kind of artificial intelligence's customer service system and its implementation based on speech recognition |
CN109841217A (en) * | 2019-01-18 | 2019-06-04 | 苏州意能通信息技术有限公司 | A kind of AR interactive system and method based on speech recognition |
US20190198019A1 (en) * | 2017-12-26 | 2019-06-27 | Baidu Online Network Technology (Beijing) Co., Ltd | Method, apparatus, device, and storage medium for voice interaction |
CN110335595A (en) * | 2019-06-06 | 2019-10-15 | 平安科技(深圳)有限公司 | Slotting based on speech recognition asks dialogue method, device and storage medium |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111696536A (en) * | 2020-06-05 | 2020-09-22 | 北京搜狗科技发展有限公司 | Voice processing method, apparatus and medium |
CN111696536B (en) * | 2020-06-05 | 2023-10-27 | 北京搜狗智能科技有限公司 | Voice processing method, device and medium |
CN111768768A (en) * | 2020-06-17 | 2020-10-13 | 北京百度网讯科技有限公司 | Voice processing method and device, peripheral control equipment and electronic equipment |
CN111768768B (en) * | 2020-06-17 | 2023-08-29 | 北京百度网讯科技有限公司 | Voice processing method and device, peripheral control equipment and electronic equipment |
CN111986297A (en) * | 2020-08-10 | 2020-11-24 | 山东金东数字创意股份有限公司 | Virtual character facial expression real-time driving system and method based on voice control |
CN111939558A (en) * | 2020-08-19 | 2020-11-17 | 北京中科深智科技有限公司 | Method and system for driving virtual character action by real-time voice |
CN112216278A (en) * | 2020-09-25 | 2021-01-12 | 威盛电子股份有限公司 | Speech recognition system, instruction generation system and speech recognition method thereof |
CN113672155A (en) * | 2021-07-02 | 2021-11-19 | 浪潮金融信息技术有限公司 | Self-service operating system, method and medium based on VR technology |
CN113672155B (en) * | 2021-07-02 | 2023-06-30 | 浪潮金融信息技术有限公司 | VR technology-based self-service operation system, method and medium |
CN117391822A (en) * | 2023-12-11 | 2024-01-12 | 中汽传媒(天津)有限公司 | VR virtual reality digital display method and system for automobile marketing |
CN117391822B (en) * | 2023-12-11 | 2024-03-15 | 中汽传媒(天津)有限公司 | VR virtual reality digital display method and system for automobile marketing |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110895931A (en) | VR (virtual reality) interaction system and method based on voice recognition | |
WO2022048403A1 (en) | Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal | |
CN106653052B (en) | Virtual human face animation generation method and device | |
WO2022052481A1 (en) | Artificial intelligence-based vr interaction method, apparatus, computer device, and medium | |
CN111145322B (en) | Method, apparatus, and computer-readable storage medium for driving avatar | |
US20230042654A1 (en) | Action synchronization for target object | |
CN113454708A (en) | Linguistic style matching agent | |
CN108877336A (en) | Teaching method, cloud service platform and tutoring system based on augmented reality | |
JP2022524944A (en) | Interaction methods, devices, electronic devices and storage media | |
CN107003825A (en) | System and method with dynamic character are instructed by natural language output control film | |
CN112668407A (en) | Face key point generation method and device, storage medium and electronic equipment | |
Morishima | Real-time talking head driven by voice and its application to communication and entertainment | |
CN205451551U (en) | Speech recognition driven augmented reality human -computer interaction video language learning system | |
El Haddad et al. | Laughter and smile processing for human-computer interactions | |
KR20060091329A (en) | Interactive system and method for controlling an interactive system | |
US20220301250A1 (en) | Avatar-based interaction service method and apparatus | |
Ding et al. | Interactive multimedia mirror system design | |
CN114201596A (en) | Virtual digital human use method, electronic device and storage medium | |
Chandrasiri et al. | Internet communication using real-time facial expression analysis and synthesis | |
Morishima et al. | Face-to-face communicative avatar driven by voice | |
Leandro Parreira Duarte et al. | Coarticulation and speech synchronization in MPEG-4 based facial animation | |
Ohsugi et al. | A Comparative Study of Statistical Conversion of Face to Voice Based on Their Subjective Impressions. | |
Santos-Pérez et al. | AVATAR: an open source architecture for embodied conversational agents in smart environments | |
Sundblad et al. | OLGA—a multimodal interactive information assistant | |
Zoric et al. | Towards facial gestures generation by speech signal analysis using huge architecture |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20200320 |