CN113763940A - Voice information processing method and system for AR glasses - Google Patents


Info

Publication number
CN113763940A
CN113763940A (application CN202110922820.6A)
Authority
CN
China
Prior art keywords
voice
glasses
recognizing
information processing
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110922820.6A
Other languages
Chinese (zh)
Inventor
苗顺平 (Miao Shunping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ileja Tech Co ltd
Original Assignee
Beijing Ileja Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ileja Tech Co ltd filed Critical Beijing Ileja Tech Co ltd
Priority to CN202110922820.6A priority Critical patent/CN113763940A/en
Publication of CN113763940A publication Critical patent/CN113763940A/en
Pending legal-status Critical Current


Classifications

    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise or of stress-induced speech
    • G06F3/013 Eye tracking input arrangements (interaction between user and computer)
    • G10L15/26 Speech to text systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02165 Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
    • G10L2021/02166 Microphone arrays; beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application discloses a voice information processing method and system for AR glasses. The method comprises: receiving sound through a microphone array consisting of two or more microphones; recognizing the voice of a speaker through local computation or cloud computing; recognizing the voice data as text information through a voice processing algorithm; sending the data to a cloud server through a wireless network connection module and receiving the information returned by the server; and displaying the returned processing information of the cloud server on the AR glasses. The method and system address the technical problems that voice recognition and translation on a mobile phone must be started and stopped by manual clicking, cannot recognize the other party's speech, require a quiet environment or close proximity to reduce the influence of ambient noise, and demand real-time handling of the phone, which is cumbersome.

Description

Voice information processing method and system for AR glasses
Technical Field
The application relates to the technical field of computers, in particular to a voice information processing method and system for AR glasses.
Background
The voice recognition function is now common on mobile phones and has gradually become a new input mode, solving many interaction problems. However, to achieve a good recognition result, the user must hold the phone close or ensure a quiet surrounding environment; and if the speech of the other party in a conversation is to be recognized, holding the phone out toward that person is awkward.
At present, the prior art on the market has the following defects:
(1) voice recognition and translation on a mobile phone must be started and stopped by manual clicking, and the speech of the other party in a conversation cannot be recognized;
(2) recognition and translation on a mobile phone require a quiet environment or close proximity to overcome the influence of ambient noise on voice recognition;
(3) voice recognition and translation on a mobile phone require holding and operating the phone in real time, which is cumbersome.
No effective solution has yet been proposed for these problems in the related art.
Disclosure of Invention
The present application mainly aims to provide a method and a system for processing voice information of AR glasses, so as to solve the above problems.
In order to achieve the above object, according to one aspect of the present application, there is provided a voice information processing method for AR glasses.
The voice information processing method for the AR glasses according to the application comprises the following steps:
receiving sound through a microphone array consisting of two or more microphones;
recognizing the voice of a speaker through local algorithm operation or cloud computing;
recognizing the voice data as text information through a voice processing algorithm;
the data are sent to a cloud server through a wireless network connection module, and information returned by the server is received;
and displaying the received returned processing information of the cloud server on the AR glasses.
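The five steps above can be sketched as a minimal pipeline. Everything below is an illustrative stand-in, not the patented implementation: the patent does not specify algorithms or APIs, so the function names and the canned recognition and translation results are assumptions used only to show the data flow.

```python
def capture(mics):
    # Step 1: collect one frame of samples from each microphone in the array.
    return [m() for m in mics]

def isolate_speaker(frames):
    # Step 2: stand-in for speaker isolation -- simply pick the loudest channel.
    return max(frames, key=lambda f: sum(s * s for s in f))

def recognize(frame):
    # Step 3: stand-in for the speech-to-text algorithm.
    return "hello"

def query_cloud(text):
    # Step 4: stand-in for the round trip over the wireless network module.
    return {"translation": text.upper()}

def display(result):
    # Step 5: stand-in for rendering the server's reply on the AR lens.
    return "AR> " + result["translation"]

# Two fake microphones returning canned sample frames.
mics = [lambda: [0.1, 0.2], lambda: [0.5, 0.6]]
print(display(query_cloud(recognize(isolate_speaker(capture(mics))))))  # AR> HELLO
```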
Further, the system also comprises working logic that performs arithmetic operations through an integrated operation unit and controls the whole system.
Further, the sound reception by a microphone array formed by two or more microphones specifically includes:
directionally acquiring human voice and environmental sound through a physical structure, improving directional audio sensitivity;
algorithmically enhancing human voice through beamforming techniques.
Further, the directional acquisition of human voice and environmental sound through a physical structure includes acquiring sound from different directions by orienting the microphones differently and pairing them with an external sound-receiving structure, where the sound-receiving structure includes, but is not limited to, one or more of a specially designed cylindrical or conical structure and a sealing mechanism.
Further, audio within a specific angle in front of the glasses wearer is directionally enhanced while sound from other areas is attenuated, ensuring that the interlocutor's voice is heard clearly.
Further, the algorithmic enhancement of human voice through beamforming includes:
forming directivity with a beamforming technique and steering the array toward the target direction, so as to reduce interfering noise from the environment;
reducing the stationary noise remaining along the steering axis through a noise suppression function.
Further, the recognizing of the voice of the interlocutor through local computation or cloud computing further includes:
translating the interlocutor's speech into other languages;
translating the wearer's voice into the language or text of the interlocutor.
Further, the recognizing the voice data as text information also includes a voice recognition algorithm.
Further, the sending of data to the cloud server and receiving of information returned by the server further includes:
connecting to other devices, which in turn connect to the cloud server to send and receive data.
In order to achieve the above object, according to another aspect of the present application, there is provided AR glasses having a voice information processing system for recognizing a voice of a counterpart.
According to the AR glasses of the application, the voice information processing system for recognizing the voice of the opposite side comprises:
the display module comprises one or two AR display lens units;
the transmission module comprises one or more of Wi-Fi, Bluetooth and a mobile network;
the sensor comprises a microphone;
the battery module comprises a battery and a power supply management unit;
and the control module comprises a computing unit of the AR glasses and a user interaction control unit.
Further, the display module may also include one or more of, but not limited to, an indicator light and a display screen in addition to the AR display unit.
Further, the sensor also includes one or more of, but not limited to, an RGB camera, a TOF camera, a laser radar, a gyroscope, a gravity accelerometer, a geomagnetic sensor, a distance sensor, and a speaker.
Further, the control module:
the computing unit of the AR glasses comprises one or more of a CPU, a memory and a storage;
the user interaction control unit comprises one or more of keys, a touch screen, a vibration sensor and a remote controller.
In the embodiments of the application, microphone array technology with sound source localization and voiceprint recognition is adopted: sound source localization and noise reduction are achieved through beamforming, and directional acquisition of human voice and environmental sound is achieved through a physical structure, improving directional audio sensitivity. Communication becomes more efficient and the mobile phone no longer needs to be operated. This solves the technical problems of dependence on the phone: manual control is required, the other party's speech cannot be recognized, a quiet environment or close proximity is needed to reduce the influence of ambient noise on voice recognition, and real-time handling of the phone is cumbersome.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:
fig. 1 is a schematic diagram of AR glasses directly connected to a cloud server according to an embodiment of the present application;
fig. 2 is a schematic diagram of AR glasses connected to a cloud server through a mobile phone according to an embodiment of the present application;
FIG. 3 is a comparison graph of sound reception effects of different orientations of the AR glasses microphone according to the embodiment of the present application;
FIG. 4 is a schematic diagram of the components of AR glasses according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a dual-microphone AR glasses layout according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a multi-microphone AR glasses layout according to an embodiment of the present application;
fig. 7 is a schematic diagram of directional speech enhancement of AR glasses according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances, so that the embodiments of the application described herein can be practiced in orders other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In this application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings. These terms are used primarily to better describe the invention and its embodiments and are not intended to limit the indicated systems, elements or components to a particular orientation or to be constructed and operated in a particular orientation.
Moreover, some of the above terms may be used to indicate other meanings besides the orientation or positional relationship, for example, the term "on" may also be used to indicate some kind of attachment or connection relationship in some cases. The specific meanings of these terms in the present invention can be understood by those skilled in the art as appropriate.
Furthermore, the terms "mounted," "disposed," "provided," "connected," and "sleeved" are to be construed broadly. For example, it may be a fixed connection, a removable connection, or a unitary construction; can be a mechanical connection, or an electrical connection; may be directly connected, or indirectly connected through intervening media, or may be in communication between two systems, components or parts. The specific meanings of the above terms in the present invention can be understood by those of ordinary skill in the art according to specific situations.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
According to an embodiment of the present invention, as shown in fig. 1, there is provided a voice information processing method for AR glasses, the method including the steps of:
a microphone array formed by two or more microphones is used to receive sound, as shown in fig. 3, which specifically includes:
the directional acquisition of human voice and environmental sound is realized through a physical structure: sound from different directions is acquired by orienting the microphones differently and pairing them with an external sound-receiving structure, improving directional audio sensitivity, where the sound-receiving structure comprises one or more of a specially designed cylindrical or conical structure and a sealing mechanism;
the algorithmic enhancement of human voice is realized through beamforming: directivity is formed and the array is steered toward the target direction, reducing interfering noise from the environment, while the stationary noise remaining along the steering axis is reduced through a noise suppression function.
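As a concrete illustration of the beamforming step, here is a minimal delay-and-sum sketch. It is a simplification under stated assumptions (integer sample lags per channel and a known steering direction); the patent does not disclose its actual algorithm.

```python
def delay_and_sum(channels, lags):
    # lags[k] = samples by which channel k lags the reference for the steered
    # direction; compensating them makes on-axis sound add coherently while
    # off-axis sound averages down.
    out = []
    for i in range(len(channels[0])):
        acc = 0.0
        for ch, d in zip(channels, lags):
            j = i + d  # advance each lagging channel back into alignment
            acc += ch[j] if 0 <= j < len(ch) else 0.0
        out.append(acc / len(channels))
    return out

# An impulse from the steered direction reaches channel 2 two samples late.
ch1 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
ch2 = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
steered = delay_and_sum([ch1, ch2], [0, 2])    # coherent peak of 1.0
unsteered = delay_and_sum([ch1, ch2], [0, 0])  # smeared peaks of 0.5
```

Steering toward the impulse doubles its relative amplitude; the same sum with no lag compensation leaves two half-height peaks, which is the off-axis attenuation the description relies on.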
The voice of the interlocutor is recognized through local computation or cloud computing; the interlocutor's speech can be translated into other languages, and the wearer's voice can be translated into the language or text of the interlocutor.
The voice data is recognized as text information through voice processing and voice recognition algorithms.
The data is sent to the cloud server through the wireless network connection module and the information returned by the server is received, as shown in fig. 2; sending and receiving may also proceed indirectly, with the glasses connecting to another device that in turn connects to the cloud server.
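A hedged sketch of the cloud exchange follows. The JSON payload shape, the field names, and the mocked reply are all assumptions for illustration only: the patent does not define a protocol, endpoint, or message format, so no real network call is made here.

```python
import json

def build_request(text, src, dst):
    # Hypothetical request payload carrying the recognized text plus the
    # source and target languages for translation.
    return json.dumps({"text": text, "from": src, "to": dst})

def parse_response(raw):
    # Extract the processed result (recognition/translation) for display.
    return json.loads(raw)["result"]

req = build_request("hello", "en", "zh")
mock_reply = json.dumps({"result": "你好"})  # stand-in for the server's answer
print(parse_response(mock_reply))  # 你好
```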
The returned processing information of the cloud server, including but not limited to recognition and translation results, is displayed on the AR glasses; the returned information may also be output as audio through one or more of a loudspeaker and an earphone.
In a further embodiment, the system further comprises working logic that performs arithmetic operations through an integrated operation unit and controls the whole system.
From the above description, it can be seen that the present invention achieves the following technical effects:
in the embodiments of the application, microphone array technology, sound source localization and voiceprint recognition are adopted; sound source localization and noise reduction are realized through beamforming, and directional acquisition of human voice and environmental sound is realized through a physical structure, improving directional audio sensitivity, making communication more efficient and removing the need to operate a mobile phone.
It should be noted that the steps may be performed in a computer system such as a set of computer-executable instructions and, in some cases, the steps shown or described may be performed in a different order than presented herein.
According to an embodiment of the present invention, as shown in fig. 4, there is also provided AR glasses having a voice information processing system for recognizing a voice of a counterpart, the voice information processing system including:
the display module comprises 1 or two pieces of lens units for AR display, and also comprises one or more of but not limited to an indicator light and a display screen except for the AR display unit. The display module realizes display of the AR video signal and display of the working state of the AR glasses, such as battery power and the current working mode.
The transmission module realizes wireless network transmission and comprises one or more of Wi-Fi, Bluetooth and a mobile network.
The sensors include, but are not limited to, one or more of an RGB camera, a TOF camera, a lidar, a gyroscope, a gravity accelerometer, a geomagnetic sensor, a distance sensor, a speaker, in addition to a microphone.
The battery module comprises a battery and a power management unit, and power supply and charging and discharging management of the AR glasses are achieved.
The control module comprises a computing unit of the AR glasses, including but not limited to one or more of a CPU, a memory and a storage, and a user interaction control unit. The user interaction control unit comprises one or more of keys, a touch screen, a vibration sensor and a remote controller, and controls the working state of the AR glasses, such as power on/off and settings changes.
In one embodiment, as shown in fig. 5, the AR glasses have only two microphones. Microphone 1 may be disposed at the left or right temple, oriented downward or inward, to capture the wearer's voice; microphone 2 is located on the glasses frame, oriented forward, to capture the interlocutor's voice. Noise reduction is implemented through a beamforming algorithm, and the voices of the interlocutor and the wearer are distinguished through sound source localization.
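To make the two-microphone arrangement concrete, here is a deliberately simplified reference-cancellation sketch: it assumes the forward microphone can serve as a noise reference for the wearer's channel and fits a single least-squares gain. The patent names beamforming and sound source localization but gives no formulas, so this is illustrative only.

```python
def cancel_reference(primary, reference):
    # Fit one least-squares gain g for the reference channel and subtract
    # g * reference from the primary channel; if the wearer's speech is
    # uncorrelated with the reference, only the noise component is removed.
    num = sum(p * r for p, r in zip(primary, reference))
    den = sum(r * r for r in reference) or 1.0
    g = num / den
    return [p - g * r for p, r in zip(primary, reference)]

speech = [1.0, -1.0, 1.0, -1.0]                         # wearer's voice
noise = [1.0, 1.0, 1.0, 1.0]                            # noise on the forward mic
primary = [s + 0.5 * n for s, n in zip(speech, noise)]  # what microphone 1 hears
cleaned = cancel_reference(primary, noise)              # recovers the speech
```

Because the example speech is exactly uncorrelated with the constant noise, the fitted gain is 0.5 and the cleaned signal equals the original speech.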
In another embodiment, as shown in fig. 6, the AR glasses have three or more microphones. Microphone 1 is located on the side of the glasses, facing downward toward the wearer, to capture the wearer's voice; microphone 2 is located on the glasses frame, facing forward toward the interlocutor, to capture the interlocutor's voice; the remaining microphones may face forward and outward. Noise reduction is implemented through a beamforming algorithm, and the voices of the interlocutor and the wearer are distinguished through sound source localization.
In another embodiment, as shown in fig. 7, the microphone array of the AR glasses directionally enhances audio within a specific angle in front of the wearer and attenuates sound from other areas, ensuring that the interlocutor's voice within the enhanced angle is clearly captured.
Through repeated experimental adjustment, the audio directional enhancement of the microphone array was found to work best within a 45-degree angle in front of the AR glasses wearer, while sound from other areas is attenuated.
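Purely as an illustration of the 45-degree result above, the directional enhancement can be modeled as an angular gain taper. The raised-cosine shape and the hard cutoff at the sector edge are assumptions, not taken from the patent.

```python
import math

def directional_gain(angle_deg, half_width=45.0):
    # Unity gain straight ahead, tapering to zero at the edge of the
    # +/-45 degree front sector; everything outside is fully attenuated.
    a = abs(angle_deg)
    if a >= half_width:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * a / half_width))

# directional_gain(0.0) -> 1.0; directional_gain(45.0) -> 0.0
```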
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented in a general purpose computing system, centralized on a single computing system or distributed across a network of multiple computing systems, or alternatively implemented in program code that is executable by a computing system, such that the modules or steps may be stored in a memory system and executed by a computing system, fabricated separately as integrated circuit modules, or fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A voice information processing method for AR glasses, comprising:
receiving sound through a microphone array consisting of two or more microphones;
recognizing the voice of a speaker through local algorithm operation or cloud computing;
recognizing the voice data as text information through a voice processing algorithm;
the data are sent to a cloud server through a wireless network connection module, and information returned by the server is received;
and displaying the received returned processing information of the cloud server on the AR glasses.
2. The voice information processing method for AR glasses according to claim 1, further comprising a working logic for performing arithmetic operations by the integrated operation unit and controlling the entire system.
3. The voice information processing method for AR glasses according to claim 1, wherein the receiving sound through a microphone array consisting of two or more microphones comprises:
directionally acquiring human voice and environmental sound through a physical structure, improving directional audio sensitivity;
algorithmically enhancing human voice through beamforming techniques.
4. The voice information processing method for AR glasses according to claim 1, wherein the recognizing of the voice of the interlocutor through local computation or cloud computing further comprises:
translating the interlocutor's speech into other languages;
translating the wearer's voice into the language or text of the interlocutor.
5. The method of claim 1, wherein the recognizing the voice data as text information further comprises a voice recognition algorithm.
6. The voice information processing method for AR glasses according to claim 1, wherein the sending data to the cloud server and receiving information returned by the server further comprises:
and the device is connected with other devices and is connected with the cloud server to send and receive data.
7. AR glasses having a voice information processing system for recognizing a voice of a counterpart, the voice information processing system comprising:
the display module comprises one or two AR display lens units;
the transmission module comprises one or more of Wi-Fi, Bluetooth and a mobile network;
the sensor comprises a microphone;
the battery module comprises a battery and a power supply management unit;
and the control module comprises a computing unit of the AR glasses and a user interaction control unit.
8. The AR glasses having a voice information processing system for recognizing a voice of a counterpart according to claim 7, wherein the display module further comprises one or more of, but not limited to, an indicator light and a display screen in addition to the AR display unit.
9. The AR glasses according to claim 7, wherein the sensors further comprise one or more of but not limited to RGB cameras, TOF cameras, lidar, gyroscopes, accelerometers, geomagnetic sensors, distance sensors, speakers.
10. The AR glasses according to claim 7, wherein the control module:
the computing unit of the AR glasses comprises one or more of a CPU, a memory and a storage;
the user interaction control unit comprises one or more of keys, a touch screen, a vibration sensor and a remote controller.
CN202110922820.6A 2021-08-11 2021-08-11 Voice information processing method and system for AR glasses Pending CN113763940A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110922820.6A CN113763940A (en) 2021-08-11 2021-08-11 Voice information processing method and system for AR glasses

Publications (1)

Publication Number Publication Date
CN113763940A true CN113763940A (en) 2021-12-07

Family

ID=78789119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110922820.6A Pending CN113763940A (en) 2021-08-11 2021-08-11 Voice information processing method and system for AR glasses

Country Status (1)

Country Link
CN (1) CN113763940A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114267323A (en) * 2021-12-27 2022-04-01 深圳市研强物联技术有限公司 Voice hearing aid AR glasses for deaf-mutes and communication method thereof
CN114550430A (en) * 2022-04-27 2022-05-27 北京亮亮视野科技有限公司 Character reminding method and device based on AR technology

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107561695A (en) * 2016-06-30 2018-01-09 上海擎感智能科技有限公司 A kind of intelligent glasses and its control method
US20200194028A1 (en) * 2018-12-18 2020-06-18 Colquitt Partners, Ltd. Glasses with closed captioning, voice recognition, volume of speech detection, and translation capabilities
CN113093387A (en) * 2021-04-12 2021-07-09 深圳市东鲤科技有限公司 Translation AR glasses and text translation method thereof
CN113157099A (en) * 2021-05-26 2021-07-23 北京乐驾科技有限公司 Intelligent glasses



Similar Documents

Publication Publication Date Title
EP4060658A1 (en) Voice wake-up method, apparatus, and system
CN107464564B (en) Voice interaction method, device and equipment
EP2842055B1 (en) Instant translation system
KR102175602B1 (en) Audio focusing via multiple microphones
US12032155B2 (en) Method and head-mounted unit for assisting a hearing-impaired user
CN109360549B (en) Data processing method, wearable device and device for data processing
CN113763940A (en) Voice information processing method and system for AR glasses
US11861265B2 (en) Providing audio information with a digital assistant
US11893997B2 (en) Audio signal processing for automatic transcription using ear-wearable device
CN115620728B (en) Audio processing method and device, storage medium and intelligent glasses
CN111696562A (en) Voice wake-up method, device and storage medium
US20210090548A1 (en) Translation system
KR20210044509A (en) An electronic device supporting improved speech recognition
CN114422935B (en) Audio processing method, terminal and computer readable storage medium
CN112394771A (en) Communication method, communication device, wearable device and readable storage medium
JP2014027459A (en) Communication apparatus, communication method, and communication program
CN115086888B (en) Message notification method and device and electronic equipment
CN111343420A (en) Voice enhancement method and wearing equipment
KR20210122568A (en) Electronic device and method for controlling audio output thereof
EP4422215A1 (en) Audio playing method and related apparatus
EP4044018A1 (en) Display device and method for controlling same
US12033628B2 (en) Method for controlling ambient sound and electronic device therefor
KR20240050963A (en) Electronic device for providing audio effect, operating method thereof, and storage medium
KR20230012207A (en) Method for storing image and electronic device supporting the same
KR20200112087A (en) Electronic device controlling characteristic of object based on movement of user and method for controlling thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination