CN117095591A - Audio-visual assistance method, system, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117095591A
CN117095591A (application number CN202311054741.3A)
Authority
CN
China
Prior art keywords
sound source
text information
voice
voice signal
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311054741.3A
Other languages
Chinese (zh)
Inventor
潘学殿 (Pan Xuedian)
崔荣涛 (Cui Rongtao)
李锦程 (Li Jincheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iflytek Medical Technology Co ltd
Original Assignee
Iflytek Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iflytek Medical Technology Co ltd filed Critical Iflytek Medical Technology Co ltd
Priority to CN202311054741.3A priority Critical patent/CN117095591A/en
Publication of CN117095591A publication Critical patent/CN117095591A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G09: EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B: EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B21/00: Teaching, or communicating with, the blind, deaf or mute
    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61F: FILTERS IMPLANTABLE INTO BLOOD VESSELS; PROSTHESES; DEVICES PROVIDING PATENCY TO, OR PREVENTING COLLAPSING OF, TUBULAR STRUCTURES OF THE BODY, e.g. STENTS; ORTHOPAEDIC, NURSING OR CONTRACEPTIVE DEVICES; FOMENTATION; TREATMENT OR PROTECTION OF EYES OR EARS; BANDAGES, DRESSINGS OR ABSORBENT PADS; FIRST-AID KITS
    • A61F11/00: Methods or devices for treatment of the ears or hearing sense; Non-electric hearing aids; Methods or devices for enabling ear patients to achieve auditory perception through physiological senses other than hearing sense; Protective devices for the ears, carried on the body or in the hand
    • A61F11/04: Methods or devices for enabling ear patients to achieve auditory perception through physiological senses other than hearing sense, e.g. through the touch sense
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • G06V40/165: Detection; Localisation; Normalisation using facial parts and geometric relationships

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Acoustics & Sound (AREA)
  • Animal Behavior & Ethology (AREA)
  • Neurology (AREA)
  • Educational Technology (AREA)
  • Biophysics (AREA)
  • Otolaryngology (AREA)
  • Psychology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Vascular Medicine (AREA)
  • Physiology (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • Educational Administration (AREA)
  • Business, Economics & Management (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Geometry (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses an audio-visual assistance method, system, apparatus, electronic device, and storage medium. The method comprises: acquiring an image signal and a voice signal from the environment; converting the voice signal into text information; recognizing the image signal and, in combination with the voice signal, determining the sound source coordinates of the voice signal; and displaying the text information based on the sound source coordinates. By converting the voice signal into text information and matching it with the image signal, the method can project the text information into the environment, stimulating the user's visual sense, making it easier for the user to concentrate, and improving speech comprehension.

Description

Audio-visual assistance method, system, device, electronic equipment and storage medium
Technical Field
The embodiments of the application relate to audio-visual assistance technology, and in particular to an audio-visual assistance method, system, apparatus, electronic device, and storage medium.
Background
The number of hearing-impaired people worldwide is increasing year by year. By some estimates, about 500 million people worldwide suffer from hearing impairment of varying degrees, and more than half of people over 65 have some level of hearing loss. Because of their impairment, they cannot hear surrounding sounds normally or communicate effectively with others, which causes great inconvenience in their work, study, and daily life. Current hearing devices merely amplify the digitized sound signal, so some words may come through distorted or ambiguous: the hearing-impaired user can hear the sound but still cannot understand it.
Disclosure of Invention
The embodiments of the application provide an audio-visual assistance method, system, apparatus, electronic device, and storage medium, to solve the technical problem in the prior art that hearing-aid devices merely amplify sound, leaving users able to hear speech but unable to understand its meaning.
In order to solve the technical problems, the embodiment of the application discloses the following technical scheme:
in a first aspect, an audiovisual assistance method is provided, comprising:
acquiring an image signal and a voice signal in the environment;
converting the voice signal into text information;
identifying the image signal, and determining the sound source coordinates of the voice signal by combining the voice signal;
and displaying the text information based on the sound source coordinates.
With reference to the first aspect, displaying the text information based on the sound source coordinates includes:
identifying the content of the voice signal and converting the content into text information;
projecting, according to the sound source coordinates, the text information into the environment through an augmented reality tool, positioned close to the sound source coordinates, so that a human eye wearing the augmented reality tool can acquire the text information; wherein the text information and the voice signal are in the same language or in different languages.
With reference to the first aspect, identifying the image signal and determining the sound source coordinates of the voice signal in combination with the voice signal includes:
acquiring face features and/or lip features from the image signal;
identifying the speaker, namely the sound source, through the face features and/or the lip features;
and establishing a two-dimensional or three-dimensional coordinate system in the image signal, and representing the position of the sound source through coordinates.
With reference to the first aspect, the method for displaying the text information based on the sound source coordinates further includes:
projecting a display area in the environment through an augmented reality tool, and projecting the text information into the display area;
the display area is brought into close proximity with the sound source coordinates to enable acquisition by a human eye wearing the augmented reality tool.
With reference to the first aspect, before the text information is displayed based on the sound source coordinates, the method further includes identifying a type of the sound source; the types of sound sources include real persons and non-real persons.
With reference to the first aspect, the method for displaying the text information based on the sound source coordinates further includes:
changing the shape and position of the display area according to the type of the sound source;
when the type of the sound source is a real person, the display area is a bubble type, and the display area is directed to the real person;
when the type of the sound source is a non-real person, the display area is rectangular, circular, elliptical or triangular.
With reference to the first aspect, the method further includes playing the voice signal through a speaker after noise reduction processing, specifically:
acquiring an image signal in the environment by an image pickup apparatus;
acquiring a voice signal in the environment through a sound receiving device;
wherein the image signal comprises video and/or successive pictures;
performing noise reduction processing on the acquired voice signal, filtering out the noise part of the environment, and retaining the human-voice part of the voice signal;
and inputting the human-voice part into the speaker for playing.
With reference to the first aspect, the playing modes include amplified playing and translated playing: amplified playing amplifies the human-voice part before playing it, while translated playing translates the content of the human-voice part into a different language and reads it aloud.
In a second aspect, there is provided an audiovisual assistance system, the system comprising:
the image acquisition module is used for acquiring image signals in the environment;
the voice acquisition module is used for acquiring voice signals in the environment;
the text conversion module is used for converting the voice signal into text information;
the projection module is used for projecting the text information to the environment for display;
the identification judging module is used for identifying the image signal and determining the sound source coordinates of the voice signal by combining the voice signal;
and the matching module is used for displaying the text information based on the sound source coordinates.
With reference to the second aspect, the projection module includes an augmented reality tool for projecting the text information into the environment for acquisition by a human eye wearing the augmented reality tool.
With reference to the second aspect, the device further comprises a voice processing module and a speaker module, wherein the voice processing module is used for performing noise reduction processing on the voice signal; the loudspeaker module is used for playing the voice signal.
In a third aspect, an audio-visual auxiliary apparatus is provided, the apparatus comprising a projection device and a playing device; the projection device comprises a lens and brackets, wherein the brackets are arranged on two sides of the lens and connect to and fix the lens; the projection device is provided with a camera, and the camera faces the same direction as the lens; the projection device is further provided with a first sound receiving tool; the playing device is provided with a second sound receiving tool and a speaker;
the projection device further comprises a display screen and a processor, wherein the display screen is arranged in the lens, and the processor is electrically connected with the camera, the first sound receiving tool, the second sound receiving tool and the loudspeaker respectively.
With reference to the third aspect, the projection device and the playing device are integrally configured, and the playing device is disposed at one end of the support away from the lens.
With reference to the third aspect, the projection device and the playing device are in a split structure, and the playing device and the projection device are connected by bluetooth.
In combination with the third aspect, the projection device and the playing device are in a split structure, the playing device is detachably connected to one end of the support, which is far away from the lens, and the playing device and the projection device are connected by adopting bluetooth or a wire.
With reference to the third aspect, the first sound receiving tool includes a first microphone, a second microphone, and a third microphone, where the first microphone is disposed at the end of the bracket near the lens; the second microphone is disposed at the middle of the bracket; and the third microphone is disposed at the end of the bracket away from the lens;
wherein the first microphone is oriented in the same direction as the camera; the second microphone faces to two sides of the projection device; the third microphone is oriented opposite to the first microphone.
With reference to the third aspect, the projection device includes AR glasses, and the playing device includes one or more of a hearing aid, a bluetooth headset, or a wired headset.
In a fourth aspect, there is provided an electronic device comprising a memory storing a computer program and a processor implementing the audiovisual assistance method of any one of the first aspects when the computer program is executed.
In a fifth aspect, there is provided a computer readable storage medium storing a computer program which when executed by a processor implements the audiovisual assistance method of any one of the first aspects.
One of the above technical solutions has the following advantages or beneficial effects:
Compared with the prior art, the audio-visual assistance method comprises the following steps: acquiring an image signal and a voice signal from the environment; converting the voice signal into text information; recognizing the image signal and, in combination with the voice signal, determining the sound source coordinates of the voice signal; and displaying the text information based on the sound source coordinates. By converting the voice signal into text information and matching it with the image signal, the method can project the text information into the environment, stimulating the user's visual sense, making it easier for the user to concentrate, and improving speech comprehension.
The present application provides an audio-visual assistance system, comprising: an image acquisition module for acquiring image signals in the environment; a voice acquisition module for acquiring voice signals in the environment; a text conversion module for converting the voice signal into text information; a projection module for projecting the text information into the environment for display; a recognition and judging module for recognizing the image signal and determining the sound source coordinates of the voice signal in combination with the voice signal; and a matching module for displaying the text information based on the sound source coordinates. By converting the voice signal into text information and matching it with the image signal, the system can project the text information into the environment, stimulating the user's visual sense, making it easier for the user to concentrate, and improving speech comprehension.
The present application provides an audio-visual auxiliary apparatus comprising a projection device and a playing device; the projection device comprises a lens and brackets, wherein the brackets are arranged on two sides of the lens and connect to and fix the lens; the projection device is provided with a camera facing the same direction as the lens, and is further provided with a first sound receiving tool; the playing device is provided with a second sound receiving tool and a speaker; the projection device further comprises a display screen and a processor, the display screen being arranged in the lens and the processor being electrically connected to the camera, the first sound receiving tool, the second sound receiving tool, and the speaker. With this apparatus, the camera acquires images from the environment, the two sound receiving tools acquire speech from the environment, and the processor converts the speech into text matched with the image, so that the text can be projected onto the lens and displayed in the environment, stimulating the user's visual sense, making it easier to concentrate, and improving speech comprehension.
Drawings
The technical solution and other advantageous effects of the present application will be made apparent by the following detailed description of the specific embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of an audio-visual auxiliary method according to an embodiment of the present application;
fig. 2 is a schematic diagram of module connection of an audio-visual auxiliary system according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an audio-visual auxiliary device according to an embodiment of the present application;
fig. 4 is a schematic side view of an audio-visual auxiliary device according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of an audio-visual auxiliary device according to other embodiments of the present application;
fig. 6 is a schematic structural diagram of an audio-visual auxiliary device according to still other embodiments of the present application.
The reference numerals are as follows:
110-bracket, 111-first microphone, 112-second microphone, 113-third microphone, 120-lens, 121-camera, 130-playing device, 140-processor.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. In the description of the present application, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings are merely for convenience in describing the present application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present application. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more of the described features. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The following examples illustrate embodiments of the application:
as shown in fig. 1, an embodiment of the present application provides an audio-visual assistance method, including:
s1: image signals and voice signals in the environment are acquired.
The specific method comprises the following steps: acquiring an image signal in an environment by an image pickup apparatus; when the video is shot, the video can be intercepted to obtain the picture, and objects and people in the current environment are identified from the picture; when continuous pictures are shot, the objects and the people in the current environment can be identified directly through the pictures;
A voice signal in the environment is acquired through a sound receiving device. In general, all sounds in the environment are recorded by the sound receiving device, and the voice signal is obtained after processing. In the present application, the image pickup device may be one or more cameras or video cameras, and the sound receiving device includes a microphone.
S2: the speech signal is converted into text information.
The specific method comprises the following steps:
identifying the content of the voice signal and converting the content into text information; the text information and the voice signal may be in the same language or in different languages. It can be understood that the content of a person's speech can be recognized by semantic-recognition software and converted into text for display. Note that the output text information may be obtained by translating the voice signal: the voice signal is first converted into text in the same language, and that text is then translated into text in another language. This can assist two parties who speak different languages in communicating, greatly improving the communication ability of people with audio-visual impairments and strengthening their confidence in communication.
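The conversion-and-optional-translation flow described above can be sketched as a small pipeline with pluggable back ends. This is a minimal Python sketch only: the `recognize`/`translate` callables and the stub implementations are assumptions, since the application names no concrete speech-recognition or translation engine.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Caption:
    text: str       # recognized or translated text
    language: str   # language tag, e.g. "zh" or "en"

def speech_to_caption(
    audio: bytes,
    recognize: Callable[[bytes], Caption],
    translate: Optional[Callable[[str, str], str]] = None,
    target_language: Optional[str] = None,
) -> Caption:
    """Convert a voice signal to text; optionally translate it afterwards.

    Mirrors the described order: the voice signal is first converted into
    text in the speaker's own language, and that text is then translated
    when the two parties speak different languages.
    """
    caption = recognize(audio)
    if translate is not None and target_language and target_language != caption.language:
        caption = Caption(translate(caption.text, target_language), target_language)
    return caption

# Stub back ends standing in for real ASR/MT engines (illustrative only).
def stub_recognize(audio: bytes) -> Caption:
    return Caption("hello", "en")

def stub_translate(text: str, target: str) -> str:
    return {"hello": "你好"}[text] if target == "zh" else text
```

In a real deployment the stubs would be replaced by an ASR engine and a machine-translation service; recognizing first and translating the resulting text follows the description above.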
S3: the image signals are identified, and the sound source coordinates of the voice signals are determined by combining the voice signals.
The specific method is as follows:
acquiring face features and/or lip features from the image signal; judging, from changes in the face features and/or lip features, whether a person in the image signal is speaking, i.e., determining the sound source; and time-registering the voice signal with the image signal so that the sound source in the image signal can be identified according to the intensity of the voice signal. It can be understood that, to better match the text information to the speaker, the image pickup device acquires the face or lip features of the person talking; because each person speaks differently, the degree of change in the face and lips differs, so these features allow a reliable judgment of whether a captured person is speaking. To avoid attributing speech to the wrong person when several people are present, the times of the acquired image signal and voice signal are matched, so the speaker can be determined by the principle of sound-picture synchronization. In a multi-person conversation, each speaker is at a different distance, and that distance can be judged from the intensity of the voice, which further helps synchronize sound and picture. After the sound source is identified, a two-dimensional or three-dimensional coordinate system is constructed in the image signal; generally, a fixed point in the image signal or the center of the projected picture may be selected as the reference point. The coordinates of the sound source are then obtained from this coordinate system, and the sound source can be represented by those coordinates.
S4: and displaying the text information based on the sound source coordinates.
The specific method comprises the following steps:
projecting a display area into the environment through an augmented reality tool, and projecting the text information into the display area; the display area is placed close to the sound source so that a human eye wearing the augmented reality tool can acquire it. It can be appreciated that, when the text information is projected into the environment, projecting it into a dedicated display area makes it easier to read: the text is concentrated in one place, so the user can quickly read the displayed content. Once the sound source, i.e., the speaker, is identified, the display area can be placed near the speaker, making it convenient to read the speech content while looking at the speaker. Note that the display area is generally projected in the environment close to the speaker's lips or head, and it does not overlap the speaker's body unless the user is very close to the speaker, which avoids an awkward impression during conversation.
In an embodiment of the present application, the audio-visual auxiliary method further includes performing noise reduction processing on the voice signal, and playing the voice signal in a speaker.
The specific method comprises the following steps:
after noise reduction processing, the noise of the environment is filtered out of the acquired voice signal and the human-voice part is retained; the human-voice part is then fed to a speaker for playing. It will be appreciated that most everyday scenes, indoors and outdoors, contain a great deal of noise, both natural and artificial. The sound receiving device cannot distinguish among sounds as it records them, so rather than trying to capture only the desired sound, all sound is captured and then screened. Noise reduction is the process of filtering out the noise portion so as to bring out the human-voice portion. The filtered human voice can be fed directly to the speaker for playing, where it can be picked up by the human ear.
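The filtering idea can be illustrated with a crude frame-level energy gate. This is only a sketch: the application does not specify a noise-reduction algorithm, and a real hearing device would use spectral subtraction or a learned denoiser instead.

```python
def noise_gate(frames, noise_floor_ratio=2.0):
    """Crude noise gate standing in for the noise-reduction step.

    Frames whose mean energy falls below a multiple of the estimated noise
    floor (taken here as the quietest frame's energy) are zeroed; louder,
    presumably voiced frames are kept for playback. The ratio is an
    illustrative assumption.
    """
    energies = [sum(s * s for s in frame) / len(frame) for frame in frames]
    floor = min(energies)
    threshold = floor * noise_floor_ratio
    return [frame if e > threshold else [0.0] * len(frame)
            for frame, e in zip(frames, energies)]

quiet = [0.01, -0.01, 0.02]   # background-noise frame
loud = [0.5, -0.6, 0.55]      # voiced frame
cleaned = noise_gate([quiet, loud, quiet])
```

The quiet frames are silenced while the voiced frame passes through unchanged, which is the behavior the step above requires before the voice is sent to the speaker.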
In the embodiment of the application, the playing modes include amplified playing and translated playing: amplified playing amplifies the human-voice part and plays it, while translated playing translates the content of the human-voice part into a different language and reads it aloud. Amplified playing boosts the noise-reduced human voice; this helps hearing-impaired people hear conversations directed at them, or hear the sound of a television program. Translated playing translates the noise-reduced human-voice content and reads out the translation. Generally, voice-broadcast software or an artificial-intelligence component is built into the device and reads out the content once it is obtained, similar to the voice announcements in mobile-phone navigation. This helps people who speak different languages understand each other more clearly. Note that amplification and translation can also be performed together, i.e., the translated content is played back amplified, helping more people communicate with others and avoiding ambiguity.
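The two playing modes can be sketched as two small functions; the function names, the default gain, and the stub translator are assumptions, and text-to-speech synthesis for translated playing is not modeled.

```python
def amplify(samples, gain=2.0):
    """Amplified playing: scale the noise-reduced human-voice samples."""
    return [s * gain for s in samples]

def translated_playback(text, target, translate):
    """Translated playing: translate the recognized content, to be read
    aloud by a TTS engine (not modeled). Combining the modes means
    amplifying the synthesized speech of the translation afterwards."""
    return translate(text, target)
```

A usage example with a stub translator: `translated_playback("hello", "zh", lambda t, lang: {"hello": "你好"}[t])` yields the text the device would read aloud.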
In the embodiment of the application, before the text information is displayed based on the sound source coordinates, the method further includes identifying the type of the sound source; the types include real persons and non-real persons. Specifically, daily life involves not only communication with people but also entertainment such as watching television, using a mobile phone, or listening to the radio. Since scenes such as talking with a person and watching television differ, the current scene needs to be confirmed from the pictures captured by the image pickup device so that it can be handled differently.
In the embodiment of the application, the shape and position of the display area change with the type of the sound source: when the sound source is a real person, the display area is a speech bubble pointing at that person; when the sound source is not a real person, the display area is rectangular, circular, elliptical, or triangular. It will be appreciated that communication with a real person requires not only real-time interaction but also mutual respect, so the user should preferably look directly at the other party's eyes or face. The display area can therefore be projected near the speaker's head and, when there are several speakers, pointed at the current speaker, so the user can intuitively see who said what. During entertainment, the display area can sit close to the sound source, such as a television, once the relevant voice information is received, and can be placed where it is easy to see, such as the bottom of the television screen or the top of a radio; when the user is not looking at the sound source, the display area can be projected to a blank or central position, making the text content easy to read.
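The shape-and-position rule above can be sketched as a small planner. The pixel offsets and the fallback centre position are illustrative assumptions; only the shape choice (bubble for a real person, rectangle otherwise) comes from the embodiment.

```python
def plan_display_area(source_type, source_xy, user_looking_at_source=True,
                      offset=(60, -40)):
    """Choose the caption area's shape and anchor from the sound-source type.

    A speech bubble points at a real person and sits beside the head so it
    does not cover the speaker; a rectangle sits near a non-human source
    such as a television, or at a central position when the user is not
    looking at the source.
    """
    x, y = source_xy
    if source_type == "real_person":
        return {"shape": "bubble",
                "anchor": (x + offset[0], y + offset[1]),
                "points_to": source_xy}
    anchor = (x, y + abs(offset[1])) if user_looking_at_source else (0, 0)
    return {"shape": "rectangle", "anchor": anchor, "points_to": None}
```

For a speaker detected at (100, 200) the planner yields a bubble anchored beside the head and pointing back at the speaker; for a television it yields a rectangle below the source, or at the origin-centred fallback when the user looks away.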
As shown in fig. 2, an embodiment of the present application further provides an audio-visual auxiliary system, including:
the image acquisition module is used for acquiring image signals in the environment; the image acquisition module can be a camera, and the image signals in the environment are shot and acquired through the camera; wherein the image signal comprises video and/or successive pictures.
The voice acquisition module is used for acquiring voice signals in the environment; the voice acquisition module comprises a microphone, and ambient sound surrounding the user is received by the microphone for processing.
The voice processing module is used for carrying out noise reduction processing on the voice signals; and carrying out noise reduction processing on the acquired voice signals, filtering noise parts in the environment, and retaining human voice parts in the voice signals.
The speaker module is used for playing the voice signal; the playing modes include amplified playing and translated playing, where amplified playing amplifies and plays the human-voice part, and translated playing translates the content of the human-voice part into a different language and reads it aloud.
The text conversion module is used for converting the voice signal into text information: after the content of the voice signal is recognized, it is converted into text information.
The projection module is used for projecting the text information into the environment for display. The projection module comprises an augmented reality tool, through which the text information is projected into the environment so that it can be seen by a user wearing the augmented reality tool; the text information may be in the same language as the voice signal or in a different one.
The recognition judging module is used for recognizing the image signal and acquiring the sound source of the voice signal: face features and/or lip features in the image signal are acquired; whether a person in the image signal is speaking is judged from changes in the face features and/or lip features; and the voice signal and the image signal are time-aligned, so that the sound source in the image signal can be identified according to the intensity of the voice signal.
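A minimal sketch of this judging step, assuming per-frame lip-motion scores for each detected face and a time-aligned per-frame voice intensity. The data shapes, the `locate_sound_source` name, and the intensity threshold are illustrative assumptions:

```python
def locate_sound_source(lip_motion, voice_intensity, threshold=0.5):
    """Return the face whose lip movement best coincides with the frames
    where the time-aligned voice signal is loud, or None if nobody speaks.

    lip_motion:      {person_id: [per-frame lip-motion score, ...]}
    voice_intensity: [per-frame voice intensity, ...] (same timeline)
    """
    speaking = [i for i, v in enumerate(voice_intensity) if v > threshold]
    if not speaking:
        return None
    best, best_score = None, 0.0
    for person, motion in lip_motion.items():
        # Sum each face's lip motion over the loud frames only.
        score = sum(motion[i] for i in speaking if i < len(motion))
        if score > best_score:
            best, best_score = person, score
    return best
```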
The matching module is used for matching the text information with a sound source in the environment: a display area is projected into the environment through the augmented reality tool, the text information is projected into the display area, and the display area is placed close to the sound source so that it can be seen by a user wearing the augmented reality tool.
As shown in fig. 3 to 4, an embodiment of the present application further provides an audio-visual auxiliary apparatus, which includes a projection device and a playing device 130. The projection device comprises a lens 120 and a bracket 110; the bracket 110 is arranged at two sides of the lens 120 and connects to and fixes the lens 120. The projection device is provided with a camera 121 facing the same direction as the lens 120, as well as a first sound receiving tool; the playing device 130 is provided with a second sound receiving tool and a loudspeaker. The projection device further includes a display screen and a processor 140: the display screen is disposed in the lens 120, and the processor 140 is electrically connected to the camera 121, the first sound receiving tool, the second sound receiving tool, and the loudspeaker, respectively. It can be understood that the audio-visual auxiliary apparatus can implement the audio-visual assistance method described above: at least two cameras 121 are provided at the joints between the lens 120 and the bracket 110 and can be used to acquire the image signals in the method; the first and second sound receiving tools acquire the voice signals; the processor 140 performs the noise reduction, text conversion, image recognition, and image-text matching processing; the loudspeaker plays the voice signal; and the display screen projects the text information. By wearing the apparatus, the user can hear the surrounding voices more clearly and, when talking with someone, can see the projected text in the environment, which makes it more convenient for people with hearing impairment to communicate.
As shown in fig. 3 and 4, in the embodiment of the present application, the projection device and the playing device 130 are integrally formed, and the playing device 130 is disposed at the end of the bracket 110 away from the lens 120. It will be appreciated that the projection device is mainly used for acquiring image signals and voice signals and for projecting text information, while the playing device 130 is mainly used for acquiring voice signals and for amplifying and playing them. Placing the playing device 130 at the end remote from the lens 120 lets the user wear it near the ear and hear the sound it emits. Moreover, since the integrally formed projection device and playing device 130 need not be stored separately, the situation where the two cannot be used together because one of them has been lost is avoided.
In some other embodiments of the present application, as shown in fig. 5, the projection device and the playing device 130 have a separate structure and are connected by Bluetooth. It will be appreciated that, since the projection device and the playing device 130 perform different functions, the two can cooperate after a Bluetooth connection and can also be used separately; any two devices that together fulfil the above functions can be connected and matched. This increases the connection possibilities between devices: for example, the two devices may be worn by different persons for projection and voice output respectively, or several playing devices 130 may be connected to one projection device by Bluetooth to provide voice output for multiple persons.
In some other embodiments of the present application, as shown in fig. 6, the projection device and the playing device 130 have a separate structure: the playing device 130 is detachably connected to the end of the bracket 110 away from the lens 120, and is connected to the projection device by Bluetooth or a wire. It will be appreciated that the detachable connection makes it convenient to use the projection device and the playing device 130 in different scenarios: when connected, the two can be used together for text projection and voice playback; when separated, each can be used independently, meeting the requirements of different scenes and different users.
As shown in figs. 3 to 6, in the embodiment of the present application, the first sound receiving tool includes a first microphone 111, a second microphone 112 and a third microphone 113. The first microphone 111 is disposed at the end of the bracket 110 near the lens 120 and faces the same direction as the camera 121, i.e., forward; the second microphone 112 is disposed at the middle of the bracket 110 and faces the two sides of the projection device; the third microphone 113 is disposed at the end of the bracket 110 away from the lens 120 and faces the direction opposite to the first microphone 111. It will be appreciated that sound source identification requires determining the speaker's position from the sound, and increasing the number of sound receiving tools makes the sound levels easier to compare. Accordingly, a first microphone 111, which better receives sound from the front, is provided at the lens end of each of the two brackets 110; the second microphones 112 at the middle of the brackets 110 collect sound from the user's two sides; and the third microphones 113 at the far ends of the brackets 110 collect sound from behind the user. Together with the second sound receiving tool of the playing device 130, the eight microphones distributed around the user allow the direction of a sound to be judged more accurately, help the camera 121 capture the sound source more easily, and facilitate projecting the text next to the matching sound source on the display screen.
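As a rough illustration of how per-microphone loudness can indicate direction, the sketch below takes an intensity-weighted circular mean of assumed microphone bearings. The microphone names and angles are assumptions loosely matching the front/side/rear placement above; a real device would use arrival-time differences rather than loudness alone.

```python
import math

# Assumed bearings (degrees, 0 = straight ahead) for the six head-worn mics.
MIC_ANGLES = {
    "front_left": 0, "front_right": 0,
    "side_left": 270, "side_right": 90,
    "rear_left": 180, "rear_right": 180,
}

def estimate_direction(levels):
    """Weight each microphone's bearing by its measured level and average
    on the unit circle, returning a bearing in degrees in [0, 360)."""
    x = sum(l * math.cos(math.radians(MIC_ANGLES[m])) for m, l in levels.items())
    y = sum(l * math.sin(math.radians(MIC_ANGLES[m])) for m, l in levels.items())
    return math.degrees(math.atan2(y, x)) % 360.0
```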
In an embodiment of the application, the projection device comprises AR glasses, and the playing device 130 comprises one or more of a hearing aid, a Bluetooth headset or a wired headset. It will be appreciated that AR glasses already include a lens 120, a bracket 110 and a display screen and can project content into the real scene, while hearing aids and Bluetooth headsets already provide Bluetooth connectivity and voice playing functions. The application combines these functions and adds feasible schemes to help people with audio-visual impairments and the elderly cope with everyday scenes of communication and entertainment, thereby improving and enriching their quality of life.
An embodiment of the application further provides an electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements any of the audio-visual assistance methods provided above. The method helps people with audio-visual impairments cope with everyday scenes of communication and entertainment, thereby improving and enriching their quality of life.
An embodiment of the application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the audio-visual assistance methods provided above. The method can amplify and broadcast the voice signal in the environment after noise reduction, improving the clarity of voice playback; at the same time, it converts the voice signal into text information matched with the image signal, so that the text can be projected into the environment. Stimulating the user's vision and hearing simultaneously makes it easier for the user to concentrate and improves speech comprehension.
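Taken together, the method's data flow can be wired as below. Each injected callable stands in for one of the modules described above; all names and signatures are hypothetical scaffolding for exposition, not an implementation claimed by the patent.

```python
def audiovisual_assist(image_signal, voice_signal, *, denoise, play,
                       speech_to_text, locate_source, project):
    """Run the pipeline of claim 1: denoise and play the audio, convert it
    to text, locate the sound source in the image, and project the text
    near the source. Every stage is an injected callable."""
    clean = denoise(voice_signal)                 # voice processing module
    play(clean)                                   # loudspeaker module
    text = speech_to_text(clean)                  # text conversion module
    coords = locate_source(image_signal, clean)   # recognition judging module
    project(text, coords)                         # projection + matching modules
    return text, coords
```

Dependency injection keeps the pipeline testable: each module can be replaced by a stub, mirroring how the system splits responsibilities across modules.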
The foregoing describes in detail the audio-visual assistance method, system, apparatus, electronic device and storage medium provided by the embodiments of the present application; specific examples have been used to illustrate the principles and implementations of the application, and the above description of the embodiments is only intended to help understand its technical solution and core ideas. Those of ordinary skill in the art will appreciate that the technical schemes described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications and substitutions do not depart from the spirit of the application.

Claims (19)

1. An audiovisual assistance method, comprising:
acquiring an image signal and a voice signal in the environment;
converting the voice signal into text information;
identifying the image signal, and determining the sound source coordinates of the voice signal by combining the voice signal;
and displaying the text information based on the sound source coordinates.
2. The audiovisual assistance method according to claim 1, wherein the method of displaying the text information based on the sound source coordinates comprises:
identifying the content of the voice signal and converting the content into text information;
according to the sound source coordinates, the text information is projected into the environment through an augmented reality tool and is close to the sound source coordinates, so that human eyes wearing the augmented reality tool can acquire the text information; wherein, the text information and the voice signal are in the same language or different languages.
3. The audiovisual assistance method of claim 1, wherein the identifying the image signal, and the determining the sound source coordinates of the voice signal in combination with the voice signal, comprises:
acquiring face features and/or lip features in the image signals;
identifying a speaker, i.e. a sound source, by means of the face features and/or the lip features;
and establishing a two-dimensional or three-dimensional coordinate system in the image signal, and representing the position of the sound source through coordinates.
4. The audiovisual assistance method of claim 1, wherein the method of displaying the text information based on the sound source coordinates further comprises:
projecting a display area in the environment through an augmented reality tool, and projecting the text information into the display area;
the display area is brought into close proximity with the sound source coordinates to enable acquisition by a human eye wearing the augmented reality tool.
5. The audiovisual assistance method of claim 4, wherein the displaying the text information based on the sound source coordinates further comprises identifying a type of the sound source; the types of sound sources include real persons and non-real persons.
6. The audiovisual assistance method of claim 5, wherein the method of displaying the text information based on the sound source coordinates further comprises:
changing the shape and position of the display area according to the type of the sound source;
when the type of the sound source is a real person, the display area is a bubble type, and the display area is directed to the real person;
when the type of the sound source is a non-real person, the display area is rectangular, circular, elliptical or triangular.
7. The audiovisual assistance method according to claim 1, wherein the method further comprises playing the voice signal through a loudspeaker after noise reduction processing, the method comprising:
acquiring an image signal in the environment by an image pickup apparatus;
acquiring a voice signal in the environment through radio equipment;
wherein the image signal comprises video and/or successive pictures;
noise reduction processing is carried out on the acquired voice signals, noise parts in the environment are filtered, and voice parts in the voice signals are reserved;
and inputting the voice part into the loudspeaker for playing.
8. The audiovisual assistance method according to claim 7, wherein the playing modes include amplified playback and translated playback, the amplified playback amplifying and playing the voice portion, and the translated playback translating the content of the voice portion into a different language and reading it aloud.
9. An audiovisual assistance system, the system comprising:
the image acquisition module is used for acquiring image signals in the environment;
the voice acquisition module is used for acquiring voice signals in the environment;
the text conversion module is used for converting the voice signal into text information;
the projection module is used for projecting the text information to the environment for display;
the identification judging module is used for identifying the image signal and determining the sound source coordinates of the voice signal by combining the voice signal;
and the matching module is used for displaying the text information based on the sound source coordinates.
10. The audiovisual assistance system of claim 9, wherein the projection module comprises an augmented reality tool for projecting the textual information into the environment for acquisition by a human eye wearing the augmented reality tool.
11. The audio-visual auxiliary system of claim 9, further comprising a speech processing module and a speaker module, said speech processing module for noise reduction of said speech signal; the loudspeaker module is used for playing the voice signal.
12. An audiovisual auxiliary device, characterized in that the device comprises a projection apparatus and a playback apparatus (130); the projection apparatus comprises a lens (120) and a bracket (110), wherein the bracket (110) is arranged on two sides of the lens (120), and the bracket (110) is connected with and fixes the lens (120); the projection apparatus is provided with a camera (121), and the camera (121) and the lens (120) face in the same direction; the projection apparatus is also provided with a first sound receiving tool; the playback apparatus (130) is provided with a second sound receiving tool and a loudspeaker;
the projection device further comprises a display screen and a processor (140), wherein the display screen is arranged in the lens (120), and the processor (140) is electrically connected with the camera (121), the first sound receiving tool, the second sound receiving tool and the loudspeaker respectively.
13. The audiovisual aid of claim 12, characterized in that the projection device is of unitary construction with the playback device (130), which is arranged at the end of the holder (110) remote from the lens (120).
14. The audiovisual auxiliary device according to claim 12, characterized in that the projection device and the playback device (130) are of a separate structure, and the playback device (130) and the projection device are connected by bluetooth.
15. The audio-visual auxiliary apparatus according to claim 12, wherein the projection device and the playing device (130) are in a split structure, the playing device (130) is detachably connected to an end of the support (110) far away from the lens (120), and the playing device (130) is connected to the projection device by bluetooth or a wire.
16. The audio-visual auxiliary device of claim 12, wherein said first sound receiving tool comprises a first microphone (111), a second microphone (112) and a third microphone (113), said first microphone (111) being provided at an end of said bracket (110) close to said lens (120); the second microphone (112) is arranged at the middle position of the bracket (110); the third microphone (113) is arranged at one end of the bracket (110) far away from the lens (120);
wherein the first microphone (111) is oriented in the same direction as the camera (121); the second microphone (112) is directed towards both sides of the projection device; the third microphone (113) is oriented opposite to the first microphone (111).
17. The audiovisual assistance device of claim 12, wherein the projection apparatus comprises AR glasses and the playback apparatus (130) comprises one or more of a hearing aid, a bluetooth headset or a wired headset.
18. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the audiovisual assistance method according to any one of claims 1 to 8 when executing the computer program.
19. A computer readable storage medium storing a computer program, which when executed by a processor implements the audiovisual assistance method according to any one of claims 1 to 8.
CN202311054741.3A 2023-08-21 2023-08-21 Audio-visual assistance method, system, device, electronic equipment and storage medium Pending CN117095591A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311054741.3A CN117095591A (en) 2023-08-21 2023-08-21 Audio-visual assistance method, system, device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN117095591A true CN117095591A (en) 2023-11-21

Family

ID=88771095




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination