CN114792393A - Visual assistance method, device and computer readable storage medium


Info

Publication number
CN114792393A
Authority
CN
China
Prior art keywords
information
image
text
video
inputting
Prior art date
Legal status
Pending
Application number
CN202110106026.4A
Other languages
Chinese (zh)
Inventor
屈杨森
Current Assignee
TCL Technology Group Co Ltd
Original Assignee
TCL Technology Group Co Ltd
Priority date
Filing date
Publication date
Application filed by TCL Technology Group Co Ltd filed Critical TCL Technology Group Co Ltd
Priority to CN202110106026.4A
Publication of CN114792393A

Classifications

    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61H - PHYSICAL THERAPY APPARATUS, e.g. DEVICES FOR LOCATING OR STIMULATING REFLEX POINTS IN THE BODY; ARTIFICIAL RESPIRATION; MASSAGE; BATHING DEVICES FOR SPECIAL THERAPEUTIC OR HYGIENIC PURPOSES OR SPECIFIC PARTS OF THE BODY
    • A61H 3/00 - Appliances for aiding patients or disabled persons to walk about
    • A61H 3/06 - Walking aids for blind persons
    • A61H 3/061 - Walking aids for blind persons with electronic detecting or guiding means
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61H - PHYSICAL THERAPY APPARATUS, e.g. DEVICES FOR LOCATING OR STIMULATING REFLEX POINTS IN THE BODY; ARTIFICIAL RESPIRATION; MASSAGE; BATHING DEVICES FOR SPECIAL THERAPEUTIC OR HYGIENIC PURPOSES OR SPECIFIC PARTS OF THE BODY
    • A61H 2201/00 - Characteristics of apparatus not provided for in the preceding codes
    • A61H 2201/50 - Control means thereof
    • A61H 2201/5023 - Interfaces to the user

Abstract

The application is applicable to the technical field of computers, and provides a visual assistance method, a visual assistance device and a computer-readable storage medium. The visual assistance method comprises the following steps: acquiring target scene data; inputting the target scene data into a preset scene description model to obtain scene description information describing the target scene; and generating voice information according to the scene description information. The visual assistance method provided by the application offers a high degree of intelligence.

Description

Visual assistance method, device and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a visual assistance method and apparatus, and a computer-readable storage medium.
Background
Visual impairment means that the structure or function of the visual organs is partially or completely impaired, so that external objects cannot be recognized, or can only be recognized with difficulty. Visually impaired people often need visual aids to assist with related activities in daily life, such as walking.
A visual aid in the conventional art mainly includes an ultrasonic transducer and an earphone. The ultrasonic transducer transmits ultrasonic pulse waves forward and receives the reflected pulse waves, which are transmitted to the earphone; a visually impaired person can sense an obstacle ahead through the change of sound in the earphone.
However, such a visual aid suffers from poor intelligence.
Disclosure of Invention
The embodiments of the present application provide a visual assistance method, a visual assistance device and a computer-readable storage medium, which can solve the problem of poor intelligence of visual aids in the prior art.
In a first aspect, an embodiment of the present application provides a visual assistance method, including:
acquiring target scene data;
inputting target scene data into a preset scene description model to obtain scene description information for describing a target scene;
and generating voice information according to the scene description information.
In a second aspect, an embodiment of the present application provides a visual assistance device, including:
the acquisition module is used for acquiring target scene data;
the description module is used for inputting the target scene data into a preset scene description model to obtain scene description information for describing a target scene;
and the voice module is used for generating voice information according to the scene description information.
In a third aspect, an embodiment of the present application provides a visual assistance device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements any one of the visual assistance methods of the first aspect described above.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and when executed by a processor, the computer program implements any one of the visual assistance methods in the first aspect.
According to the visual assistance method, the visual assistance device and the computer-readable storage medium, the target scene data are acquired and input into the preset scene description model to obtain the scene description information for describing the target scene, and the voice information is further generated according to the scene description information. The method provided by this embodiment can automatically identify and describe the objects in the target scene, without requiring the user to judge the obstacles and the like existing around, and is therefore highly intelligent and highly practical.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic structural diagram of a visual assistance device according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a visual assistance method according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a visual assistance method according to another embodiment of the present application;
FIG. 4 is a schematic flowchart of a visual assistance method according to another embodiment of the present application;
FIG. 5 is a schematic structural diagram of an image description model provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of an initial image description model structure and training process according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a video description model provided in an embodiment of the present application;
FIG. 8 is a schematic flowchart of a visual assistance method according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a feature fusion model provided in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a text-to-speech conversion model according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a visual assistance device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather mean "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The following describes the visual assistance method, apparatus, and computer-readable storage medium provided by the present application in detail with reference to the following embodiments:
visually handicapped people, i.e., blind people, are mainly classified into full-blind people and semi-blind people. The movement of the visually impaired person is restricted due to the visual impairment. The visual aid can assist the life of the visually impaired, and the visual aid includes but is not limited to reading aid, walking aid and the like. The visual aid in the conventional art requires the visually impaired to judge the orientation and approximate distance of the obstacle by listening to the change of sound with the ear. By adopting the method, the visually impaired people not only need to judge by themselves, but also cannot know what the obstacle is specifically, and the intelligence is poor. The visual assistance method, device and computer-readable storage medium provided by the embodiments of the present application aim to solve the problem.
The visual assistance method provided by the embodiment of the application can be applied to a visual assistance device or a terminal device, and fig. 1 is a schematic structural diagram of the visual assistance device provided by an embodiment. As shown in fig. 1, the visual aid 10 may include an image capturing device 101, a memory 102, a processor 103, an input unit 104, and an audio device 105. The image capture device 101 may be used to capture images or video. The memory 102 may be used to store software programs as well as data. The processor 103 executes various functional applications of the visual assistance apparatus 10 and data processing by running software programs stored in the memory 102. Optionally, the memory 102 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 104 may be used to receive instructions input by a user. Specifically, the input unit 104 may include, but is not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, a touch panel, and the like. The input unit 104 may also be a voice input device. The voice inputter may include a microphone for collecting voice data and a voice command recognizer for recognizing a voice command associated with the collected voice data.
The audio device 105 may include an audio circuit and a speaker, microphone, etc. connected to the audio circuit.
The embodiment of the present application does not limit the type and name of the visual aid or the terminal device. For example, it may be a mobile phone, a tablet computer, a wearable device (which may be a watch, a headset, glasses, a brooch, etc.), a cane, an umbrella, etc.
The following embodiments may be implemented on the visual aid 10 having the above-described hardware/software structure. The following embodiments will describe the visual assistance method provided in the embodiments of the present application by taking the visual assistance device 10 as an example.
Fig. 2 shows a schematic flow diagram of a visual assistance method provided by the present application, the method comprising:
s201, obtaining target scene data.
The target scene refers to the scene or environment where the user is currently located. The target scene data refers to data containing the content of the target scene. In this embodiment, the target scene data may be image data, video data, or the like. Alternatively, the target scene data may be stored in a memory of the visual assistance device in advance and retrieved directly when used. Optionally, the target scene data may also be obtained when needed by the user inputting a relevant instruction and the visual assistance device performing data acquisition on the target scene. Of course, the target scene data may also be obtained from a server or other devices.
S202, inputting the target scene data into a preset scene description model to obtain scene description information for describing the target scene.
The scene description model is used for describing the target scene. The input of the scene description model is target scene data, and the output is scene description information. Objects such as persons, articles and buildings included in the scene, and the association relationship or the position relationship between the objects and the like can be included in the scene description information; the scene description information may further include characteristic information of persons, articles, buildings, and the like, for example, sex, wearing, posture, and the like of the persons; shape, color, size, material, etc. of the article; height, shape, orientation, etc. of the building. The scene description information may further include information on dynamic objects, such as the motion state of a person, the moving direction of a moving object, and the like. The scene description information may be, but is not limited to, text information or data information, etc. The specific structure of the scene description model is not limited at all, and can be designed and selected according to actual requirements.
And S203, generating voice information according to the scene description information.
The scene description information describing the target scene is converted into voice information, and the voice information can further be played to the user. The user can clearly understand the situation of the target scene from the voice information, which is convenient for visually impaired users.
In this embodiment, the target scene data is acquired and input into the preset scene description model to obtain the scene description information for describing the target scene, and the voice information is further generated according to the scene description information. With this method, objects in the target scene can be automatically identified and described, and the user does not need to judge surrounding obstacles and the like by himself or herself; the method is therefore highly intelligent and highly practical.
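For illustration only, the following minimal Python sketch shows how steps S201 to S203 can be chained; capture_scene, scene_model, synthesize_speech and play_audio are hypothetical placeholder callables introduced for the example, not components disclosed by this application.

```python
# Minimal sketch of the S201-S203 pipeline; all helpers are hypothetical
# placeholders, not part of the application's disclosed implementation.

def visual_assist_once(capture_scene, scene_model, synthesize_speech, play_audio):
    """Run one acquire -> describe -> speak cycle."""
    # S201: acquire target scene data (an image or a short video clip).
    scene_data = capture_scene()

    # S202: feed the scene data into the preset scene description model to
    # obtain scene description information (here, a descriptive sentence).
    description = scene_model(scene_data)

    # S203: convert the description into voice information and play it.
    audio = synthesize_speech(description)
    play_audio(audio)
    return description


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    visual_assist_once(
        capture_scene=lambda: "frame-bytes",
        scene_model=lambda data: "a red dining table with two chairs on each side",
        synthesize_speech=lambda text: f"<audio for: {text}>",
        play_audio=print,
    )
```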
In one embodiment, the step S203 of generating the voice information according to the scene description information includes:
generating navigation information according to the scene description information; and generating voice information according to the navigation information.
Specifically, the navigation information refers to information that guides the user along a travel route. As described above, the scene description information includes the people, articles, buildings and the like contained in the scene; this information can be screened out of the scene description information, and the navigation information can be further determined from it. The navigation information is converted into voice information and played to the user, which can guide the user to move safely around obstacles, further improving the intelligence of the visual assistance and the safety of the user.
Referring to fig. 3, in an embodiment, the step of generating navigation information according to the scene description information may be implemented by:
s301, determining the information of the object matched with the information of the preset reference object in the target scene according to the scene description information to obtain the information of the target reference object.
The preset reference object refers to a pre-selected object with a fixed position that is used as a navigation reference. The object may be a building, furniture, a household appliance, or the like. The information of the preset reference object may include the shape, color and features of the reference object under various viewing angles. A list of preset reference objects and the corresponding information may be established in advance. After the scene description information is obtained, the specific information of the objects screened out of the target scene can be matched against the information in the list, the reference object matching the screened object information is determined as the target reference object, and the information of the target reference object is obtained.
For example, taking a home scene as an example, the walls, doors, dining table, sofa, television, refrigerator and the like in a home can be taken as preset reference objects in advance, and their information can be entered. The visual assistance device screens the object information out of the obtained scene description information and matches it against the preset reference object information. For example, the scene description information includes a red dining table, two chairs arranged on each of the left and right sides of the dining table, a refrigerator arranged on the left side of the dining table, and so on; this information is matched against the information of the preset reference objects, the dining table, the chairs and the refrigerator are determined as target reference objects, and the information of these target reference objects is obtained.
And S302, determining the position information of the target reference object according to the information of the target reference object.
The position information of the target reference object may be stored in advance in the form of coordinates or the like, and has a unique correspondence with the target reference object.
S303, obtaining the user position information.
Optionally, the visual assistance device may be provided with a positioning module for locating its current position in real time. Further, the position information of the user can be estimated from the relative positional relationship between the visual assistance device and the user when the device is in use.
S304, determining the traveling direction information according to the position information of the target reference object and the user position information.
Because the data collected by the visual assistance device is generally captured while the user is facing the target scene, the target scene data represents what lies in the forward direction of the user's current position (i.e., in front of the user). Therefore, the direction the user is facing can be determined from the position information of the target reference object and the position information of the user, and the traveling direction information of the user can then be determined. Alternatively, the traveling direction information of the user may be expressed as, for example, forward, backward, leftward or rightward.
S305, generating navigation information according to the traveling direction information.
Further, the navigation information is determined according to the traveling direction and the people, objects or buildings included in the target scene, so that the user can travel around obstacles.
It should be noted that conventional navigation technology cannot determine the direction the user is facing and can only infer the current traveling direction from the user's movement. For visually impaired people, the safety of every step needs to be ensured. The method provided by this embodiment determines the information of the target reference object according to the scene description information and determines the position information of the target reference object. The direction the user is facing can be determined from the position information of the target reference object and the position information of the user, and the traveling direction information of the user can then be determined accurately. Therefore, the navigation information determined according to the user's traveling direction information is safer and more accurate.
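For illustration, a minimal Python sketch of the navigation logic of steps S301 to S305 is given below; the reference-object table, the coordinate convention (user initially facing the positive x-axis, y increasing to the user's left) and the angular thresholds are assumptions introduced for the example, not values from this application.

```python
import math

# Hypothetical preset reference objects and their stored coordinates (S301/S302).
PRESET_REFERENCES = {
    "dining table": (3.0, 2.0),
    "refrigerator": (1.0, 2.5),
    "sofa": (4.0, 0.5),
}

def match_references(scene_description: str) -> dict:
    """S301: keep only the preset reference objects mentioned in the description."""
    return {name: pos for name, pos in PRESET_REFERENCES.items()
            if name in scene_description.lower()}

def heading_to(user_pos, target_pos) -> float:
    """Bearing (degrees, counterclockwise from +x) from the user to a reference."""
    dx, dy = target_pos[0] - user_pos[0], target_pos[1] - user_pos[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0

def navigation_hint(scene_description: str, user_pos) -> str:
    """S303-S305: derive a coarse traveling-direction hint."""
    matched = match_references(scene_description)           # S301/S302
    if not matched:
        return "no known reference object detected"
    name, pos = next(iter(matched.items()))
    bearing = heading_to(user_pos, pos)                      # S304
    side = "ahead" if bearing < 45 or bearing > 315 else (
        "to your left" if bearing < 180 else "to your right")
    return f"{name} is {side}; keep clear of it"             # S305

print(navigation_hint("A red dining table with two chairs", user_pos=(0.0, 0.0)))
```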
The following describes a specific process for acquiring target scene data in detail with reference to the embodiments.
Optionally, the target scene data may include at least one of a target scene image and a target scene video. Referring to fig. 4, in an embodiment, the step S201 of acquiring the target scene data may include:
s401, a visual auxiliary instruction generated by a user executing visual auxiliary operation is obtained, and the visual auxiliary instruction is an image acquisition instruction or a video acquisition instruction.
The visual auxiliary operation is an operation performed by the user on the visual auxiliary device. The visual auxiliary operation may be performed through an input unit of the visual auxiliary device, for example, through a physical keyboard, function keys, a trackball, a mouse, a joystick, or the like. The user performs a visual auxiliary operation to cause the visual auxiliary device to generate a corresponding visual auxiliary instruction. The visual auxiliary instruction is one of an image acquisition instruction and a video acquisition instruction. The image acquisition instruction is used for instructing the visual auxiliary device to acquire an image, and the video acquisition instruction is used for instructing the visual auxiliary device to acquire a video. Specifically, the visual auxiliary operation and the visual auxiliary instruction have a unique correspondence: when a user performs one preset visual auxiliary operation, an image acquisition instruction is generated, and when the user performs another preset visual auxiliary operation, a video acquisition instruction is generated. The specific content of the visual auxiliary operation and the correspondence between the visual auxiliary operation and the visual auxiliary instruction can be set according to actual requirements, which is not limited in this application.
And S402, if the visual auxiliary instruction is an image acquisition instruction, starting image acquisition equipment to shoot images to obtain a target scene image.
When the instruction received by the processor of the visual auxiliary device is an image acquisition instruction, the image acquisition equipment is controlled to start an image shooting function, and the surrounding environment is shot to obtain a target scene image.
And S403, if the visual auxiliary instruction is a video acquisition instruction, starting image acquisition equipment to perform video recording to obtain a target scene video.
When the instruction received by the processor of the visual auxiliary device is a video acquisition instruction, the image acquisition equipment is controlled to start a video recording function, the surrounding environment is recorded, and after the preset time length is recorded, the target scene video is obtained.
In this embodiment, the visual auxiliary instruction generated by the user performing a visual auxiliary operation is acquired; when the visual auxiliary instruction is an image acquisition instruction, an image is taken, and when the visual auxiliary instruction is a video acquisition instruction, a video is recorded; scene description information is obtained from the acquired image and/or video, and voice information is generated according to the scene description information. The method provided by this embodiment can perform visual assistance through both image description and video description, can identify and describe static objects and dynamic objects in the surrounding environment, is more flexible to use, offers more choice, and can improve the effect of visual assistance.
The following embodiments further describe in detail how to obtain the visual auxiliary instruction generated by the user performing a visual auxiliary operation:
in one embodiment, S401 includes:
and monitoring the touch times of the user to the input key, and generating a visual auxiliary instruction according to the touch times of the user to the input key within a preset time length. Specifically, if the number of times of touch control of the input key by the user in a preset time period meets a first condition, an image acquisition instruction is generated; and if the touch times of the user to the input key within the preset time length meet a second condition, generating a video acquisition instruction.
The input key is used for inputting the visual auxiliary instruction, and may be a toggle key, a rotary key, a push key, a piano-type key or the like. In this embodiment, the visual auxiliary operations corresponding to the image acquisition instruction and the video acquisition instruction are both performed through the input key, and the two operations are distinguished by the number of times the input key is operated. Specifically, a correspondence between the number of touches and the visual auxiliary instruction is preset. For example, if the number of touches of the input key within 1 s is less than 2, an image acquisition instruction is generated; if the number of touches of the input key within 1 s is 2 or more, a video acquisition instruction is generated. When the user presses the input key once within 1 s, the visual auxiliary device generates an image acquisition instruction and starts the image acquisition equipment to take an image. When the user presses the input key 2 or more times within 1 s, the visual auxiliary device generates a video acquisition instruction. In this embodiment, the corresponding visual auxiliary instruction is generated by monitoring the number of times the user touches the input key; this method can complete the two operations with only one input key, is simple and convenient to operate, is easier for visually impaired users to operate, and has strong practicability.
In one embodiment, S401 includes:
and monitoring the touch duration of the user to the input key, and generating a visual auxiliary instruction according to the touch duration of the user to the input key. Specifically, if the touch duration of the user on the input key is longer than a first duration and shorter than a second duration, an image acquisition instruction is generated; and if the touch duration of the user on the input key is greater than or equal to the second duration, generating a video acquisition instruction.
In this embodiment, the visual auxiliary operations corresponding to the image acquisition instruction and the video acquisition instruction are both performed through the input key, and the two operations are distinguished by how long the input key is held. Specifically, a correspondence between the touch duration and the visual auxiliary instruction is preset. For example, if the touch duration of the input key is greater than 1 s and less than 3 s, an image acquisition instruction is generated; if the touch duration of the input key is greater than or equal to 3 s, a video acquisition instruction is generated. When the user presses and holds the input key for more than 1 s but less than 3 s, the visual auxiliary device generates an image acquisition instruction and starts the image acquisition equipment to take an image. When the user presses and holds the input key for 3 s or longer, the visual auxiliary device generates a video acquisition instruction. In this embodiment, the corresponding visual auxiliary instruction is generated by monitoring how long the user touches the input key; this method can complete the two operations with only one input key, is simple and convenient to operate, is easier for visually impaired users to operate, and has strong practicability.
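The two key-based embodiments above can be summarized in the following minimal Python sketch; the thresholds follow the example values in the text, while the function and constant names are hypothetical.

```python
from typing import Optional

# Hypothetical helper names; thresholds follow the example values in the text.
IMAGE_ACQUISITION, VIDEO_ACQUISITION = "image_acquisition", "video_acquisition"

def instruction_from_press_count(presses_within_1s: int) -> str:
    """Count-based variant: fewer than 2 presses within 1 s yields an image
    acquisition instruction, 2 or more presses a video acquisition instruction."""
    return IMAGE_ACQUISITION if presses_within_1s < 2 else VIDEO_ACQUISITION

def instruction_from_press_duration(hold_seconds: float) -> Optional[str]:
    """Duration-based variant: a hold longer than 1 s but shorter than 3 s
    yields an image acquisition instruction, 3 s or longer a video one."""
    if 1.0 < hold_seconds < 3.0:
        return IMAGE_ACQUISITION
    if hold_seconds >= 3.0:
        return VIDEO_ACQUISITION
    return None  # too short: treated as an accidental touch and ignored

print(instruction_from_press_count(1))        # image_acquisition
print(instruction_from_press_duration(3.5))   # video_acquisition
```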
It is understood that, in an embodiment, two keys with different structures may be provided for inputting the image acquisition instruction and the video acquisition instruction respectively. For example, a triangular key and a square key may be provided for inputting the image acquisition instruction and the video acquisition instruction, respectively. Alternatively, a key with a smooth surface and a key with a raised dot on its surface may be provided for inputting the image acquisition instruction and the video acquisition instruction, respectively. By providing keys with different structures corresponding to different visual auxiliary instructions, a visually impaired user can find the corresponding key by touch and complete the input, which improves practicability.
In other embodiments, the input of different visual auxiliary instructions may also be achieved by inputting different gesture actions on the touch panel. For example, if the user draws a circle on the touch panel, an image acquisition instruction is generated; if the user draws a vertical line on the touch panel, a video acquisition instruction is generated. The user can perform the visual auxiliary operation through a simple gesture and trigger the visual auxiliary device to perform the corresponding operation, which is simple and convenient.
In one possible implementation, the visual aid instruction may also be generated by performing a visual aid operation by voice. S401 comprises:
acquiring a voice instruction input by a user; and analyzing the voice command to generate a visual auxiliary command.
The user sends a voice instruction to the visual auxiliary device, a microphone of the visual auxiliary device picks up the voice instruction, the voice recognizer conducts voice recognition, the content in the voice instruction is analyzed, whether the information contained in the voice instruction is image acquisition or video acquisition is judged, and then the processor generates a corresponding instruction. The user inputs instructions through voice, and the intelligence and the convenience of visual assistance are further improved.
A specific process of inputting target scene data into a preset scene description model to obtain scene description information for describing a target scene is described in detail below with reference to an embodiment.
In one embodiment, the target scene data includes a target scene image, and step S202 includes:
inputting a target scene Image into a preset Image description (Image description) model to obtain first text information for describing the content of the target scene Image.
The image acquisition equipment of the visual auxiliary device transmits the target scene image to the processor, and the processor inputs the target scene image into the image description model for processing and then outputs first text information, where the first text information is a sentence describing the content of the target scene image. The information in the first text information includes, but is not limited to, the visual objects in the image, the positional relationships between the visual objects, the states of the visual objects, and the like. For example, there is a table in the image, on which a cup is placed. The image description model can be a fully supervised model, a semi-supervised model or an unsupervised model. The specific structure of the image description model can be designed according to actual requirements, and the application does not limit this structure in any way as long as its function can be realized. According to the method provided by this embodiment, image description is carried out through the image description model, which simplifies the recognition and description algorithm, further improves intelligence, can improve the accuracy of recognition and description, and further improves the visual assistance effect.
The following describes the image description model and the specific process of using the image description model to perform image description with reference to the embodiments:
in one embodiment, the Image description model is an unsupervised Image Caption Image description model. Fig. 5 is a schematic structural diagram of an image description model in an embodiment, as shown in fig. 5, the image description model includes an image encoder, a sentence Generator (Generator), and a Discriminator (Discriminator). Among them, the image encoder may employ a Convolutional Neural Network (CNN). The sentence generator and the discriminator may employ a Long Short-Term Memory network (LSTM).
Inputting a target scene image into a preset image description model to obtain first text information for describing the content of the target scene image, and specifically comprising the following steps: inputting the target scene image into an image encoder for encoding to generate image characteristics of the target scene image; inputting the image characteristics into a sentence generator to generate an image description sentence; and inputting the image description sentence into a discriminator for discrimination, and inputting the discrimination result into a sentence generator for image and sentence reconstruction to generate first text information.
The image encoder encodes the target scene image to generate image features of the target scene image. The image features are further input to a sentence generator, which generates an image description sentence based on the image features. The discriminator is used to distinguish whether a sentence is generated by the model or from a corpus of sentences. The sentence generator is coupled to the discriminator in a different order to perform image and sentence reconstruction to generate the first textual information.
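As a rough illustration of the encoder/generator wiring described above (the discriminator and the reconstruction coupling are omitted for brevity), the following PyTorch sketch uses assumed layer sizes and a simple greedy decoding loop; it is a sketch under those assumptions, not the actual model of this application.

```python
import torch
import torch.nn as nn

# Schematic CNN encoder + LSTM generator; sizes and decoding are illustrative.

class ImageEncoder(nn.Module):              # CNN image encoder
    def __init__(self, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))

    def forward(self, img):
        return self.cnn(img)                # (batch, feat_dim) image feature

class SentenceGenerator(nn.Module):         # LSTM sentence generator
    def __init__(self, vocab=1000, feat_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.embed = nn.Embedding(vocab, feat_dim)
        self.head = nn.Linear(feat_dim, vocab)

    def forward(self, feat, max_len=12):
        # Greedy decoding: the image feature is fed as the first input step.
        tokens, inp, state = [], feat.unsqueeze(1), None
        for _ in range(max_len):
            out, state = self.lstm(inp, state)
            word = self.head(out[:, -1]).argmax(-1)     # (batch,)
            tokens.append(word)
            inp = self.embed(word).unsqueeze(1)
        return torch.stack(tokens, dim=1)    # generated word ids

encoder, generator = ImageEncoder(), SentenceGenerator()
feature = encoder(torch.randn(1, 3, 224, 224))
print(generator(feature).shape)              # torch.Size([1, 12])
```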
Fig. 6 is a schematic diagram of an initial image description model structure and training process in an embodiment, and this embodiment relates to a possible implementation of training the initial image description model to generate the image description model. Specifically, an initial image description model is established, which includes an image encoder, a sentence generator, a discriminator, and a visual concept detector. A sentence corpus is also established, a plurality of sample images are input into the initial image description model for training based on the sentence corpus, and three targets are established for joint training to generate the unsupervised image description model. The three targets include an adversarial target, a visual concept target, and an image-sentence reconstruction target.
Specifically, regarding the adversarial target: the sample image is input into the image encoder to obtain image features, and the sentence generator generates an image description sentence based on the image features and calculates the adversarial loss. The discriminator determines, based on the image description sentence, whether the sentence was generated by the model or came from the sentence corpus, and calculates the adversarial loss. Training is carried out based on the adversarial loss and the adversarial loss is minimized, so that the sentence generator gradually learns to generate sentences close to real ones.
Regarding the visual concept target: the sample image is input into the image encoder to obtain image features, and the sentence generator generates an image description sentence based on the image features. The image description sentence is input into the visual concept detector, which judges whether the words in the generated sentence actually appear in the image and calculates the visual concept loss. The words present in the image are distilled into the image description model through the visual concept detector, so that the image description model learns to identify the visual concepts in the sample image and integrate them into the generated sentence.
Regarding the image-sentence reconstruction target: the sentence generator generates an image description sentence based on the image features and calculates the image reconstruction loss. At this time, the discriminator, acting as a sentence encoder, generates sentence features based on the image description sentence. Meanwhile, the discriminator uses the sentence features as image features to calculate the image reconstruction loss. The discriminator then feeds the generated sentence features back into the sentence generator, which generates an image description sentence based on the sentence features and calculates a sentence reconstruction loss. The sentence generator and the discriminator are coupled in different orders to perform image and sentence reconstruction; the sentence generator is trained by policy gradient and updated by gradient descent so as to minimize the sentence reconstruction loss. For the discriminator, its parameters are updated with the adversarial loss and the image reconstruction loss by gradient descent.
In this embodiment, the adversarial loss, the visual concept loss, and the reconstruction losses are generated by the three targets; a first joint loss is formed from the adversarial loss and the image reconstruction loss, and a second joint loss is formed from the visual concept loss, the image reconstruction loss, and the adversarial loss. The image description model is generated by joint training based on these losses, so that unsupervised learning is achieved and an unsupervised image description model is obtained; the generated image description model describes images more accurately, finely and comprehensively. In addition, model training is carried out based on the sentence corpus, so that paired image-sentence data sets are no longer needed, which saves cost, and the generated model can describe images more realistically.
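For illustration, one way the individual losses could be combined into the two joint losses is sketched below; the weighting factors and the tensor values are assumptions, and the actual loss formulation of this application may differ.

```python
import torch

# Schematic combination of the losses into the two joint objectives mentioned
# above; weights and the way each individual loss is computed are illustrative.

def discriminator_joint_loss(adv_loss, img_recon_loss, w_recon=1.0):
    # First joint loss: adversarial loss + image reconstruction loss.
    return adv_loss + w_recon * img_recon_loss

def generator_joint_loss(concept_loss, img_recon_loss, adv_loss,
                         w_concept=1.0, w_recon=1.0, w_adv=1.0):
    # Second joint loss: visual concept loss + image reconstruction loss
    # + adversarial loss (the generator is additionally trained with a
    # sentence reconstruction loss, as described in the text).
    return w_concept * concept_loss + w_recon * img_recon_loss + w_adv * adv_loss

print(discriminator_joint_loss(torch.tensor(0.7), torch.tensor(0.3)))
print(generator_joint_loss(torch.tensor(0.2), torch.tensor(0.3), torch.tensor(0.7)))
```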
In one embodiment, the target scene data includes a target scene video, and step S202 includes:
and inputting the target scene video into a preset video description model to obtain second text information for describing the content of the target scene video.
The image acquisition equipment of the visual auxiliary device transmits the target scene video to the processor, and the processor inputs the target scene video into the video description model for processing and then outputs second text information, where the second text information is a sentence describing the content of the target scene video. The information in the second text information includes, but is not limited to, the visual objects, the positional relationships between the visual objects, the motion states of the visual objects, and the like. For example, there is a basketball stand in the video, and a person is performing a three-step layup. The video description model can be a fully supervised model, a semi-supervised model or an unsupervised model. The specific structure of the video description model can be designed according to actual requirements, and the present application does not limit this structure in any way as long as its functions can be realized. According to this method, video description is carried out through the video description model, which simplifies the recognition and description algorithm, further improves intelligence, can improve the accuracy of recognition and description, and further improves the visual assistance effect.
The following further describes the video description model and a specific process for applying the video description model to perform video description with reference to the embodiment:
fig. 7 is a schematic structural diagram of a video description model in an embodiment, and as shown in fig. 7, the video description model adopts an encoding-decoding (Encoder-Decoder) manner, and includes a video Encoder, a video Decoder, and a reconstructor. The video encoder can adopt a convolutional neural network, and the video decoder can adopt a long-short term memory network. The convolutional neural network captures the structure of the frame image in the video to generate a semantic representation thereof, and for a given video sequence, the convolutional neural network further fuses the generated semantic representation to generate video features by utilizing the time dynamic characteristics of the video. The long-short term memory network generates sentence fragments based on the video characteristics, and the reconstructor combines the sentence fragments to form a sentence.
Specifically, inputting the target scene video into a preset video description model to obtain second text information for describing the content of the target scene video, including:
inputting each frame image of the target scene video into a video encoder to obtain image characteristics corresponding to each frame image;
inputting the image features corresponding to each frame image into a video decoder to obtain the hidden state corresponding to each frame image and an initial video description sentence;
inputting the hidden state corresponding to each frame image into a reconstructor to obtain global video characteristics and reconstruction loss;
and inputting the global video features, as the target scene video, into the video encoder and returning to the step of inputting each frame image of the target scene video into the video encoder to obtain the image features corresponding to each frame image, until the reconstruction loss is less than a preset threshold, and taking the obtained initial video description sentence as the second text information.
The target scene video comprises a plurality of continuous frame images, and the video encoder encodes each frame image respectively to obtain the image features corresponding to each frame image. The video decoder decodes the frame images in time order to generate the hidden state corresponding to each frame image, and obtains an initial video description sentence based on these hidden states. The hidden states corresponding to the frame images are pooled by an average pooling layer, and sentence reconstruction is performed by the reconstructor to generate the reconstructed image features corresponding to each frame image. The reconstructed image features are then average-pooled to obtain the global video features. The global video features are returned to the video encoder for encoding again, and the above processes are repeated until the reconstruction loss is less than the preset threshold, at which point the obtained initial video description sentence is the second text information.
The training of the video description model refers to the above process, and in the training process, the long-short term memory network realizes the reconstruction of the video characteristic sequence based on the hidden state sequence of the video encoder by using backward flow. Specifically, a video encoder based on a convolutional neural network extracts semantic representation of a frame image, and a decoder based on a long-short term memory network generates natural language description aiming at visual contents. The reconstructor reproduces the frame characteristics using a backward flow from the description to the visual content. And training to obtain a video description model, wherein the model converts a video into video characteristics and then into a video description sentence, and further obtains the required text information.
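A schematic PyTorch sketch of the encoder-decoder-reconstructor structure described above is given below; the linear frame encoder standing in for the CNN, the mean-pooling reconstructor, the greedy word head and all dimensions are illustrative assumptions rather than the disclosed model.

```python
import torch
import torch.nn as nn

# Schematic encoder-decoder-reconstructor wiring; all components are stand-ins.

class VideoDescriptionSketch(nn.Module):
    def __init__(self, frame_dim=512, hidden=256, vocab=1000):
        super().__init__()
        self.frame_encoder = nn.Linear(frame_dim, hidden)    # stands in for the CNN encoder
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.word_head = nn.Linear(hidden, vocab)
        self.reconstructor = nn.Linear(hidden, hidden)        # hidden state -> frame feature

    def forward(self, frames):                                # frames: (B, T, frame_dim)
        feats = torch.relu(self.frame_encoder(frames))        # per-frame features
        hidden_states, _ = self.decoder(feats)                # (B, T, hidden)
        words = self.word_head(hidden_states).argmax(-1)      # initial description tokens
        recon = self.reconstructor(hidden_states)             # reconstructed frame features
        global_feat = recon.mean(dim=1)                       # average pooling -> global video feature
        recon_loss = torch.mean((recon - feats) ** 2)         # reconstruction loss
        return words, global_feat, recon_loss

model = VideoDescriptionSketch()
words, global_feat, loss = model(torch.randn(2, 8, 512))
print(words.shape, global_feat.shape, float(loss))
```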
In one embodiment, after the first text information and/or the second text information is obtained, text-to-speech conversion is performed on the obtained text information to obtain the voice information. Specifically, the text-to-speech conversion may be performed on one piece of first text information to obtain the voice information, on one piece of second text information to obtain the voice information, or on a plurality of pieces of text information simultaneously to obtain the voice information.
Text-to-speech conversion refers to converting text information into voice information that can be played. The visual assistance device converts the obtained text information into the corresponding voice information, which is played to the user through the audio device.
It can be understood that the user may perform a visual assistance operation once to obtain one piece of first text information or second text information, on which text-to-speech conversion is performed to obtain the voice information. The user may also perform the visual assistance operation multiple times; the visual assistance device repeats the above steps to obtain a plurality of pieces of text information and converts them to obtain the voice information. The visual assistance operation performed by the user each time may be the operation corresponding to the image acquisition instruction or the operation corresponding to the video acquisition instruction, and the plurality of pieces of text information obtained may be a plurality of pieces of first text information, a plurality of pieces of second text information, or at least one piece of first text information and at least one piece of second text information. Integrating the text information obtained from multiple executions for text-to-speech conversion makes the obtained voice information more accurate. Moreover, by comprehensively considering the first text information and the second text information, the obtained voice information can effectively convey the state of dynamically changing objects in the surroundings, further improving the visual assistance effect.
In one embodiment, the target scene data includes both the target scene image and the target scene video, resulting in the first textual information and the second textual information. The first text information and the second text information can be subjected to fusion processing and converted into voice information. As shown in fig. 8, step S203 includes:
s801, performing feature fusion on the first text information and the second text information to obtain fusion text information.
Optionally, the first text information and the second text information may be input into a preset feature fusion model to obtain fused text information. Fig. 9 is a schematic structural diagram of a feature fusion model in an embodiment, as shown in fig. 9, the feature fusion model includes a first text encoder, a second text encoder, a feature fusion module, and a text decoder. Inputting the first text information into a first text encoder to obtain a first text characteristic; inputting the second text information into a second text encoder to obtain a second text characteristic; performing feature fusion on the first text feature and the second text feature to obtain a fusion feature; and inputting the fusion characteristics into a text decoder to obtain the fusion text information.
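For illustration, the two-encoder fusion structure of step S801 can be sketched as follows in PyTorch; the bag-of-words text encoders, the concatenation-based fusion, the token ids and all dimensions are assumptions for the example rather than the disclosed implementation.

```python
import torch
import torch.nn as nn

# Schematic version of the two-encoder feature fusion model; everything here
# is an illustrative stand-in.

class FeatureFusionSketch(nn.Module):
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.enc_image_text = nn.EmbeddingBag(vocab, dim)   # first text encoder
        self.enc_video_text = nn.EmbeddingBag(vocab, dim)   # second text encoder
        self.fuse = nn.Linear(2 * dim, dim)                 # feature fusion module
        self.decode = nn.Linear(dim, vocab)                 # text decoder head

    def forward(self, image_text_ids, video_text_ids):
        f1 = self.enc_image_text(image_text_ids)            # first text feature
        f2 = self.enc_video_text(video_text_ids)            # second text feature
        fused = torch.relu(self.fuse(torch.cat([f1, f2], dim=-1)))
        return self.decode(fused)                           # logits of the fused description

model = FeatureFusionSketch()
logits = model(torch.randint(0, 1000, (1, 6)), torch.randint(0, 1000, (1, 9)))
print(logits.shape)                                          # torch.Size([1, 1000])
```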
S802, inputting the fused text information into a preset text-to-speech conversion model to obtain speech information.
Specifically, as shown in fig. 10, fig. 10 is a schematic structural diagram of the text-to-speech conversion model in one embodiment. The Text-to-speech conversion model adopts Text-to-speech (TTS) technology and comprises an encoder, a decoder and a waveform network (WaveNet) module. And the encoder encodes the input fusion text information to generate text characteristic information. The decoder decodes the text characteristic information to generate Mel frequency spectrum information, and the waveform network module processes the Mel frequency spectrum information and outputs an audio waveform to obtain voice information.
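For illustration, a toy PyTorch sketch mirroring the encoder / decoder / waveform-network split of the text-to-speech conversion model is given below; the GRU decoder, the single-convolution vocoder stand-in and all dimensions are assumptions and do not represent the actual TTS system of this application.

```python
import torch
import torch.nn as nn

# Schematic text -> mel spectrogram -> waveform pipeline; all parts are stand-ins.

class TTSSketch(nn.Module):
    def __init__(self, vocab=100, dim=128, n_mels=80):
        super().__init__()
        self.encoder = nn.Embedding(vocab, dim)               # text -> text feature information
        self.decoder = nn.GRU(dim, dim, batch_first=True)     # text features -> acoustic states
        self.to_mel = nn.Linear(dim, n_mels)                  # acoustic states -> mel spectrogram
        self.vocoder = nn.Conv1d(n_mels, 1, kernel_size=3, padding=1)  # WaveNet stand-in

    def forward(self, char_ids):                               # (B, T) character ids
        text_feat = self.encoder(char_ids)                     # (B, T, dim)
        states, _ = self.decoder(text_feat)                    # (B, T, dim)
        mel = self.to_mel(states)                              # (B, T, n_mels) mel spectrogram
        wave = self.vocoder(mel.transpose(1, 2))               # (B, 1, T) audio waveform
        return mel, wave.squeeze(1)

mel, wave = TTSSketch()(torch.randint(0, 100, (1, 20)))
print(mel.shape, wave.shape)    # torch.Size([1, 20, 80]) torch.Size([1, 20])
```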
In this embodiment, the first text information and the second text information are fused, and the obtained fused text information includes not only the feature information described by the image but also the feature information described by the video, so that the obtained fused text information can more comprehensively integrate the static and dynamic characteristics of the surrounding objects. Therefore, the voice information obtained after text-to-speech conversion is carried out on the fusion text information is more accurate, and the visual assistance effect is effectively improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 11 shows a block diagram of a visual assistance device provided in an embodiment of the present application, corresponding to the visual assistance method described in the above embodiment, and only shows a part related to the embodiment of the present application for convenience of description.
Referring to fig. 11, the apparatus includes:
an obtaining module 1101, configured to obtain target scene data;
a description module 1102, configured to input target scene data into a preset scene description model to obtain scene description information for describing a target scene;
and a voice module 1103, configured to generate voice information according to the scene description information.
In one embodiment, the voice module 1103 is specifically configured to generate navigation information according to the scene description information; and generating voice information according to the navigation information.
In one embodiment, the voice module 1103 is specifically configured to determine, according to the scene description information, information of an object in the target scene, which matches with information of a preset reference object, to obtain information of the target reference object; determining the position information of the target reference object according to the information of the target reference object; acquiring user position information; determining traveling direction information according to the position information of the target reference object and the position information of the user; and generating navigation information according to the traveling direction information.
In one embodiment, the description module 1102 is specifically configured to input the target scene image into an image encoder for encoding, and generate image features of the target scene image; inputting the image characteristics into a sentence generator to generate an image description sentence; and inputting the image description sentence into a discriminator for discrimination, and inputting the discrimination result into a sentence generator for image and sentence reconstruction to generate first text information.
In an embodiment, the description module 1102 is specifically configured to input each frame image of the target scene video into a video encoder to obtain the image features corresponding to each frame image; input the image features corresponding to each frame image into a video decoder to obtain the hidden state corresponding to each frame image and an initial video description sentence; input the hidden state corresponding to each frame image into a reconstructor to obtain global video features and a reconstruction loss; and input the global video features, as the target scene video, into the video encoder and return to the step of inputting each frame image of the target scene video into the video encoder to obtain the image features corresponding to each frame image, until the reconstruction loss is less than a preset threshold, taking the obtained initial video description sentence as the second text information.
In an embodiment, the speech module 1103 is specifically configured to perform feature fusion on the first text information and the second text information to obtain fused text information; and inputting the fused text information into a preset text-to-speech conversion model to obtain the speech information.
In one embodiment, the speech module 1103 is specifically configured to input the first text information into a first text encoder, so as to obtain a first text feature; inputting the second text information into a second text encoder to obtain a second text characteristic; performing feature fusion on the first text feature and the second text feature to obtain a fusion feature; and inputting the fusion characteristics into a text decoder to obtain fusion text information.
In one embodiment, the text-to-speech conversion model includes an encoder, a decoder, and a waveform network module, and the speech module 1103 is specifically configured to input the fused text information into the encoder to obtain text feature information; inputting the text characteristic information into a decoder to obtain Mel frequency spectrum information; and inputting the Mel frequency spectrum information into a waveform network module to obtain voice information.
In one embodiment, the obtaining module 1101 is specifically configured to obtain a visual auxiliary instruction generated by a user performing a visual auxiliary operation, where the visual auxiliary instruction is an image acquisition instruction or a video acquisition instruction; if the visual auxiliary instruction is an image acquisition instruction, starting image acquisition equipment to shoot images to obtain a target scene image; and if the visual auxiliary instruction is a video acquisition instruction, starting image acquisition equipment to perform video recording to obtain a target scene video.
In one embodiment, the obtaining module 1101 is specifically configured to: monitor the number of times the user touches an input key; and generate the visual auxiliary instruction according to the number of touches on the input key within a preset time period.
In an embodiment, the obtaining module 1101 is specifically configured to: monitor the duration of the user's touch on an input key; and generate the visual auxiliary instruction according to the touch duration on the input key.
In one embodiment, the obtaining module 1101 is specifically configured to: obtain a voice instruction input by the user; and parse the voice instruction to generate the visual auxiliary instruction.
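The sketch below illustrates all three triggering modes just described (touch count, touch duration, and voice instruction). The count-to-instruction mapping, the press-duration threshold, and the spoken keywords are hypothetical examples, not values specified by this application.

```python
def instruction_from_touch_count(touch_count):
    """Map the number of touches within the preset time window to an instruction."""
    # Hypothetical mapping: one touch -> image, two touches -> video.
    if touch_count == 1:
        return "image_acquisition"
    if touch_count == 2:
        return "video_acquisition"
    return None

def instruction_from_touch_duration(duration_s):
    """Map the touch duration on the input key to an instruction."""
    # Hypothetical mapping: short press -> image, long press -> video.
    return "image_acquisition" if duration_s < 1.0 else "video_acquisition"

def instruction_from_voice(transcript):
    """Very coarse keyword spotting on the recognised voice instruction."""
    text = transcript.lower()
    if "video" in text or "record" in text:
        return "video_acquisition"
    if "photo" in text or "picture" in text or "image" in text:
        return "image_acquisition"
    return None

print(instruction_from_touch_count(2))          # video_acquisition
print(instruction_from_touch_duration(0.3))     # image_acquisition
print(instruction_from_voice("take a picture")) # image_acquisition
```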
It should be noted that the information interaction and execution processes between the above devices/units are based on the same concept as the method embodiments of the present application; for their specific functions and technical effects, reference may be made to the method embodiment section, and details are not repeated here.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the division of the above functional units and modules is illustrated. In practical applications, the above functions may be distributed among different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only used to distinguish them from one another and are not intended to limit the protection scope of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here.
An embodiment of the present application further provides a terminal device, where the terminal device includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, the processor implementing the steps of any of the various method embodiments described above when executing the computer program.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application provide a computer program product, which when executed on a computer device, enables the computer device to implement the steps in the above method embodiments.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (15)

1. A visual assistance method, comprising:
acquiring target scene data;
inputting the target scene data into a preset scene description model to obtain scene description information for describing a target scene;
and generating voice information according to the scene description information.
2. The method of claim 1, wherein the generating speech information from the scene description information comprises:
generating navigation information according to the scene description information;
and generating the voice information according to the navigation information.
3. The method of claim 2, wherein the generating navigation information from the scene description information comprises:
determining, according to the scene description information, information of an object in the target scene that matches information of a preset reference object, to obtain information of the target reference object;
determining the position information of the target reference object according to the information of the target reference object;
acquiring user position information;
determining traveling direction information according to the position information of the target reference object and the position information of the user;
and generating the navigation information according to the traveling direction information.
4. The method of claim 1, wherein the target scene data comprises a target scene image, the scene description model comprises an image encoder, a sentence generator, and a discriminator, and the scene description information comprises first textual information; the step of inputting the target scene data into a preset scene description model to obtain scene description information for describing a target scene includes:
inputting the target scene image into the image encoder for encoding to generate image characteristics of the target scene image;
inputting the image characteristics into the sentence generator to generate an image description sentence;
and inputting the image description sentence into the discriminator for discrimination, and inputting the discrimination result into the sentence generator for image and sentence reconstruction to generate the first text information.
5. The method of claim 4, wherein the target scene data comprises a target scene video, the scene description model comprises a video encoder, a video decoder, and a reconstructor, and the scene description information comprises second text information; the step of inputting the target scene data into a preset scene description model to obtain scene description information for describing a target scene includes:
inputting each frame image of the target scene video into the video encoder to obtain image characteristics corresponding to each frame image;
inputting the image characteristics corresponding to each frame image into the video decoder to obtain a hidden state and an initial video description sentence corresponding to each frame image;
inputting the hidden state corresponding to each frame image into the reconstructor to obtain global video characteristics and reconstruction loss;
and taking the global video characteristics as the target scene video and inputting them into the video encoder, returning to the step of inputting each frame image of the target scene video into the video encoder to obtain the image characteristics corresponding to each frame image, until the reconstruction loss is less than a preset threshold, and taking the obtained initial video description sentence as the second text information.
6. The method of claim 5, wherein the generating the voice information according to the scene description information comprises:
performing feature fusion on the first text information and the second text information to obtain fused text information;
and inputting the fused text information into a preset text-to-speech conversion model to obtain the voice information.
7. The method according to claim 6, wherein the performing feature fusion on the first text information and the second text information to obtain fused text information comprises:
inputting the first text information into a first text encoder to obtain a first text characteristic;
inputting the second text information into a second text encoder to obtain a second text characteristic;
performing feature fusion on the first text feature and the second text feature to obtain a fusion feature;
and inputting the fusion characteristics into a text decoder to obtain the fusion text information.
8. The method of claim 6, wherein the text-to-speech conversion model comprises an encoder, a decoder, and a waveform network module, and the inputting the fused text information into a preset text-to-speech conversion model to obtain the voice information comprises:
inputting the fused text information into the encoder to obtain text characteristic information;
inputting the text characteristic information into the decoder to obtain Mel frequency spectrum information;
and inputting the Mel frequency spectrum information into the waveform network module to obtain the voice information.
9. The method according to any one of claims 1 to 8, wherein the target scene data comprises a target scene image and/or a target scene video, and the acquiring the target scene data comprises:
acquiring a visual auxiliary instruction generated by a user executing a visual auxiliary operation, wherein the visual auxiliary instruction is an image acquisition instruction or a video acquisition instruction;
if the visual auxiliary instruction is the image acquisition instruction, starting image acquisition equipment to perform image shooting to obtain the target scene image;
and if the visual auxiliary instruction is the video acquisition instruction, starting image acquisition equipment to perform video recording to obtain the target scene video.
10. The method of claim 9, wherein the acquiring the visual auxiliary instruction generated by the user executing the visual auxiliary operation comprises:
monitoring the number of times a user touches an input key;
and generating the visual auxiliary instruction according to the number of touches on the input key within a preset time period.
11. The method of claim 9, wherein the acquiring the visual auxiliary instruction generated by the user executing the visual auxiliary operation comprises:
monitoring a touch duration of the user on an input key;
and generating the visual auxiliary instruction according to the touch duration of the user on the input key.
12. The method of claim 9, wherein the acquiring the visual auxiliary instruction generated by the user executing the visual auxiliary operation comprises:
acquiring a voice instruction input by a user;
and analyzing the voice instruction to generate the visual auxiliary instruction.
13. A visual assistance device, comprising:
the acquisition module is used for acquiring target scene data;
the description module is used for inputting the target scene data into a preset scene description model to obtain scene description information for describing a target scene;
and the voice module is used for generating voice information according to the scene description information.
14. A visual assistance device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 12 when executing the computer program.
15. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the method according to any one of claims 1 to 12.
CN202110106026.4A 2021-01-26 2021-01-26 Visual assistance method, device and computer readable storage medium Pending CN114792393A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110106026.4A CN114792393A (en) 2021-01-26 2021-01-26 Visual assistance method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110106026.4A CN114792393A (en) 2021-01-26 2021-01-26 Visual assistance method, device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114792393A true CN114792393A (en) 2022-07-26

Family

ID=82459653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110106026.4A Pending CN114792393A (en) 2021-01-26 2021-01-26 Visual assistance method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114792393A (en)

Similar Documents

Publication Publication Date Title
US11423909B2 (en) Word flow annotation
JP5323770B2 (en) User instruction acquisition device, user instruction acquisition program, and television receiver
Kessous et al. Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis
Al-Ahdal et al. Review in sign language recognition systems
US9223786B1 (en) Communication in a sensory immersive motion capture simulation environment
CN110598576A (en) Sign language interaction method and device and computer medium
Arsan et al. Sign language converter
WO2019051082A1 (en) Systems, methods and devices for gesture recognition
Minotto et al. Multimodal multi-channel on-line speaker diarization using sensor fusion through SVM
Niewiadomski et al. Automated laughter detection from full-body movements
CN113835522A (en) Sign language video generation, translation and customer service method, device and readable medium
JP6166234B2 (en) Robot control apparatus, robot control method, and robot control program
Patwardhan et al. Augmenting supervised emotion recognition with rule-based decision model
CN113658254A (en) Method and device for processing multi-modal data and robot
WO2017191713A1 (en) Control device, control method, and computer program
CN114779922A (en) Control method for teaching apparatus, control apparatus, teaching system, and storage medium
CN107452381B (en) Multimedia voice recognition device and method
KR100348823B1 (en) Apparatus for Translating of Finger Language
CN111063024A (en) Three-dimensional virtual human driving method and device, electronic equipment and storage medium
Chu et al. Multimodal real-time contingency detection for HRI
CN113497912A (en) Automatic framing through voice and video positioning
CN112711331A (en) Robot interaction method and device, storage equipment and electronic equipment
CN114792393A (en) Visual assistance method, device and computer readable storage medium
Khan et al. Sign language translation in urdu/hindi through microsoft kinect
Kunapareddy et al. Smart Vision based Assistant for Visually Impaired

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination