CN111046223A - Voice assisting method, terminal, server and system for visually impaired

Info

Publication number
CN111046223A
Authority
CN
China
Prior art keywords
audio file
characters
image
text
unit
Prior art date
Legal status
Pending
Application number
CN201911113176.7A
Other languages
Chinese (zh)
Inventor
李秉伦
Current Assignee
Individual
Original Assignee
Individual
Application filed by Individual
Priority to CN201911113176.7A
Publication of CN111046223A

Classifications

    • G06F16/686 (Physics; Computing; Electric digital data processing): Information retrieval of audio data; retrieval characterised by using metadata generated manually, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G06F16/638 (Physics; Computing; Electric digital data processing): Information retrieval of audio data; querying; presentation of query results
    • G06V20/62 (Physics; Computing; Image or video recognition or understanding): Scenes; scene-specific elements; type of objects; text, e.g. of license plates, overlay texts or captions on TV images
    • G06V30/153 (Physics; Computing; Image or video recognition or understanding): Character recognition; image acquisition; segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Library & Information Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a voice assistance method, terminal, server and system for the visually impaired. The method comprises the following steps: acquiring a text carrier and converting it into an image to be recognized; transmitting the image to be recognized to a server, so that the server locates a text box containing characters on the image through a character recognition model, performs character recognition on the text box to obtain the characters, corrects the characters, converts the corrected characters into an audio file, stores the audio file and obtains a storage address of the audio file; and receiving the storage address of the audio file from the server, acquiring the audio file according to the storage address and playing it to the user. The method and the device can convert characters on a text carrier into speech, assisting the daily life of the visually impaired.

Description

Voice assisting method, terminal, server and system for visually impaired
Technical Field
The invention relates to the technical field of vision assistance, in particular to a voice assistance method, terminal, server and system for a visually impaired person.
Background
Visually impaired people generally cannot observe their surroundings clearly and must rely largely on other senses to perceive the environment. As a result, they cannot obtain information from text carriers, which makes independent living and travel difficult and brings inconvenience to daily life and study.
Disclosure of Invention
The invention aims to provide a voice assistance method for the visually impaired, which can convert characters on a text carrier into speech so as to assist the daily life of the visually impaired. Another object of the present invention is to provide a terminal. It is yet another object of the present invention to provide a server. It is a further object of the present invention to provide a computer apparatus. It is a further object of this invention to provide a computer readable medium. It is still another object of the present invention to provide a voice assistance system for the visually impaired.
In order to achieve the above object, the present invention discloses a method for assisting a visually impaired person with speech, comprising:
acquiring a text carrier, and converting the text carrier into an image to be recognized;
transmitting the image to be recognized to a server, so that the server locates a text box containing characters on the image through a character recognition model, performs character recognition on the text box to obtain the characters, corrects the characters, converts the corrected characters into an audio file, stores the audio file and obtains a storage address of the audio file;
and receiving a storage address of the audio file transmitted by the server, acquiring the audio file according to the storage address and playing the audio file to a user.
Preferably, acquiring the text carrier specifically includes:
collecting at least one image through an image collection device to form the text carrier; or,
and receiving a voice assistance request transmitted by a terminal APP, and acquiring the text carrier to be recognized from the terminal APP.
Preferably, the text carrier comprises one or more of an image, a video, WORD, PDF, and PPT.
Preferably, the method further comprises:
receiving a voice instruction of a visually impaired person and/or a touch instruction input through a touch display screen;
analyzing the voice command and/or the touch command to obtain control information;
and determining a control instruction according to the control information and executing corresponding operation.
Preferably, the corresponding operations include one or more of the following operations:
acquiring a character carrier;
displaying the detection progress of the text carrier;
playing the obtained audio file;
stopping playing the audio file;
and displaying the recognized characters.
The invention also discloses a voice assistance method for the visually impaired, which comprises the following steps:
receiving an image to be recognized transmitted by a terminal;
locating a text box containing characters on the image to be recognized through a character recognition model, performing character recognition on the text box to obtain the characters, correcting the characters, converting the corrected characters into an audio file, storing the audio file and obtaining a storage address of the audio file;
and feeding back the storage address of the audio file to the terminal, so that the terminal acquires the audio file according to the storage address and plays it to the user.
Preferably, locating the text box containing characters on the image to be recognized through the character recognition model, performing character recognition on the text box to obtain the characters, and correcting the characters specifically comprises:
locating at least one text box containing characters on the image to be recognized through a CTPN unit;
performing character recognition on the text box through a DenseNet unit to obtain the characters;
and correcting the characters through a CTC unit to obtain characters conforming to natural language rules.
Preferably, the method further comprises, before obtaining the text carrier:
determining setting parameters of the CTPN unit, the DenseNet unit and the CTC unit;
connecting a CTPN unit, a DenseNet unit and a CTC unit in sequence according to the setting parameters to form a neural network model;
and selecting existing image data to form training samples, training the neural network model to obtain optimized setting parameters, and forming the deep-learning-based character recognition model according to the optimized setting parameters.
The invention also discloses a voice assistance method for the visually impaired, which comprises the following steps:
acquiring a text carrier, and converting the text carrier into an image to be recognized;
locating a text box containing characters on the image to be recognized through a character recognition model, performing character recognition on the text box to obtain the characters, correcting the characters, converting the corrected characters into an audio file, storing the audio file and obtaining a storage address of the audio file;
and acquiring the audio file according to the storage address and playing the audio file to a user.
The invention also discloses a terminal, comprising:
the first preprocessing unit is used for acquiring a text carrier and converting the text carrier into an image to be recognized;
the first sending unit is used for transmitting the image to be recognized to a server, so that the server locates a text box containing characters on the image through a character recognition model, performs character recognition on the text box to obtain the characters, corrects the characters, converts the corrected characters into an audio file, stores the audio file and obtains a storage address of the audio file;
and the first receiving unit is used for receiving the storage address of the audio file transmitted by the server, acquiring the audio file according to the storage address and playing the audio file to a user.
The invention also discloses a server, comprising:
the second receiving unit is used for receiving the image to be recognized transmitted by the terminal;
the first recognition conversion unit is used for locating a text box containing characters on the image to be recognized through a character recognition model, performing character recognition on the text box to obtain the characters, correcting the characters, converting the corrected characters into an audio file, storing the audio file and obtaining a storage address of the audio file;
and the second sending unit is used for feeding back the storage address of the audio file to the terminal so that the terminal can acquire the audio file according to the storage address and play the audio file to a user.
The invention also discloses a voice assistance system for the visually impaired, which comprises a second preprocessing unit, a second recognition conversion unit and an audio file acquisition unit;
the second preprocessing unit is used for acquiring a text carrier and converting the text carrier into an image to be recognized;
the second recognition conversion unit is used for locating a text box containing characters on the image to be recognized through a character recognition model, performing character recognition on the text box to obtain the characters, correcting the characters, converting the corrected characters into an audio file, storing the audio file and obtaining a storage address of the audio file;
and the audio file acquisition unit is used for acquiring the audio file according to the storage address and playing the audio file to a user.
The invention also discloses a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor,
the processor, when executing the program, implements the method as described above.
The invention also discloses a computer-readable medium, having stored thereon a computer program,
which when executed by a processor implements the method as described above.
The invention can obtain text carriers of different forms through various channels, obtain the image to be recognized by splitting, compressing and otherwise preprocessing the text carrier, and then transmit the image to be recognized to the server. Character recognition is performed on the image through a character recognition model preset on the server to obtain the characters on the image: the model first locates the text boxes and then recognizes the characters within them in a targeted manner, and the recognized characters are corrected, so that illogical recognition results are turned into characters that conform to natural language rules; the corrected characters are then converted into an audio file of natural human speech. After the characters are converted into the audio file, the storage address of the audio file is transmitted back to the terminal; the terminal accesses the network location where the audio file is stored through the storage address, obtains the audio file, and plays it to the user through an audio device such as a loudspeaker provided on the terminal, so that a visually impaired person can obtain the corresponding information without having to see the characters on the text carrier clearly.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of one embodiment of a method for assisting a visually impaired person with speech according to the present invention;
FIG. 2 is a diagram illustrating an image of an object in an example of a speech assistance method for visually impaired according to the present invention;
FIG. 3 is a diagram illustrating a recognition result of a text box of an image according to a specific example of the speech assistance method for visually impaired according to the present invention;
FIG. 4 is a diagram illustrating a text recognition result of an image according to a specific example of the speech assistance method for visually impaired according to the present invention;
FIG. 5 is a diagram showing an image of an object in another embodiment of the speech assistance method for visually impaired according to the present invention;
FIG. 6 is a diagram illustrating a recognition result of a text box of an image according to another embodiment of the speech assistance method for visually impaired persons of the present invention;
FIG. 7 is a diagram illustrating a text recognition result of an image according to another embodiment of the speech assistance method for visually impaired according to the present invention;
FIG. 8 is a schematic diagram of the CTPN in an embodiment of a method for assisting a visually impaired person with speech according to the present invention;
FIG. 9 is a schematic diagram of DenseNet in an embodiment of the speech assistance method for visually impaired according to the present invention;
FIG. 10 is a second flowchart of an embodiment of a method for assisting a visually impaired person in speech according to the present invention;
FIG. 11 is a third flowchart of an embodiment of a method for assisting a visually impaired person in speech according to the present invention;
FIG. 12 is a fourth flowchart of an embodiment of a method for assisting a visually impaired person in speech according to the present invention;
FIG. 13 is a diagram illustrating an APP program interface in one embodiment of a method for assisting a visually impaired person with speech according to the present invention;
FIG. 14 is a fifth flowchart of an embodiment of a method for assisting a visually impaired person in speech according to the present invention;
FIG. 15 is a sixth flowchart of an embodiment of a method for assisting a visually impaired person with speech according to the present invention;
FIG. 16 is a seventh flowchart of an embodiment of a method for assisting a visually impaired person with speech according to the present invention;
FIG. 17 is an eighth flowchart of an embodiment of a method for assisting a visually impaired person with speech according to the present invention;
FIG. 18 is a ninth flowchart of an embodiment of a method for assisting a visually impaired person with speech according to the present invention;
FIG. 19 is a tenth flowchart of an embodiment of a method for assisting a visually impaired person with speech according to the present invention;
FIG. 20 is a block diagram illustrating one embodiment of a terminal of the present invention;
FIG. 21 is a block diagram illustrating one embodiment of a server of the present invention;
FIG. 22 is a block diagram illustrating one embodiment of a vision-impaired speech assistance system of the present invention;
FIG. 23 illustrates a schematic block diagram of a computer device suitable for use in implementing embodiments of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Generally, a visually impaired person is one whose visual organs (eyeball, optic nerve, visual center of the brain) have a partial or complete structural or functional disorder due to congenital or acquired causes, and who cannot (or can only with great difficulty) visually recognize external things even after treatment. The visual disorders referred to in the present invention range widely, from poor vision to complete blindness; symptoms of poor vision include blurred vision, cloudiness, hypermetropia or myopia, color blindness, tunnel vision, and the like. For visually impaired people with seriously deteriorated vision, glasses cannot compensate for the loss, and other auxiliary tools are needed to assist or replace the user in visual recognition.
Based on the above problem, the present embodiment provides a voice assistance system for the visually impaired, the system including a terminal and a server. It is understood that the terminal may be one of a smart phone, a tablet electronic device, a network set-top box, a portable computer, a desktop computer, a Personal Digital Assistant (PDA), an in-vehicle device, and a smart wearable device, where the smart wearable device may be one of smart glasses, a smart watch, a smart bracelet, and the like.
In practical application, the part that performs character recognition on the image to be recognized and converts the recognized characters into speech can be executed on the server side as described above, but all operations can also be completed in the terminal; that is, the server can be integrated in the terminal, or the terminal in the server, so that the assistance function for the visually impaired is realized in the form of a single integrated device. The choice can be made according to the processing capability of the terminal, the limitations of the user's usage scenario, and the like; this is not a limitation of the present application. If all operations are performed in the terminal, the terminal may further include a processor.
The terminal may have a communication module (i.e., a communication unit), and may be communicatively connected to a remote server to implement data transmission with the server. The server may include a server on the task scheduling center side, and in other implementation scenarios, the server may also include a server on an intermediate platform, for example, a server on a third-party server platform that is communicatively linked to the task scheduling center server. The server may include a single computer device, or may include a server cluster formed by a plurality of servers, or a server structure of a distributed apparatus.
The server and the terminal may communicate using any suitable network protocol, including network protocols not yet developed at the filing date of this application. The network protocols may include, for example, the TCP/IP, UDP/IP, HTTP and HTTPS protocols. Of course, they may also include, for example, the RPC protocol (Remote Procedure Call Protocol) and the REST protocol (Representational State Transfer Protocol) used on top of the above protocols.
Based on the above problem, according to an aspect of the present invention, this embodiment first discloses a method for assisting a visually impaired person with speech. As shown in fig. 1, in this embodiment, the method includes:
S100: acquiring a text carrier, and converting the text carrier into an image to be recognized.
S200: transmitting the image to be recognized to a server, so that the server locates a text box containing characters on the image through a character recognition model, performs character recognition on the text box to obtain the characters, corrects the characters, converts the corrected characters into an audio file, stores the audio file and obtains a storage address of the audio file.
S300: receiving the storage address of the audio file transmitted by the server, acquiring the audio file according to the storage address and playing the audio file to the user.
The invention can obtain text carriers of different forms through various channels, obtain the image to be recognized by splitting, compressing and otherwise preprocessing the text carrier, and then transmit the image to be recognized to the server. Character recognition is performed on the image through a character recognition model preset on the server to obtain the characters on the image: the model first locates the text boxes and then recognizes the characters within them in a targeted manner, and the recognized characters are corrected, so that illogical recognition results are turned into characters that conform to natural language rules; the corrected characters are then converted into an audio file of natural human speech. After the characters are converted into the audio file, the storage address of the audio file is transmitted back to the terminal; the terminal accesses the network location where the audio file is stored through the storage address, obtains the audio file, and plays it to the user through an audio device such as a loudspeaker provided on the terminal, so that a visually impaired person can obtain the corresponding information without having to see the characters on the text carrier clearly.
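As a concrete illustration of this terminal-to-server round trip, the sketch below posts the image and fetches the returned audio file. The HTTP endpoint, JSON field name and server address are assumptions for illustration only, since the patent does not define a wire protocol:

```python
# Minimal sketch of the terminal-side flow; the endpoint path, JSON field
# name and server address are illustrative assumptions, not from the patent.
import requests

SERVER = "http://example-server:8000"  # hypothetical server address

def voice_assist(image_path: str, out_path: str = "result.mp3") -> str:
    # Transmit the image to be recognized to the server
    with open(image_path, "rb") as f:
        resp = requests.post(f"{SERVER}/recognize", files={"image": f})
    resp.raise_for_status()

    # Receive the storage address of the audio file from the server
    audio_url = resp.json()["audio_url"]

    # Acquire the audio file according to the storage address
    audio = requests.get(audio_url)
    audio.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(audio.content)
    return out_path  # play through the terminal's loudspeaker, e.g. with playsound
```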
It can be understood that the method can be implemented by an application program installed in the terminal. Using the prior art, a functional module capable of parsing a plurality of different text carriers can be built into the terminal application; this module can recognize and parse different text carriers, for example by calling an external program or through plug-ins, to obtain the image to be recognized. The text carrier may include one or more of an image, a video, WORD, PDF, and PPT; in practical application, other types of text carriers may also be obtained according to actual needs, which is not limited by the present invention.
In a preferred embodiment, the server may be pre-configured with a character recognition model obtained by deep learning. In practical application, the character recognition model could also be a functional model based on OCR technology, which obtains characters by performing character recognition on the image to be recognized. In this embodiment, in order to improve the accuracy of character recognition, the character recognition model is built with deep learning: deep learning automatically extracts features from the input pictures and characters and completes parameter training, performs character recognition without depending on subjective human judgment, has strong anti-interference capability, can recognize pictures against complex backgrounds, and achieves high recognition precision, thereby avoiding the low precision and accuracy of OCR-based character recognition.
In a preferred embodiment, the character recognition model locates at least one text box containing characters on the image to be recognized through a CTPN unit, then performs character recognition on the text boxes through a DenseNet unit to obtain the characters, and corrects the characters through a CTC unit to obtain characters conforming to natural language rules.
Specifically, a character recognition model is formed by using the CTPN (Connectionist Text Proposal Network) technique in deep learning to locate text in the image to be recognized, and then Connectionist Temporal Classification (CTC) is combined with DenseNet to recognize the characters in the located regions, achieving accurate recognition of the characters in the image. In one specific example, fig. 2 shows a text carrier: an image of an object in the natural environment captured by an image acquisition device. A text box containing characters can be located by the CTPN unit, i.e. the region within the rectangular box shown in fig. 3; the characters in the text box can then be recognized by the DenseNet unit, and the recognized characters are shown in fig. 4. In another specific example, fig. 5 shows an image to be recognized derived from a received text carrier in PDF or a similar form; a text box containing characters can be located by the CTPN unit, i.e. the region within the rectangular box shown in fig. 6, the characters in the text box can be recognized by the DenseNet unit, and the recognized characters are shown in fig. 7.
As shown in fig. 8 and 9, the first stage of the CTPN unit is the convolutional feature map (conv5) of the VGG16 model, over which a 3x3 spatial window slides densely after the image has been convolved. The sequences generated by the sliding window are then processed recurrently by a BiLSTM network, which can capture the dependencies and correlations between the sequence texts. The output of the BiLSTM is connected to a fully connected layer, followed by three different output layers that jointly predict the text/non-text scores, the y-axis coordinates and the side-refinement offsets for k (a positive integer) preset fixed anchors. In this way the regions of the text boxes containing characters in the image can be obtained, i.e. the text box regions can be located, through the CTPN unit.
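For orientation, the following PyTorch sketch mirrors this structure. It is a rough illustration, not the patent's implementation: the hidden sizes, the default k, and the torchvision VGG16 backbone are assumptions.

```python
# Rough sketch of the CTPN structure described above: VGG16 conv5 features,
# a 3x3 sliding window, a BiLSTM over the width axis, and three prediction
# heads for k anchors. Hyper-parameters are illustrative only.
import torch
import torch.nn as nn
import torchvision

class CTPN(nn.Module):
    def __init__(self, k: int = 10):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None)
        self.backbone = vgg.features[:-1]                # up to the conv5 feature map
        self.window = nn.Conv2d(512, 512, 3, padding=1)  # 3x3 sliding spatial window
        self.rnn = nn.LSTM(512, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, 512)                    # fully connected layer
        self.score = nn.Linear(512, 2 * k)               # text / non-text scores
        self.vertical = nn.Linear(512, 2 * k)            # y-axis coordinate, height
        self.side = nn.Linear(512, k)                    # side-refinement offsets

    def forward(self, x):
        f = self.window(self.backbone(x))                # (B, 512, H, W)
        b, c, h, w = f.shape
        seq = f.permute(0, 2, 3, 1).reshape(b * h, w, c) # one sequence per feature row
        out, _ = self.rnn(seq)                           # BiLSTM over the width axis
        out = torch.relu(self.fc(out)).reshape(b, h, w, 512)
        return self.score(out), self.vertical(out), self.side(out)
```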
DenseNet has two distinctive properties. The first is that it connects all layers directly. The second is that each layer takes the feature maps of all preceding layers as input and passes its own feature maps on to all subsequent layers. In this embodiment, DenseNet is used as the main algorithm to form the DenseNet unit, which performs character recognition within the text box regions located by the CTPN unit. The DenseNet unit effectively alleviates the vanishing-gradient problem, strengthens feature propagation, supports feature reuse and greatly reduces the number of parameters.
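A minimal sketch of this dense connectivity follows, assuming a simple BN-ReLU-Conv composition for each layer (the patent does not give layer-level details):

```python
# Minimal sketch of the dense connectivity described above: each layer
# receives the concatenated feature maps of all preceding layers and its
# output is passed on to all subsequent layers.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate, 3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # every layer sees all earlier feature maps, concatenated
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)
```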
Since the output of the DenseNet unit is noisy, unsegmented data rather than accurately segmented characters, the output data is further integrated using a CTC unit formed on the basis of CTC technology to obtain the final text prediction. CTC is a scoring function for neural network sequence outputs, used to score the labeling of those outputs, thereby eliminating the need to segment and align the training data with the outputs. To handle unsegmented outputs, CTC introduces an additional label, commonly called the placeholder (blank) character, into the allowed output set. The placeholder does not correspond to any output symbol and is deleted from the final result. Taking the output "hello" as an example: when an output row contains at least two consecutive identical characters, a placeholder between them keeps the characters distinct, so that a path such as "h e l _ l o" produces the result "hello" rather than "helo". In the algorithm, the output "hello" is treated as a result that can be expanded backwards into paths aligned with the input length. Therefore, a loss function is needed:
$$p(l \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(l)} p(\pi \mid x)$$
where p(π | x) is the probability that, given input x, the corresponding observed sequence is π. The summation runs over all observed sequences π with B(π) = l, and estimates of the parameters are obtained by maximizing p(l | x), thereby determining the whole character sequence. The invention recognizes the Chinese characters in the image by combining DenseNet and CTC. In other embodiments, other deep learning algorithms may be selected to form the character recognition model according to the requirements on recognition accuracy and efficiency, which is not limited by the present invention.
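The following sketch illustrates both ideas under stated assumptions: the collapse mapping B (with 0 as the assumed blank index) and PyTorch's built-in CTC loss standing in for the loss above. Shapes and sizes are illustrative.

```python
# Sketch of the CTC ideas above: the collapse mapping B removes repeated
# symbols and then blanks, so a blank between two identical symbols keeps
# a double letter; training maximizes p(l|x) over all paths collapsing to l.
import torch
import torch.nn as nn

BLANK = 0  # assumed index of the placeholder (blank) symbol

def collapse(path: list[int]) -> list[int]:
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out

# With a=1, ..., z=26: collapse([8, 8, 0, 5, 5, 12, 0, 12, 15])
# -> [8, 5, 12, 12, 15], i.e. "hello"; the blank between the two 12s
# keeps the double "l" from being merged into "helo".

ctc = nn.CTCLoss(blank=BLANK)
log_probs = torch.randn(50, 4, 100).log_softmax(2)   # (T, batch, classes)
targets = torch.randint(1, 100, (4, 20))             # label sequences (no blanks)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 50, dtype=torch.long),
           target_lengths=torch.full((4,), 20, dtype=torch.long))
```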
On this basis, the character recognition model is deployed on the server in advance; for example, a cloud server provided by a vendor can be selected. By accessing the server, the character recognition model can be called to perform character recognition on the image to be recognized, and the recognized characters are converted into an audio file and fed back to the terminal. The terminal is preferably a mobile terminal: through data communication between the terminal and the server, the audio file is fed back to the user through an APP installed on the mobile terminal. Character recognition in natural scenes, object recognition and speech synthesis thus convert characters the user cannot see into audible speech, helping visually impaired people to go shopping, take public transport and carry out other daily activities more conveniently. Meanwhile, the APP is also suitable for ordinary users, so that the general public can enjoy the convenience brought by deep neural networks and artificial intelligence.
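The patent does not name a text-to-speech engine for producing the audio file; as one hedged illustration, a third-party library such as gTTS (an assumed dependency, with an assumed language code) could perform this step:

```python
# Illustration only: the patent does not name a TTS engine. gTTS is an
# assumed third-party dependency used here to produce the mp3 audio file.
from gtts import gTTS

def text_to_audio(text: str, path: str = "speech.mp3") -> str:
    tts = gTTS(text=text, lang="zh-CN")  # language code is an assumption
    tts.save(path)  # the saved file's location serves as the storage address
    return path
```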
In a preferred embodiment, when forming the character recognition model, setting parameters of the CTPN unit, the DenseNet unit and the CTC unit, such as the window size, may be determined. The CTPN unit, the DenseNet unit and the CTC unit are connected in sequence according to the setting parameters to form a neural network model. Existing image data are selected to form training samples, the neural network model is trained to obtain optimized setting parameters, and the deep-learning-based character recognition model is formed according to the optimized setting parameters. The training samples can be a large number of sample images containing characters together with corresponding labels; the labeled sample images are input into the preliminarily established neural network for training, although unlabeled sample images can also be used as training samples. Preferably, the CTPN network can be trained on the ICDAR data set, with a model recall of 79% and a precision of 89% on the test set. The training data for the DenseNet unit is a Chinese corpus, which may contain 3,640,000 images covering 5,990 different Chinese characters. The accuracy of the DenseNet unit on the test set can reach 0.983.
In an alternative embodiment, as shown in fig. 10, acquiring the text carrier in S100 may include:
S110: acquiring at least one image through an image acquisition device on the terminal to form the text carrier. Generally, a mobile terminal such as a mobile phone or an iPad is equipped with an image acquisition device such as a camera, but the image acquisition device may also be an external one electrically connected to the terminal, for example an external camera. At least one captured image is used as the text carrier, preprocessed and transmitted to the server. In use, the user can therefore capture an image of a natural scene through the terminal and obtain the text information in that scene, recognized and converted into speech, without seeking help from others, which is convenient for a visually impaired person going out alone.
In other alternative embodiments, as shown in fig. 11, acquiring the text carrier in S100 may include:
S120: receiving a voice assistance request transmitted by a terminal APP, and acquiring the text carrier to be recognized from the terminal APP. In this alternative implementation, text carriers received or stored by other application programs (APPs) on the terminal, such as PPT or PDF documents, may also be acquired. The PPT or PDF document is used as the text carrier to be recognized, and each page of the PPT or PDF can be converted into an image to be recognized, which is then transmitted to the server for character recognition and conversion into an audio file; the user can thus obtain the text information in a received document and communicate with other users through social software.
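As a sketch of this page-to-image step, assuming the third-party pdf2image library and a one-image-per-page strategy (the patent does not prescribe a conversion tool):

```python
# One way to turn a received PDF document into images to be recognized;
# pdf2image is an assumed dependency, and dpi/paths are illustrative.
from pdf2image import convert_from_path

def pdf_to_images(pdf_path: str, out_dir: str = ".") -> list[str]:
    paths = []
    for i, page in enumerate(convert_from_path(pdf_path, dpi=200)):
        path = f"{out_dir}/page_{i:03d}.png"
        page.save(path, "PNG")  # each page becomes one image to be recognized
        paths.append(path)
    return paths
```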
In a preferred embodiment, as shown in fig. 12, the method further comprises:
S101: receiving a voice instruction of the visually impaired person and/or a touch instruction input through the touch display screen.
S102: analyzing the voice instruction and/or the touch instruction to obtain control information.
S103: determining a control instruction according to the control information and executing the corresponding operation.
Specifically, the method described in this embodiment may be implemented by an APP on the terminal. The terminal APP can show a program interface to the user, on which at least one virtual key is provided. In an optional implementation, a touch instruction input by the user through the touch display screen can be obtained by detecting the touch position on the program interface and determining which virtual key the user has pressed, so that the corresponding operation is executed. For example, as shown in fig. 13, four virtual keys for photographing, checking the detection progress, playing and stopping may be provided on the program interface. When the user touches the region corresponding to photographing, the touch position indicates that the photographing key has been pressed, so the terminal APP controls the image acquisition device of the terminal to photograph and capture an image to form the text carrier. The APP compresses the text carrier and transmits it to the server for character recognition and conversion into an audio file, which may be stored in a format such as mp3, wma or wav; the APP then receives the storage address of the audio file transmitted by the server and can obtain the audio file locally through that address.
In a preferred embodiment, the corresponding operation in S103 includes one or more of the following: acquiring a text carrier; displaying the detection progress of the text carrier; playing the obtained audio file; stopping playing the audio file; and displaying the recognized characters. For example, when the user touches the region corresponding to checking the progress, the touch position indicates that the progress key has been pressed. Different detection progresses can be preset according to the running state of the APP: while the text carrier is being processed into the image to be recognized, the progress is "processing"; after the image has been transmitted to the server, it is "recognizing and converting"; and after the storage address of the audio file has been received, it is "recognition and conversion completed". As the APP runs through these states, the corresponding progress can be displayed to the user. When the user touches the region for playing or stopping, the touch position indicates that the play or stop key has been pressed, so the corresponding operation is executed and the audio file is played, or stopped, through an audio device (such as a loudspeaker) of the terminal.
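A toy sketch of this S101-S103 dispatch follows; the quadrant layout, normalized coordinates and operation names are invented for illustration and are not fixed by the patent:

```python
# Toy dispatch for S101-S103: a touch position is parsed into control
# information and mapped to one of the four operations of the fig. 13
# interface. Screen regions and operation names are illustrative only.
OPERATIONS = {
    "shoot": "acquire a text carrier",
    "progress": "display the detection progress",
    "play": "play the obtained audio file",
    "stop": "stop playing the audio file",
}

def key_for_touch(x: float, y: float) -> str:
    # hypothetical quadrant layout with large touch targets (normalized coords)
    if y < 0.5:
        return "shoot" if x < 0.5 else "progress"
    return "play" if x < 0.5 else "stop"

def execute(key: str) -> None:
    print(f"executing: {OPERATIONS[key]}")

execute(key_for_touch(0.2, 0.3))  # -> executing: acquire a text carrier
```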
Fig. 13 shows a specific example of the program interface of the terminal APP. The interface is provided with only four virtual keys, so each key can occupy a large area suitable for the visually impaired. Preferably, for a main function key such as "photograph", the effective touch range of the key can be set to at least one third of the area of the terminal interface, and any touch within this range makes the APP execute the corresponding operation, so that even a completely blind user can operate the device alone by touch and experience. Compared with existing APPs with complex functions and options, an audio file is obtained through one-key recognition and conversion, the program interface is simplified, and intermediate or unnecessary controls are removed without affecting use, making operation simple and easy for the visually impaired.
In a preferred embodiment, as shown in fig. 14, the method further comprises:
S400: receiving the characters transmitted by the server and displaying the characters to the user. It can be understood that the server can additionally return the recognized characters to the terminal, and the terminal can display them to the user, so that the user can further process the text as needed.
Based on the same principle, this embodiment also discloses a voice assistance method for the visually impaired. As shown in fig. 15, the method includes:
S500: receiving the image to be recognized transmitted by the terminal.
S600: locating a text box containing characters on the image to be recognized through a character recognition model, performing character recognition on the text box to obtain the characters, correcting the characters, converting the corrected characters into an audio file, storing the audio file and obtaining a storage address of the audio file.
S700: feeding back the storage address of the audio file to the terminal, so that the terminal acquires the audio file according to the storage address and plays it to the user.
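A hedged sketch of this server-side flow as a small Flask service follows. The framework, endpoint, field names, and the recognize()/text_to_speech() stubs (standing in for the CTPN/DenseNet/CTC pipeline and a TTS engine) are assumptions, not the patent's implementation:

```python
# Sketch of S500-S700 as a Flask service; all names are illustrative.
# Assumes a ./static directory exists and is served by Flask as usual.
import uuid
from flask import Flask, request, jsonify, url_for

app = Flask(__name__)

def recognize(image_bytes: bytes) -> str:
    raise NotImplementedError  # locate text boxes, recognize, CTC-correct

def text_to_speech(text: str, path: str) -> None:
    raise NotImplementedError  # e.g. a TTS engine writing an mp3 file

@app.route("/recognize", methods=["POST"])
def handle():
    image = request.files["image"].read()      # S500: image from the terminal
    text = recognize(image)                    # S600: recognition and correction
    name = f"{uuid.uuid4().hex}.mp3"
    text_to_speech(text, f"static/{name}")     # S600: convert to audio and store
    # S700: feed the storage address of the audio file back to the terminal
    return jsonify({"audio_url": url_for("static", filename=name, _external=True)})
```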
In a preferred embodiment, as shown in fig. 16, locating the text box containing characters on the image to be recognized through the character recognition model, performing character recognition on the text box to obtain the characters, and correcting the characters in S600 specifically includes:
S610: locating at least one text box containing characters on the image to be recognized through the CTPN unit.
S620: performing character recognition on the text box through the DenseNet unit to obtain the characters.
S630: correcting the characters through the CTC unit to obtain characters conforming to natural language rules.
In a preferred embodiment, as shown in fig. 17, the method further includes S800:
S800: transmitting the characters to the terminal, so that the terminal displays the recognized characters to the user.
In a preferred embodiment, the method further includes a step S000 before receiving the image to be recognized transmitted by the terminal. As shown in fig. 18, S000 may specifically include:
S010: determining the setting parameters of the CTPN unit, the DenseNet unit and the CTC unit.
S020: connecting a CTPN unit, a DenseNet unit and a CTC unit in sequence according to the setting parameters to form a neural network model.
S030: selecting existing image data to form training samples, training the neural network model to obtain optimized setting parameters, and forming the deep-learning-based character recognition model according to the optimized setting parameters.
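An illustrative training loop for S010-S030 follows, assuming a data loader of labeled text-line images and a model emitting per-timestep class scores; the optimizer and hyper-parameters are placeholders, not values from the patent:

```python
# Illustrative training loop, assuming the model outputs scores shaped
# (T, batch, classes); CTC loss supervises the unsegmented sequences.
import torch
import torch.nn as nn

def train(model, loader, epochs: int = 10, blank: int = 0) -> None:
    ctc_loss = nn.CTCLoss(blank=blank)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for images, targets, target_lengths in loader:
            log_probs = model(images).log_softmax(2)      # (T, batch, classes)
            steps, batch = log_probs.shape[0], log_probs.shape[1]
            input_lengths = torch.full((batch,), steps, dtype=torch.long)
            loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```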
Since the principle by which this method solves the problem is similar to that of the foregoing method, its implementation can refer to the implementation of the foregoing method and is not repeated here.
Based on the same principle, this embodiment also discloses a voice assistance method for the visually impaired. As shown in fig. 19, the method includes:
S1100: acquiring a text carrier, and converting the text carrier into an image to be recognized.
S1200: locating a text box containing characters on the image to be recognized through a character recognition model, performing character recognition on the text box to obtain the characters, correcting the characters, converting the corrected characters into an audio file, storing the audio file and obtaining a storage address of the audio file.
S1300: acquiring the audio file according to the storage address and playing the audio file to the user.
It should be noted that this embodiment is characterized in that its execution body is a single computer device (for example, a mobile phone); the method is not implemented through the cooperation of a terminal and a server, and does not require the participation of a network system.
Since the principle by which this method solves the problem is similar to that of the foregoing method, its implementation can refer to the implementation of the foregoing method and is not repeated here.
Based on the same principle, the embodiment also discloses a terminal. As shown in fig. 20, in this embodiment, the terminal includes:
the first preprocessing unit 11 is configured to obtain a text carrier and convert the text carrier into an image to be recognized;
the first sending unit 12 is configured to transmit the image to be recognized to a server, so that the server locates a text box including characters on the image to be recognized through a character recognition model, performs character recognition on the text box to obtain characters, corrects the characters, converts the corrected characters into an audio file, stores the audio file, and obtains a storage address of the audio file;
and the first receiving unit 13 is configured to receive a storage address of the audio file transmitted by the server, acquire the audio file according to the storage address, and play the audio file to a user.
Since the principle of the terminal to solve the problem is similar to the above method, the implementation of the terminal may refer to the implementation of the method, and is not described herein again.
Based on the same principle, the embodiment also discloses a server. As shown in fig. 21, the present embodiment includes:
a second receiving unit 21, configured to receive the image to be recognized transmitted by the terminal;
a first recognition conversion unit 22, configured to locate a text box containing characters on the image to be recognized through a character recognition model, perform character recognition on the text box to obtain the characters, correct the characters, convert the corrected characters into an audio file, store the audio file and obtain a storage address of the audio file;
and the second sending unit 23 is configured to feed back the storage address of the audio file to the terminal, so that the terminal obtains the audio file according to the storage address and plays the audio file to the user.
Since the principle of solving the problem by the server is similar to the above method, the implementation of the server may refer to the implementation of the method, and is not described herein again.
Based on the same principle, the embodiment also discloses a voice auxiliary system for the visually impaired. As shown in fig. 22, in the present embodiment, the system includes a second preprocessing unit 31, a second recognition converting unit 32, and an audio file acquiring unit 33.
The second preprocessing unit 31 is configured to obtain a text carrier, and convert the text carrier into an image to be recognized.
The second recognition conversion unit 32 is configured to locate a text box including characters on the image to be recognized through a character recognition model, perform character recognition on the text box to obtain characters and correct the characters, convert the corrected characters into an audio file, store the audio file, and obtain a storage address of the audio file.
The audio file obtaining unit 33 is configured to obtain the audio file according to the storage address and play the audio file to the user.
It should be noted that, in the system of this embodiment, the second preprocessing unit 31, the second recognition converting unit 32 and the audio file obtaining unit 33 belong to the same computer device (for example, a mobile phone).
Since the principle of the system for solving the problem is similar to the above method, the implementation of the system can refer to the implementation of the method, and the detailed description is omitted here.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer device, which may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
In a typical example, the computer device specifically comprises a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method performed by the client as described above when executing the program, or the processor implementing the method performed by the server as described above when executing the program.
Referring now to FIG. 23, shown is a schematic block diagram of a computer device 600 suitable for use in implementing embodiments of the present application.
As shown in fig. 23, the computer apparatus 600 includes a Central Processing Unit (CPU) 601 which can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the system 600. The CPU 601, ROM 602 and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read from it can be installed into the storage section 608 as needed.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (14)

1. A voice assistance method for the visually impaired, comprising:
acquiring a text carrier, and converting the text carrier into an image to be recognized;
transmitting the image to be recognized to a server, so that the server positions a text box comprising characters on the image to be recognized through a character recognition model, performs character recognition on the text box to obtain the characters, corrects the characters, converts the corrected characters into an audio file, stores the audio file, and obtains a storage address of the audio file;
and receiving a storage address of the audio file transmitted by the server, acquiring the audio file according to the storage address and playing the audio file to a user.
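For illustration only, and not as part of the claims: a minimal Python sketch of the terminal-side flow of claim 1, using the `requests` library. The server URL, route, and the `audio_url` JSON field are assumptions, not anything the patent specifies.

```python
import requests

SERVER_URL = "https://example.com/ocr"  # hypothetical server endpoint

def fetch_audio_for_image(image_path: str) -> bytes:
    """Upload the image to be recognized; return the audio file the server produced."""
    # Transmit the image to the server for recognition and conversion.
    with open(image_path, "rb") as f:
        resp = requests.post(SERVER_URL, files={"image": f}, timeout=30)
    resp.raise_for_status()

    # The server replies with the storage address of the generated audio file.
    audio_url = resp.json()["audio_url"]

    # Acquire the audio file from that address so it can be played to the user.
    audio = requests.get(audio_url, timeout=30)
    audio.raise_for_status()
    return audio.content
```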
2. The method as claimed in claim 1, wherein the step of acquiring the text carrier comprises:
collecting at least one image through an image collection device to form the text carrier; or,
receiving a voice assistance request transmitted by a terminal APP, and acquiring the text carrier to be recognized from the terminal APP.
3. The method of claim 1, wherein the text carrier comprises one or more of an image, a video, a Word document, a PDF, and a PPT.
4. The method of claim 1, further comprising:
receiving a voice instruction of a visually impaired person and/or a touch instruction input through a touch display screen;
analyzing the voice command and/or the touch command to obtain control information;
and determining a control instruction according to the control information and executing corresponding operation.
5. The method of claim 4, wherein the corresponding operations comprise one or more of:
acquiring a text carrier;
displaying the detection progress of the text carrier;
playing the obtained audio file;
stopping playing the audio file;
and displaying the recognized characters.
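For illustration only: claims 4 and 5 amount to mapping parsed control information onto a fixed set of operations, which a dispatch table captures naturally. In the Python sketch below, the keyword strings and handler bodies are hypothetical placeholders, since the patent does not name them.

```python
from typing import Callable, Dict

# Hypothetical keywords mapped to the operations enumerated in claim 5.
OPERATIONS: Dict[str, Callable[[], None]] = {
    "read":     lambda: print("acquiring a text carrier..."),
    "progress": lambda: print("displaying the detection progress..."),
    "play":     lambda: print("playing the obtained audio file..."),
    "stop":     lambda: print("stopping audio playback..."),
    "show":     lambda: print("displaying the recognized characters..."),
}

def execute(control_info: str) -> None:
    """Determine the control instruction from parsed control information and run it."""
    action = OPERATIONS.get(control_info)
    if action is None:
        print(f"unrecognized command: {control_info!r}")
    else:
        action()

execute("play")  # -> playing the obtained audio file...
```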
6. A voice assistance method for the visually impaired, comprising:
receiving an image to be recognized transmitted by a terminal;
positioning a text box comprising characters on the image to be recognized through a character recognition model, performing character recognition on the text box to obtain the characters, correcting the characters, converting the corrected characters into an audio file, storing the audio file, and obtaining a storage address of the audio file;
and feeding back the storage address of the audio file to the terminal so that the terminal can acquire the audio file according to the storage address and play the audio file to a user.
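For illustration only: a minimal server-side sketch of claim 6 using Flask. The route, storage directory, URL scheme, and the two stub functions standing in for the recognition pipeline and the text-to-speech step are all assumptions.

```python
import os
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
AUDIO_DIR = "/var/audio"                      # hypothetical storage location
AUDIO_BASE_URL = "https://example.com/audio"  # hypothetical public address

def recognize_and_correct(image_bytes: bytes) -> str:
    """Stand-in for the character recognition model of claim 7."""
    raise NotImplementedError("plug in the text recognition pipeline here")

def synthesize_speech(text: str) -> bytes:
    """Stand-in for the text-to-audio conversion step."""
    raise NotImplementedError("plug in a text-to-speech engine here")

@app.route("/ocr", methods=["POST"])
def ocr_endpoint():
    # Receive the image to be recognized transmitted by the terminal.
    image_bytes = request.files["image"].read()
    text = recognize_and_correct(image_bytes)
    audio_bytes = synthesize_speech(text)
    # Store the audio file and obtain its storage address.
    filename = f"{uuid.uuid4().hex}.mp3"
    with open(os.path.join(AUDIO_DIR, filename), "wb") as f:
        f.write(audio_bytes)
    # Feed the storage address back to the terminal.
    return jsonify({"audio_url": f"{AUDIO_BASE_URL}/{filename}"})
```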
7. The method as claimed in claim 6, wherein positioning the text box comprising characters on the image to be recognized through the character recognition model, performing character recognition on the text box to obtain the characters, and correcting the characters specifically comprises:
positioning at least one text box containing characters on the image to be recognized through a CTPN unit;
performing character recognition on the text box through a DenseNet unit to obtain the characters;
and correcting the characters through a CTC unit to obtain characters that conform to natural language rules.
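For illustration only: the patent describes the CTC unit as correcting the recognized characters; in common OCR pipelines, CTC decoding collapses the per-timestep class scores emitted by a recognizer such as DenseNet into a character string. The Python sketch below shows greedy CTC decoding under an assumed charset and score layout.

```python
import numpy as np

def ctc_greedy_decode(scores: np.ndarray, charset: str, blank: int = 0) -> str:
    """Greedy CTC decoding: take the best class per timestep, collapse
    consecutive repeats, then drop the blank symbol.

    scores: (T, C) array of per-timestep class scores from the recognizer.
    charset: characters for classes 1..C-1; class `blank` is the CTC blank.
    """
    best = scores.argmax(axis=1)  # most likely class at each timestep
    collapsed = [k for i, k in enumerate(best) if i == 0 or k != best[i - 1]]
    return "".join(charset[k - 1] for k in collapsed if k != blank)

# Illustrative run over a 2-character alphabet (0 = blank, 1 = 'a', 2 = 'b'):
scores = np.array([[0.1, 0.8, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.9, 0.05, 0.05],
                   [0.1, 0.1, 0.8]])
print(ctc_greedy_decode(scores, "ab"))  # -> "ab"
```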
8. The method of claim 7, further comprising, prior to obtaining the text carrier:
determining setting parameters of the CTPN unit, the DenseNet unit, and the CTC unit;
connecting the CTPN unit, the DenseNet unit, and the CTC unit in sequence according to the setting parameters to form a neural network model;
and selecting existing image data to form training samples, training the neural network model to obtain optimized setting parameters, and forming the deep-learning-based character recognition model according to the optimized setting parameters.
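For illustration only: a hedged PyTorch sketch of the training setup of claim 8. A tiny stand-in network replaces the real CTPN and DenseNet stages, the class count and random tensors are placeholders for actual training samples, and only `nn.CTCLoss` corresponds directly to the CTC unit.

```python
import torch
import torch.nn as nn

# Tiny stand-in for the recognizer: the real pipeline would use a CTPN stage to
# locate text boxes and a DenseNet stage for per-timestep features (claim 7).
class TinyRecognizer(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Conv2d(1, 64, kernel_size=3, padding=1)
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):               # x: (N, 1, H, W)
        f = torch.relu(self.features(x))
        f = f.mean(dim=2)               # pool out height -> (N, 64, W)
        f = f.permute(2, 0, 1)          # -> (T=W, N, 64), the layout CTCLoss expects
        return self.classifier(f).log_softmax(dim=2)

num_classes = 100                       # assumed charset size + 1 for the CTC blank
model = TinyRecognizer(num_classes)
criterion = nn.CTCLoss(blank=0)         # CTC objective used to train the recognizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batch standing in for the "existing image data" training samples.
images = torch.randn(4, 1, 32, 128)
targets = torch.randint(1, num_classes, (4, 10))
input_lengths = torch.full((4,), 128, dtype=torch.long)
target_lengths = torch.full((4,), 10, dtype=torch.long)

optimizer.zero_grad()
log_probs = model(images)               # (T, N, C) log-probabilities
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()
optimizer.step()                        # one optimization step over the sketch batch
```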
9. A voice assistance method for the visually impaired, comprising:
acquiring a text carrier, and converting the text carrier into an image to be recognized;
positioning a text box comprising characters on the image to be recognized through a character recognition model, performing character recognition on the text box to obtain the characters, correcting the characters, converting the corrected characters into an audio file, storing the audio file and obtaining a storage address of the audio file;
and acquiring the audio file according to the storage address and playing the audio file to a user.
10. A terminal, comprising:
the first preprocessing unit is used for acquiring a text carrier and converting the text carrier into an image to be recognized;
the first sending unit is used for transmitting the image to be recognized to a server so that the server positions a text box comprising characters on the image to be recognized through a character recognition model, performs character recognition on the text box to obtain the characters and corrects the characters, converts the corrected characters into an audio file, stores the audio file and obtains a storage address of the audio file;
and the first receiving unit is used for receiving the storage address of the audio file transmitted by the server, acquiring the audio file according to the storage address and playing the audio file to a user.
11. A server, comprising:
the second receiving unit is used for receiving the image to be recognized transmitted by the terminal;
the first recognition conversion unit is used for positioning a text box comprising characters on the image to be recognized through a character recognition model, performing character recognition on the text box to obtain the characters and correcting the characters, converting the corrected characters into an audio file, storing the audio file, and obtaining a storage address of the audio file;
and the second sending unit is used for feeding back the storage address of the audio file to the terminal so that the terminal can acquire the audio file according to the storage address and play the audio file to a user.
12. A voice assistance system for the visually impaired, characterized by comprising a second preprocessing unit, a second recognition conversion unit, and an audio file acquisition unit;
the second preprocessing unit is used for acquiring a text carrier and converting the text carrier into an image to be recognized;
the second recognition conversion unit is used for positioning a text box comprising characters on the image to be recognized through a character recognition model, performing character recognition on the text box to obtain the characters and correcting the characters, converting the corrected characters into an audio file, storing the audio file, and obtaining a storage address of the audio file;
and the audio file acquisition unit is used for acquiring the audio file according to the storage address and playing the audio file to a user.
13. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor,
the processor, when executing the program, implements the method of any of claims 1-9.
14. A computer-readable medium, having stored thereon a computer program,
wherein the program, when executed by a processor, implements the method according to any one of claims 1-9.
CN201911113176.7A 2019-11-14 2019-11-14 Voice assisting method, terminal, server and system for visually impaired Pending CN111046223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911113176.7A CN111046223A (en) 2019-11-14 2019-11-14 Voice assisting method, terminal, server and system for visually impaired


Publications (1)

Publication Number Publication Date
CN111046223A true CN111046223A (en) 2020-04-21

Family

ID=70232009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911113176.7A Pending CN111046223A (en) 2019-11-14 2019-11-14 Voice assisting method, terminal, server and system for visually impaired

Country Status (1)

Country Link
CN (1) CN111046223A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010164921A (en) * 2009-01-19 2010-07-29 Yoshimi Electronics Co Ltd Conversation supplemental device for disabled person
CN104143084A (en) * 2014-07-17 2014-11-12 武汉理工大学 Auxiliary reading glasses for visual impairment people
CN110287830A (en) * 2019-06-11 2019-09-27 广州市小篆科技有限公司 Intelligence wearing terminal, cloud server and data processing method
CN110334712A (en) * 2019-06-11 2019-10-15 广州市小篆科技有限公司 Intelligence wearing terminal, cloud server and data processing method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881825A (en) * 2020-07-28 2020-11-03 深圳市点通数据有限公司 Interactive text recognition method and system based on multi-perception data
CN111881825B (en) * 2020-07-28 2023-10-17 深圳市点通数据有限公司 Interactive text recognition method and system based on multi-perception data
CN113225615A (en) * 2021-04-20 2021-08-06 深圳市九洲电器有限公司 Television program playing method, terminal equipment, server and storage medium
CN113225615B (en) * 2021-04-20 2023-08-08 深圳市九洲电器有限公司 Television program playing method, terminal equipment, server and storage medium

Similar Documents

Publication Publication Date Title
US10796685B2 (en) Method and device for image recognition
CN109993150B (en) Method and device for identifying age
US11436863B2 (en) Method and apparatus for outputting data
US20190205618A1 (en) Method and apparatus for generating facial feature
CN111652093B (en) Text image processing method and device
CN112184497B (en) Customer visit track tracking and passenger flow analysis system and method
US11210563B2 (en) Method and apparatus for processing image
WO2021088790A1 (en) Display style adjustment method and apparatus for target device
CN111046223A (en) Voice assisting method, terminal, server and system for visually impaired
CN109360565A (en) A method of precision of identifying speech is improved by establishing resources bank
WO2023197648A1 (en) Screenshot processing method and apparatus, electronic device, and computer readable medium
CN112836037A (en) Method and device for recommending dialect
CN112420049A (en) Data processing method, device and storage medium
CN110046571B (en) Method and device for identifying age
CN111859970A (en) Method, apparatus, device and medium for processing information
CN112309389A (en) Information interaction method and device
CN110765304A (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN112231023A (en) Information display method, device, equipment and storage medium
CN114844985A (en) Data quality inspection method, device, equipment and storage medium
CN113220912A (en) Interactive assistance method and device and computer readable storage medium
CN113362105B (en) User tag forming method, apparatus and computer readable storage medium
CN111783515A (en) Behavior action recognition method and device
CN112000218A (en) Object display method and device
CN113591513B (en) Method and apparatus for processing image
CN112115740A (en) Method and apparatus for processing image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200421