CN103390159A - Method and device for converting screen character into voice

Info

Publication number: CN103390159A
Application number: CN2013103066970A
Authority: CN
Inventors: 罗骁
Original assignee: China Security and Fire Technology Co Ltd
Current assignee: China Security and Fire Technology Co Ltd
Priority date: 2013-07-19
Filing date: 2013-07-19
Publication date: 2013-11-13

Abstract

The invention is applicable to the field of artificial intelligence, and provides a method and a device for converting a screen character into a voice. The method comprises the steps as follows: a screenshot is formed from a content of a to-be-converted picture on a terminal screen; the content of the screenshot is converted into a character content; and the character content is converted into a voice signal. The device comprises a screenshot module, a character recognition module and a sound converting module, wherein the screenshot module is used for screenshot of the to-be-converted picture content on the terminal screen; the character recognition module is used for converting the screenshot content into the character content; and the sound converting module is used for converting the character content into the voice signal. According to the method and the device, a voice recognition technology and an optical character recognition technology are combined, characters on the screen are converted into voices on a terminal device, and the voices are read and output.

Description

Method and device for converting screen text into voice

Technical field

The invention belongs to the field of artificial intelligence, and in particular relates to a method and device for converting screen text into voice.

Background technology

Intelligent terminals are now ubiquitous in daily life. They deliver a wide variety of information for work and everyday use and have greatly broadened the range of information people can access. As intelligent terminals gain more functions, many of them can output text as voice, but only within specific applications, which makes the function very inconvenient to use.

Many applications on PCs and smartphones (e-book readers, for example) can already read aloud the text of a page inside the application, or read aloud the text the user has selected. Such applications typically work as follows: a read-aloud event is triggered inside the application, the text to be read is obtained, the text is converted to speech, and the speech is played through the loudspeaker. The main drawback of such software is that conversion operates on text content only. Text that appears inside an image on the screen cannot be converted, because only content in text format is supported. This makes operation inconvenient and degrades the user experience.

Summary of the invention

The purpose of the embodiments of the present invention is to provide a method and device for converting screen text into voice, intended to solve the problem that existing intelligent terminals cannot convert image content into voice for the user.

The embodiments of the present invention are implemented as follows: a method for converting screen text into voice, the method comprising:

Capturing image content to be converted on a terminal screen;

Converting the captured image content into text content;

Converting the text content into a voice signal.

Further, capturing the image content to be converted on the terminal screen comprises:

Displaying a mask layer after receiving an event that triggers conversion;

Marking out an image region within the area of the mask layer;

Saving the image content within the image region as a bitmap object.

Further, converting the captured image content into text content comprises:

Obtaining the text information in the bitmap object according to the bitmap object and a preset optical character recognition algorithm;

Performing syntactic and semantic analysis on the text information to obtain a text string.

Further, converting the text content into a voice signal comprises:

Generating, according to the text string and a preset speech recognition engine, the voice signal corresponding to the text string.

Further, the method also comprises:

Playing the voice signal.

The present invention also proposes a device for converting screen text into voice, the device comprising:

A screenshot module, configured to capture image content to be converted on a terminal screen;

A character recognition module, configured to convert the captured image content into text content;

A sound conversion module, configured to convert the text content into a voice signal.

Further, the screenshot module comprises:

A trigger unit, configured to display a mask layer after receiving an event that triggers conversion;

A selection unit, configured to mark out an image region within the area of the mask layer;

A storage unit, configured to save the image content within the image region as a bitmap object.

Further, the character recognition module comprises:

An acquiring unit, configured to obtain the text information in the bitmap object according to the bitmap object and a preset optical character recognition algorithm;

An analysis unit, configured to perform syntactic and semantic analysis on the text information to obtain a text string.

Further, the sound conversion module is specifically configured to:

Further, the device also comprises:

A voice output module, configured to play the voice signal.

In the embodiments of the present invention, the terminal locks the screen image, the user selects the region to be converted, and technologies such as intelligent screenshot, OCR recognition and speech conversion are combined to output the text in an image as voice. This is particularly suitable for visually impaired users, is simple to operate, and improves the user experience.

Description of drawings

Fig. 1 is a flowchart of the method for converting screen text into voice provided by embodiment one of the present invention;

Fig. 2 is a structural diagram of the device for converting screen text into voice provided by embodiment two of the present invention;

Fig. 3 is a structural diagram of the screenshot module in the device for converting screen text into voice provided by embodiment two of the present invention;

Fig. 4 is a structural diagram of the character recognition module in the device for converting screen text into voice provided by embodiment two of the present invention.

Embodiment

To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.

Embodiment one

Embodiment one of the present invention proposes a method for converting screen text into voice. As shown in Fig. 1, the method of embodiment one comprises the following steps:

Step S1: capture the image content to be converted on the terminal screen.

In embodiment one, a trigger button can be placed on the terminal screen in advance, for example a floating toolbar or a half-hidden sidebar, which can sit on the topmost layer of the screen and is used to trigger the conversion event. Once the event is triggered, a transparent mask layer (its transparency can be set as required) appears on the topmost layer of the screen. The mask layer locks the current screen interface: the original interface no longer receives other user events, while the mask layer receives operation events such as touch selection or mouse-drag selection.

When the user marks out an area on the mask layer, the terminal selects an image region on the mask layer according to the user's operation. The image region can be the area covered by a touch slide, or the area enclosed by a selection box. From the extent of the gesture, the terminal computes a rectangular region and records its width and height attributes together with the relative coordinates of its starting point.
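As an illustration of the rectangle computation described above, the following Python sketch derives a rectangular region from the points of a slide gesture. It assumes the gesture arrives as a list of (x, y) coordinates relative to the mask layer. The names `Rect` and `bounding_rect` are hypothetical and do not come from the invention.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Rect:
    """Hypothetical record of the selected region (not named in the source)."""
    x: int       # left edge, relative to the mask layer origin
    y: int       # top edge, relative to the mask layer origin
    width: int
    height: int


def bounding_rect(points: Iterable[Tuple[int, int]]) -> Rect:
    """Return the smallest axis-aligned rectangle covering all gesture points."""
    pts = list(points)
    if not pts:
        raise ValueError("gesture contains no points")
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    left, top = min(xs), min(ys)
    return Rect(left, top, max(xs) - left, max(ys) - top)


# Assumed sample stroke: four touch positions reported while the finger slides.
stroke = [(120, 340), (180, 352), (260, 330), (310, 365)]
rect = bounding_rect(stroke)
print(rect)
```

Taking the bounding box of every sampled point, rather than only the first and last, keeps the region correct when the finger wanders above or below the starting position.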

The terminal creates a temporary bitmap object IMG, draws the selected rectangular region into it (that is, captures the selected rectangle), and stores the temporary bitmap object IMG in the cached bitmap queue CacheList.
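The capture and caching step can be sketched as follows. This is a minimal illustration, assuming the screen is available as rows of pixel values. The function `crop` and the variable names are hypothetical, and a `collections.deque` stands in for the cached bitmap queue CacheList.

```python
from collections import deque
from typing import Deque, List

Bitmap = List[List[int]]  # assumed representation: rows of pixel values


def crop(screen: Bitmap, x: int, y: int, width: int, height: int) -> Bitmap:
    """Copy the selected rectangle out of the screen into a new bitmap."""
    if width <= 0 or height <= 0:
        raise ValueError("selected region is empty")
    if x < 0 or y < 0 or y + height > len(screen) or x + width > len(screen[0]):
        raise ValueError("selected region lies outside the screen")
    return [row[x:x + width] for row in screen[y:y + height]]


# Stand-in for CacheList: a first-in, first-out queue of captured bitmaps.
cache_list: Deque[Bitmap] = deque()

# Assumed 6x4 test screen whose pixel value encodes its row and column.
screen = [[10 * r + c for c in range(6)] for r in range(4)]
img = crop(screen, x=1, y=1, width=3, height=2)
cache_list.append(img)
print(img)
```

A queue lets capture and recognition run at different speeds: the recognition step can later take the oldest bitmap with `cache_list.popleft()`.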

Step S2: convert the captured image content into text content.

The captured bitmap object IMG is taken from the cached bitmap queue and passed through text recognition (a third-party library may be used) to obtain the text information in the bitmap object. Syntactic and semantic analysis is then performed to obtain the final text string. In this step, text recognition uses a preset optical character recognition algorithm. Optical character recognition (OCR) is the process of scanning the text in an image and then analyzing the image file to obtain the characters and layout information. It can quickly convert the text in an image into a text string.
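The embodiment does not specify how the recognized text is analyzed or filtered. One possible post-processing step, shown purely as an assumption, tidies raw recognizer output into a single text string by dropping empty lines, rejoining words hyphenated across line breaks, and collapsing whitespace. The function name `clean_ocr_text` is hypothetical.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Turn multi-line recognizer output into one text string (assumed filter)."""
    lines = [ln.strip() for ln in raw.splitlines()]
    lines = [ln for ln in lines if ln]          # drop blank lines
    text = ""
    for ln in lines:
        if text.endswith("-"):
            text = text[:-1] + ln               # rejoin a word split across lines
        elif text:
            text += " " + ln
        else:
            text = ln
    return re.sub(r"\s+", " ", text)            # collapse runs of whitespace


# Assumed sample of raw recognizer output.
raw = "Screen text is con-\nverted   into\n\n  voice. "
print(clean_ocr_text(raw))
```

This filter is deliberately simple. It would also merge a genuinely hyphenated compound that happens to end a line, so a real implementation would likely need a dictionary check or a language-specific rule.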

Step S3: convert the text content into a voice signal.

A preset speech recognition engine is called to convert the text string into the corresponding voice signal. (Many mature solutions exist, such as third-party libraries, and some operating systems even have a built-in module that converts text to voice.) The voice signal is then output through the loudspeaker to play the sound.

Embodiment one combines speech recognition technology with optical character recognition so that text on the screen of an intelligent terminal is converted into voice and read aloud. By locking the screen image, letting the user select the region to be converted, and combining technologies such as intelligent screenshot, OCR recognition and speech conversion, the text in an image is output as voice. This is particularly suitable for visually impaired users, is simple to operate, and improves the user experience.

To further illustrate the method of embodiment one, the following two examples are given:
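Steps S1 to S3 and the final playback can be chained as in the sketch below. It is only an outline under stated assumptions: the capture, recognition, synthesis and playback functions are passed in as callables because the embodiment leaves the concrete engines open (third-party or built into the operating system). The function name `screen_text_to_voice` and the stand-in lambdas are hypothetical.

```python
from typing import Callable, List

Bitmap = List[List[int]]  # assumed representation: rows of pixel values


def screen_text_to_voice(
    capture: Callable[[], Bitmap],        # step S1: capture the selected region
    recognize: Callable[[Bitmap], str],   # step S2: image content to text string
    synthesize: Callable[[str], bytes],   # step S3: text string to voice signal
    play: Callable[[bytes], None],        # playback through the loudspeaker
) -> str:
    """Run the three conversion steps in order and play the result."""
    image = capture()
    text = recognize(image)
    if not text.strip():
        return ""                         # nothing recognized, nothing to read
    play(synthesize(text))
    return text


# Stand-in engines so the sketch runs without any OCR or speech library.
played: List[bytes] = []
result = screen_text_to_voice(
    capture=lambda: [[0, 1], [1, 0]],
    recognize=lambda image: "hello screen",
    synthesize=lambda text: text.encode("utf-8"),
    play=played.append,
)
print(result)
```

Passing the engines in as parameters mirrors the statement that third-party libraries may be used: each stage can be swapped without touching the others.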

Example one

The steps for converting screen text into voice reading on a smartphone are as follows:

1. After the operating system starts, a background service process is initialized and runs an event monitoring module, ready to respond to the defined events at any time.

2. A control widget is created on the screen, for example a half-hidden sidebar that can be clicked to pull out the fully displayed button.

3. When the button is clicked, a mask layer with 90% transparency appears on the screen, and touch events on the screen are monitored on the mask layer.

4. When a finger slides across an area of the screen, the finger-lift event is captured and the background conversion algorithm is called, which mainly comprises five steps:

1) Compute a rectangular region from the selected area;

2) Capture the rectangular region and save it as an image object;

3) Perform text recognition on the image object and filter the result into a text string;

4) Perform speech conversion on the text string to convert it into sound;

5) Call the system voice output unit to output the voice;

5. When processing is complete, the display of the touched area is restored and the next touch event is awaited.

6. Clicking Back exits the recognition state.
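The event flow of steps 3 to 6 can be modeled as a small state holder, sketched below without any real UI toolkit. The class `SelectionOverlay`, its method names and the callback are hypothetical. On a real smartphone these methods would be wired to the platform's touch listeners.

```python
from typing import Callable, List, Tuple

Region = Tuple[int, int, int, int]  # x, y, width, height


class SelectionOverlay:
    """Hypothetical model of the mask layer's event handling."""

    def __init__(self, on_region: Callable[[Region], None]) -> None:
        self.active = False                       # True while the mask layer is shown
        self._points: List[Tuple[int, int]] = []
        self._on_region = on_region               # stands in for the conversion algorithm

    def button_clicked(self) -> None:
        """Step 3: show the mask layer and start listening for touches."""
        self.active = True
        self._points = []

    def touch_moved(self, x: int, y: int) -> None:
        """Record the finger position; ignored while the mask layer is hidden."""
        if self.active:
            self._points.append((x, y))

    def touch_lifted(self) -> None:
        """Step 4: on finger lift, turn the stroke into a rectangle and convert it."""
        if not self.active or not self._points:
            return
        xs = [p[0] for p in self._points]
        ys = [p[1] for p in self._points]
        region = (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))
        self._points = []                         # step 5: ready for the next touch
        self._on_region(region)

    def back_clicked(self) -> None:
        """Step 6: leave the recognition state."""
        self.active = False
        self._points = []


regions: List[Region] = []
overlay = SelectionOverlay(regions.append)
overlay.touch_moved(5, 5)            # ignored: mask layer not shown yet
overlay.button_clicked()
for x, y in [(10, 20), (40, 25), (70, 60)]:
    overlay.touch_moved(x, y)
overlay.touch_lifted()               # regions now holds (10, 20, 60, 40)
overlay.back_clicked()
print(regions)
```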

Example two

The steps for converting screen text into voice reading on a PC are as follows:

2. A control widget is created on the screen, for example a semi-transparent toolbar floating in the lower right corner that can be clicked to pull out the fully displayed button.

3. When the button is clicked (or a shortcut such as Ctrl+Shift+C is used), a mask layer with 90% transparency appears on the screen, and mouse drag-and-drop events on the screen are monitored on the mask layer.

4. When the mouse is dragged across an area of the current screen, the mouse-release event that ends the drag is captured and the background conversion algorithm is called, which mainly comprises five steps:

1) Compute a rectangular region from the selected area;

2) Capture the rectangular region and save it as an image object;

3) Perform text recognition on the image object and filter the result to obtain a text string;

4) Perform speech conversion on the text string to convert it into sound;

5) Call the system voice output unit to output the voice;

5. When processing is complete, the display of the selected area is restored and the next mouse drag-and-drop selection event is awaited.

6. Clicking Back exits the recognition state.

Embodiment two

Embodiment two of the present invention proposes a device for converting screen text into voice. The device of embodiment two can be the terminal itself, or a device built into or attached externally to the terminal. As shown in Fig. 2, the device of embodiment two comprises:

A screenshot module 10, configured to capture the image content to be converted on the terminal screen;

A character recognition module 20, configured to convert the captured image content into text content;

A sound conversion module 30, configured to convert the text content into a voice signal;

A voice output module 40, configured to play the voice signal.

As shown in Fig. 3, the screenshot module 10 comprises:

A trigger unit 11, configured to display a mask layer after receiving an event that triggers conversion;

A selection unit 12, configured to mark out an image region within the area of the mask layer;

A storage unit 13, configured to save the image content within the image region as a bitmap object.

As shown in Fig. 4, the character recognition module 20 comprises:

An acquiring unit 21, configured to obtain the text information in the bitmap object according to the bitmap object and a preset optical character recognition algorithm;

An analysis unit 22, configured to perform syntactic and semantic analysis on the text information to obtain a text string.

In embodiment two, a trigger button can be placed on the terminal screen in advance, for example a floating toolbar or a half-hidden sidebar, which can sit on the topmost layer of the screen and is used to trigger the conversion event. Once the event is triggered, the trigger unit 11 displays a transparent mask layer (its transparency can be set as required) on the topmost layer of the screen. The mask layer locks the current screen interface: the original interface no longer receives other user events, while the mask layer receives operation events such as touch selection or mouse-drag selection.

When the user marks out an area on the mask layer, the selection unit 12 selects an image region on the mask layer according to the user's operation. The image region can be the area covered by a touch slide, or the area enclosed by a selection box. From the extent of the gesture, the terminal computes a rectangular region and records its width and height attributes together with the relative coordinates of its starting point.

The storage unit 13 creates a temporary bitmap object IMG, draws the selected rectangular region into it (that is, captures the selected rectangle), and stores the temporary bitmap object IMG in the cached bitmap queue CacheList.

The character recognition module 20 converts the captured image content into text content. The acquiring unit 21 takes the captured bitmap object IMG from the cached bitmap queue and performs text recognition (a third-party library may be used) to obtain the text information in the bitmap object. The analysis unit 22 then performs syntactic and semantic analysis to obtain the final text string. The acquiring unit 21 can use a preset optical character recognition algorithm. Optical character recognition is the process of scanning the text in an image and then analyzing the image file to obtain the characters and layout information. It can quickly convert the text in an image into a text string.

The sound conversion module 30 converts the text content into a voice signal. It calls a preset speech recognition engine to convert the text string into the corresponding voice signal. (Many mature solutions exist, such as third-party libraries, and some operating systems even have a built-in module that converts text to voice.) The voice output module 40 then outputs the voice signal to play the sound.

The device of embodiment two combines speech recognition technology with optical character recognition so that text on the screen of an intelligent terminal is converted into voice and read aloud. By locking the screen image, letting the user select the region to be converted, and combining technologies such as intelligent screenshot, OCR recognition and speech conversion, the text in an image is output as voice. This is particularly suitable for visually impaired users, is simple to operate, and improves the user experience.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for converting screen text into voice, characterized in that the method comprises:

Capturing image content to be converted on a terminal screen;

Converting the captured image content into text content;

Converting the text content into a voice signal.

2. The method of claim 1, characterized in that capturing the image content to be converted on the terminal screen comprises:

Displaying a mask layer after receiving an event that triggers conversion;

Marking out an image region within the area of the mask layer;

Saving the image content within the image region as a bitmap object.

3. The method of claim 2, characterized in that converting the captured image content into text content comprises:

4. The method of claim 3, characterized in that converting the text content into a voice signal comprises:

5. The method of any one of claims 1 to 4, characterized in that the method further comprises:

Playing the voice signal.

6. A device for converting screen text into voice, characterized in that the device comprises:

7. The device of claim 6, characterized in that the screenshot module comprises:

8. The device of claim 7, characterized in that the character recognition module comprises:

9. The device of claim 8, characterized in that the sound conversion module is specifically configured to:

10. The device of any one of claims 6 to 9, characterized in that the device further comprises:

A voice output module, configured to play the voice signal.