WO2022088946A1 - Character selection method, apparatus and terminal device for curved text - Google Patents

Character selection method, apparatus and terminal device for curved text

Info

Publication number
WO2022088946A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
picture
coordinates
text
characters
Application number
PCT/CN2021/115904
Other languages
English (en)
French (fr)
Inventor
滕益华
洪芳宇
施烈航
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2022088946A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048: Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481: Interaction techniques based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F 3/0483: Interaction with page-structured environments, e.g. book metaphor
    • G06F 3/0484: Interaction techniques for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F 3/04845: Interaction techniques for image manipulation, e.g. dragging, rotation, expansion or change of colour
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06T 11/40: Filling a planar surface by adding surface attributes, e.g. colour or texture
    • G06T 11/60: Editing figures and text; Combining figures or text

Definitions

  • the present application belongs to the technical field of artificial intelligence, and in particular relates to a character selection method, device and terminal device for curved text.
  • OCR: Optical Character Recognition
  • CV: Computer Vision
  • AR: Augmented Reality
  • OCR can include two steps, namely text area detection and text content recognition. The former detects where the text area is in the image, and the latter recognizes the specific content of the text in that area.
  • however, in many cases the text is not horizontal; it may be curved text in an arc shape, or curved text in a wavy shape.
  • the scene of curved text brings great challenges to the detection and recognition of OCR.
  • the smart lens can complete multi-object detection and text detection and recognition tasks, and will provide the location of objects and text boxes.
  • the user can click a text line on the phone screen, at which point the screen is frozen, and then click the detected and recognized text content; the effect is similar to dragging and selecting text character by character with the mouse in a text document. For the clicked content, the user can choose follow-up operations such as copying, translating or searching.
  • Text selection scenarios are divided into straight text scenarios and curved text scenarios.
  • the text lines are all angled rectangular boxes, parallelogram boxes or quadrilateral boxes, and the text lines are generally described by four vertices.
  • the text box is generally composed of a polygon, and the polygon can be decomposed into many quadrilaterals, and these quadrilaterals are spliced into the polygon.
  • once the coordinates of each character on the image to be recognized are obtained, the coordinates of each character of the straight text on the original image can be easily obtained.
  • in the curved text scene, due to the strong discontinuity and the complex description of curved text itself, obtaining the coordinates of each character is much more difficult than in the straight scene.
  • Embodiments of the present application provide a character selection method, device and terminal device for curved text, which are used to accurately obtain the coordinates of each character in the curved text, thereby achieving more accurate character selection.
  • an embodiment of the present application provides a character selection method for curved text, which is applied to a terminal device, and the method includes:
  • the display interface of the terminal device displays the original picture, and the original picture contains curved text; the terminal device then detects the original picture and generates a to-be-recognized picture containing straight text, wherein the text content of the straight text corresponds one-to-one with the text content of the curved text.
  • the terminal device recognizes, according to the to-be-recognized picture, a connectionist temporal classification (CTC) sequence corresponding to the text content of the straight text, wherein the CTC sequence includes a plurality of characters; the terminal device then calculates the first coordinate of each of the plurality of characters in the to-be-recognized picture; determines the segmented area of the to-be-recognized picture in which the first coordinate of each character is located; determines, according to the original picture and the to-be-recognized picture, the perspective transformation matrix corresponding to each segmented area when the original picture is transformed into the to-be-recognized picture; and multiplies the first coordinate of each character by the corresponding perspective transformation matrix to obtain the second coordinate of each character in the original picture. Finally, a first operation performed by the user on the original picture is detected, the first operation being used to select characters in the curved text on the original picture.
  • the character selection method for curved text provided by the embodiments of the present application calculates the position of each character within the CTC sequence corresponding to the text content, obtains the first coordinate of each character in the picture to be recognized from the correspondence between CTC sequence indices and text coordinates, and then, according to the segmented perspective transformation relationship between the picture to be recognized and the original picture, maps each first coordinate one-to-one to obtain the coordinates of each character in the original picture. In this way, the terminal device can accurately locate characters, output the selected characters, and improve the efficiency and accuracy of character selection and recognition.
  • the OCR detection model configured on the terminal device can detect the original image, and generate a to-be-recognized image containing the text content of the curved text in the original image.
  • the above-mentioned to-be-recognized image can be used as the input data of the OCR recognition model.
  • the character selection method for curved text provided by the embodiments of the present application can directly use the OCR detection model and OCR recognition model with curved text detection capability already configured in the terminal device, which helps to expand the application scope of the method and reduces the technical difficulty of using the method on the terminal device.
  • the terminal device may also generate, according to the second coordinate corresponding to each of the plurality of characters, first prompt information in the original picture, where the first prompt information is used to indicate that the user can select characters in the original picture. This makes it easier for users to select characters.
  • the CTC sequence corresponding to the text content may refer to the sequence after processing the initial CTC sequence output by the OCR recognition model, and the processing of the initial CTC sequence may be implemented by the character selection model provided in the embodiment of the present application.
  • the above text content is recognized by the OCR recognition model, and the initial CTC sequence can be output.
  • the character selection model can determine the length of the initial CTC sequence and the picture width of the picture to be recognized, and determine whether the product between the length of the initial CTC sequence and the downsampling multiple of the character selection model is greater than the picture width of the picture to be recognized.
  • if so, the initial CTC sequence needs to be cropped to a certain extent, so that the product of the length of the cropped CTC sequence and the downsampling multiple of the model is less than or equal to the picture width of the picture to be recognized.
  • the CTC sequence may be clipped by removing the head element and the tail element of the initial CTC sequence in turn: one head element is clipped first, then one tail element. After each head or tail element is clipped, it is calculated again whether the product of the length of the clipped sequence and the preset downsampling multiple is less than or equal to the picture width of the picture to be recognized.
  • if so, clipping stops and the currently obtained CTC sequence is output as the CTC sequence corresponding to the text content. If the product of the clipped length and the preset downsampling multiple is still larger than the picture width, clipping continues in the above order until the product is less than or equal to the picture width of the picture to be recognized.
  • the amount of data for subsequent processing can be reduced, and the processing efficiency can be improved.
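  • As an illustration of the clipping loop just described, the following minimal Python sketch alternately drops a head and a tail element until the length constraint is met (the function name is hypothetical, and the 8x downsampling multiple is assumed to match the recognition model described later in this document):

```python
def crop_ctc_sequence(ctc_seq, picture_width, downsample=8):
    """Alternately clip head and tail elements until
    len(sequence) * downsample <= picture width of the picture to be recognized."""
    seq = list(ctc_seq)
    clip_head = True
    while len(seq) * downsample > picture_width:
        if clip_head:
            seq.pop(0)   # clip one head element
        else:
            seq.pop()    # clip one tail element
        clip_head = not clip_head  # next time, clip the other end
    return seq
```

For example, under these assumptions a 30-element sequence against a 232-pixel-wide picture (30 x 8 = 240 > 232) is clipped once at the head, leaving 29 elements (29 x 8 = 232).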
  • the parameters may be initialized.
  • Parameters that need to be initialized can include the left boundary, right boundary, current state, previous state, coordinate array, content array, and so on.
  • calculating the first coordinate of each character in the CTC sequence in the to-be-recognized picture may be accomplished by determining the left and right boundary coordinates of each character in the CTC sequence; combining the left boundary and right boundary coordinates of a character yields its first coordinate in the picture to be recognized.
  • the character boundary of each character in the CTC sequence can be obtained by determining the right boundary of the character and the left boundary of the adjacent next character.
  • the original right boundary coordinate of the character and the original left boundary coordinate of the next character can be obtained first, and the average of the two can then be calculated; after fine-tuning based on this average, the right boundary coordinate of the character and the left boundary coordinate of the next character are obtained.
  • fine-tuning the boundary of each character may be performed according to the character type. When fine-tuning a character boundary, the first character type of the current character and the second character type of the next character are determined respectively; if the two character types differ, they have different offsets, and if they are the same, the offsets are the same.
  • the first difference value can be obtained by calculating the above average value minus the corresponding offset of the first character type of the current character, and then using the first difference value as the fine-tuned right boundary coordinate of the current character;
  • the second sum value can be obtained by adding the offset corresponding to the second character type to the average value, and the second sum value is then used as the fine-tuned left boundary coordinate of the next character.
  • the embodiment of the present application fine-tunes the boundary of each character in the CTC sequence, which prevents the problem that, in subsequent processing, when a character is clicked, the character selection control generated according to the second coordinate cannot completely cover the character; this ensures the accuracy of character positioning.
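  • A minimal sketch of the boundary fine-tuning described above. The per-type offset values and type names are illustrative hyperparameters, not values from the patent:

```python
# Hypothetical per-character-type offsets (language-dependent hyperparameters).
OFFSETS = {"cjk": 1.0, "latin": 0.5, "digit": 0.5, "symbol": 0.25}

def finetune_boundary(right_of_current, left_of_next, current_type, next_type):
    """Fine-tune the shared boundary between two adjacent characters:
    average the raw boundaries, then shift by type-specific offsets."""
    avg = (right_of_current + left_of_next) / 2.0
    new_right = avg - OFFSETS[current_type]  # the "first difference value"
    new_left = avg + OFFSETS[next_type]      # the "second sum value"
    return new_right, new_left
```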
  • each element in the CTC sequence corresponds to the image to be recognized after downsampling by a certain multiple through the convolutional neural network in the OCR recognition model, with a sliding window applied on the last-layer feature map through a 4x4 convolution kernel.
  • an element in the CTC sequence corresponds to the range covered by the downsampling multiple of the pixel width in the picture to be recognized, and then, when calculating the first coordinate of each character in the picture to be recognized according to the character boundary coordinates of each character , the left and right border coordinates of each character can be multiplied by the preset downsampling multiples to obtain the left border position and right border position of each character in the image to be recognized, and then according to the left border position and The right border position obtains the first vertex coordinates and the second vertex coordinates of the character in the image to be recognized.
  • similarly, the third vertex coordinate and the fourth vertex coordinate of each character in the to-be-recognized image can be determined; that is, the four vertex coordinates of each character in the to-be-recognized image are obtained. Each character is divided according to a rectangular frame in the to-be-recognized picture, each character occupies one rectangular frame, and the four vertex coordinates describe the rectangular frame in which the character is located.
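  • The step from CTC-index boundaries to the four vertex coordinates can be sketched as follows (names are illustrative; an 8x downsampling multiple is assumed, and the character frame is assumed to span the full picture height, as in the description above):

```python
def char_vertices(left_idx, right_idx, img_height, downsample=8):
    """Scale CTC-index boundaries to pixel positions in the picture to be
    recognized, then build the character's rectangular frame (4 vertices)."""
    x_left = left_idx * downsample    # left border position in pixels
    x_right = right_idx * downsample  # right border position in pixels
    # first/second vertices on the top edge, third/fourth on the bottom edge
    return [(x_left, 0), (x_right, 0), (x_left, img_height), (x_right, img_height)]
```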
  • the terminal device can traverse the characters and then respectively determine the segmented area in which each vertex coordinate, such as the first vertex coordinate, is located.
  • the segmented area is used to indicate a segmented perspective transformation area corresponding to the image to be identified and the original image.
  • for example, if the first vertex coordinate of a character is located in rectangular frame A in the image to be recognized, and rectangular frame A corresponds to area A in the original image, while the second vertex coordinate is located in rectangular frame B, which corresponds to area B in the original picture, then when mapping the character's coordinates to the original image, the first vertex coordinate is multiplied by the perspective transformation matrix between rectangular frame A and area A to obtain its coordinates in the original image, and the second vertex coordinate is multiplied by the perspective transformation matrix between rectangular frame B and area B to obtain its coordinates in the original image.
  • the terminal device obtains the third coordinate by multiplying the first vertex coordinate by the perspective transformation matrix of the first segmented area, the fourth coordinate by multiplying the second vertex coordinate by the perspective transformation matrix of the second segmented area, the fifth coordinate by multiplying the third vertex coordinate by the perspective transformation matrix of the third segmented area, and the sixth coordinate by multiplying the fourth vertex coordinate by the perspective transformation matrix of the fourth segmented area; the third, fourth, fifth and sixth coordinates together serve as the second coordinate of the character in the original picture.
  • the segmented area where the coordinates of the four vertices of each character are located is determined, and the second coordinate is obtained according to the perspective transformation matrix of the corresponding segmented area, which improves the calculation accuracy of the first coordinate and ensures that when selecting a certain character, the character selection control can completely cover the area where the character is located.
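  • A minimal OpenCV/NumPy sketch of this per-vertex segmented mapping. All function and parameter names are illustrative assumptions; segments are taken as corresponding four-point quadrilaterals in the picture to be recognized and in the original picture:

```python
import cv2
import numpy as np

def find_segment(pt, segment_quads):
    # Return the index of the segmented area whose quadrilateral contains pt
    # (assumes every character vertex falls inside some segment).
    for i, quad in enumerate(segment_quads):
        if cv2.pointPolygonTest(np.float32(quad), (float(pt[0]), float(pt[1])), False) >= 0:
            return i
    return len(segment_quads) - 1  # fallback: use the last segment

def map_vertices_to_original(char_vertices, segs_rectified, segs_original):
    """Map a character's four vertices from the picture to be recognized to the
    original picture, each vertex through its own segment's perspective matrix."""
    second_coordinate = []
    for x, y in char_vertices:
        i = find_segment((x, y), segs_rectified)
        M = cv2.getPerspectiveTransform(np.float32(segs_rectified[i]),
                                        np.float32(segs_original[i]))
        px, py, pw = M @ np.array([x, y, 1.0])
        second_coordinate.append((px / pw, py / pw))  # homogeneous divide
    return second_coordinate
```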
  • character selection controls can be displayed with different background colors.
  • the character area of each character can be drawn as the first color according to the second coordinate.
  • the first color may be any color, but it should be noted that after the character area is drawn as the first color, inconvenience for the user to recognize the characters due to the first color should be avoided.
  • by generating a character selection control in the original picture the user can be prompted which regions are recognized text regions, so that the user can click characters in the region.
  • the terminal device can monitor the user's click events in the character area in real time and determine whether a click falls within a text box. If it does, the terminal device can redraw the background color of that area to distinguish it from the areas that have not been clicked; if it does not, the terminal device does not respond.
  • the entire line of character areas including the clicked character area may be drawn as a second color that is different from the first color. Then continue to monitor whether there is a drag event in the character area corresponding to the second color. If there is, the character area covered by the second color can be adjusted according to the drag event, and the adjusted character area is the character area that the user expects to click. For each character in the character area covered by the adjusted second color, the result recognized by the OCR recognition model can be output, and the result can be displayed on the display interface of the terminal device.
  • the area containing the clicked character (that is, the area where the character closest to the clicked position is located) may be drawn as a second color different from the first color. Then continue to monitor whether there is a drag event in the character area corresponding to the second color. If there is, the character area covered by the second color can be adjusted according to the drag event, and the adjusted character area is the character area that the user expects to click. For each character in the character area covered by the adjusted second color, the result recognized by the OCR recognition model can be output, and the result can be displayed on the display interface of the terminal device.
  • the character area that the user slides and clicks on the screen may be drawn as a second color different from the first color. Then continue to monitor whether there is a drag event in the character area corresponding to the second color. If there is, the character area covered by the second color can be adjusted according to the drag event, and the adjusted character area is the character area that the user expects to click. For each character in the character area covered by the adjusted second color, the result recognized by the OCR recognition model can be output, and the result can be displayed on the display interface of the terminal device.
  • the drag event can be understood as follows: after the terminal device draws the second-color background for the selected text box, draggable left and right controls appear at the beginning and end of the text box, and the user can drag the two controls to modify the text selection range.
  • by drawing prompt information of the first color in the original picture, the user can be prompted as to which characters in the region are available for selection.
  • the entire text area in the corresponding area can be drawn as a second color according to different click positions, so as to inform the user of the content of the characters currently selected.
  • the user can adjust the content of the selected character by dragging in the character area drawn with the second color.
  • the embodiment of the present application improves the operational convenience of character selection by implementing interaction with the user on the terminal interface.
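  • As an illustration of the click monitoring described above, a small pure-Python sketch (all names are hypothetical) that decides whether a click falls inside a recognized character area and, if so, which one:

```python
def point_in_quad(pt, quad):
    """Ray-casting point-in-polygon test for one character's quadrilateral."""
    x, y = pt
    inside = False
    n = len(quad)
    for i in range(n):
        x1, y1 = quad[i]
        x2, y2 = quad[(i + 1) % n]
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

def on_click(click_xy, char_quads):
    """Respond only when the click lands inside a recognized character area,
    as described above; otherwise ignore the event."""
    for i, quad in enumerate(char_quads):
        if point_in_quad(click_xy, quad):
            return i   # caller redraws this character/line with the second color
    return None        # click outside all text boxes: no response
```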
  • an embodiment of the present application provides a character selection apparatus, which has the function of implementing the behavior of the terminal device in the first aspect above.
  • This function can be implemented by hardware or by executing corresponding software by hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the apparatus includes units or modules for performing the steps of the above first aspect.
  • the device includes: a display module for displaying an original picture on a display interface of a terminal device, where the original picture contains curved text; and a processing module for detecting the original picture and generating a to-be-recognized picture containing straight text, the text content of the straight text corresponding one-to-one with the text content of the curved text; recognizing, according to the to-be-recognized picture, the text content of the straight text to obtain a connectionist temporal classification sequence corresponding to the text content, the sequence including a plurality of characters; calculating the first coordinates of each of the plurality of characters in the to-be-recognized picture; and determining the segmented area in which the first coordinate corresponding to each of the plurality of characters is located.
  • the processing module is further configured to generate first prompt information in the original picture according to the second coordinate corresponding to each character in the plurality of characters, and the first prompt information is used to indicate The user can select characters in the original picture.
  • the processing module is specifically configured to recognize the text content of the straight text according to the to-be-recognized picture to obtain an initial connectionist temporal classification sequence, determine the length of the initial sequence, and determine whether the product of that length and the preset downsampling multiple is greater than the picture width of the to-be-recognized picture; if so, the initial sequence is clipped.
  • the processing module is specifically configured to clip the head element or the tail element of the initial connectionist temporal classification sequence in turn; after clipping any head or tail element, to calculate whether the product of the length of the clipped sequence and the preset downsampling multiple is less than or equal to the picture width of the to-be-recognized picture; and, when it is, to stop clipping and output the connectionist temporal classification sequence corresponding to the text content of the straight text.
  • the processing module is specifically configured to determine the character boundary coordinates of each of the plurality of characters in the connectionist temporal classification sequence, where the character boundary coordinates include left boundary coordinates and right boundary coordinates, and to calculate, according to the character boundary coordinates of each character, the first coordinates of each character in the to-be-recognized picture.
  • the processing module is specifically configured to obtain, for any character in the connectionist temporal classification sequence, the original right boundary coordinate of the character and the original left boundary coordinate of the next character; calculate the average of the original right boundary coordinate and the original left boundary coordinate; and determine, based on the average, the right boundary coordinate of the character and the left boundary coordinate of the next character.
  • the processing module is specifically configured to respectively determine the first character type of the character and the second character type of the next character, the first character type and the second character type each having a corresponding offset; subtract the offset corresponding to the first character type from the average value to obtain a first difference value, and use the first difference value as the right boundary coordinate of the character; and add the offset corresponding to the second character type to the average value to obtain a second sum value, and use the second sum value as the left boundary coordinate of the next character.
  • the first coordinate of each character in the picture to be recognized includes the first vertex coordinate, the second vertex coordinate, the third vertex coordinate and the fourth vertex coordinate;
  • the processing module is specifically configured to multiply the left boundary coordinate and the right boundary coordinate of each of the plurality of characters by the preset downsampling multiple to obtain the first vertex coordinate and the second vertex coordinate, and to determine the third vertex coordinate and the fourth vertex coordinate according to the first vertex coordinate, the second vertex coordinate and the picture height of the picture to be recognized.
  • the processing module is specifically configured to traverse the first coordinates of each character and the segmented area of the to-be-recognized picture, and respectively determine the first segmented area where the coordinates of the first vertex are located, and the The second segmented area where the second vertex coordinates are located, the third segmented area where the third vertex coordinates are located, and the fourth segmented area where the fourth vertex coordinates are located.
  • the processing module is specifically configured to multiply the first vertex coordinate by the perspective transformation matrix of the first segmented area to obtain a third coordinate; multiply the second vertex coordinate by the perspective transformation matrix of the second segmented area to obtain a fourth coordinate; multiply the third vertex coordinate by the perspective transformation matrix of the third segmented area to obtain a fifth coordinate; and multiply the fourth vertex coordinate by the perspective transformation matrix of the fourth segmented area to obtain a sixth coordinate; the third, fourth, fifth and sixth coordinates serve as the second coordinate of the character in the original picture.
  • the processing module is specifically configured to draw, according to the second coordinate corresponding to each of the plurality of characters, the character area of each character in the original picture as the first color; or to draw, according to the second coordinate corresponding to each character, a text box around the character area of each character in the original picture.
  • the processing module is further configured to, upon monitoring a click event of the user in the character area, draw the entire line of character areas including the clicked character area as the second color; upon monitoring a drag event in the character area corresponding to the second color, adjust the character area covered by the second color according to the drag event; and identify and display each character in the character area covered by the second color.
  • the processing module is further configured to draw the character area closest to the click position indicated by the click event as the second color when the click event of the user in the character area is monitored; For the drag event of the character area corresponding to the second color, adjust the character area covered by the second color according to the drag event; identify and display each character in the character area covered by the second color.
  • the processing module is further configured to, upon monitoring a sliding event of the user in the character area, draw the character area indicated by the sliding event as the second color; upon monitoring a drag event in the character area corresponding to the second color, adjust the character area covered by the second color according to the drag event; and identify and display each character in the character area covered by the second color.
  • it also includes a storage module for storing necessary program instructions and data of the character selection device.
  • an obtaining module is also included for obtaining the original picture.
  • the apparatus includes: a processor and a transceiver, where the processor is configured to support the character selection apparatus to perform corresponding functions in the method provided in the first aspect.
  • the transceiver is used for instructing the communication between the character selection device and other devices in the communication system, such as sending the selected characters to other devices.
  • the apparatus may further include a memory for coupling with the processor, which stores program instructions and data necessary for the character selection apparatus.
  • when the character selection apparatus is a chip within a character selection device, the chip includes: a processing module and a transceiver module.
  • the transceiver module may be, for example, an input/output interface, a pin or a circuit on the chip.
  • the processing module can be, for example, a processor, which is used to detect the original picture and generate a to-be-recognized picture containing straight text, the text content of the straight text corresponding one-to-one with the text content of the curved text; recognize, according to the to-be-recognized picture, the text content of the straight text and obtain a connectionist temporal classification sequence corresponding to it, the sequence including a plurality of characters; calculate the first coordinate of each of the plurality of characters in the to-be-recognized picture; determine the segmented area in which the first coordinate corresponding to each character is located; determine, according to the original picture and the to-be-recognized picture, the perspective transformation matrix corresponding to each segmented area when the original picture is transformed into the to-be-recognized picture; multiply the first coordinate of each character by the perspective transformation matrix to obtain the second coordinate of each character in the original picture; and detect a first operation performed by the user on the original picture, the first operation being used to select characters in the curved text on the original picture.
  • the processing module can execute the computer-executed instructions stored in the storage unit, so as to support the character selection apparatus to execute the method provided in the first aspect.
  • the storage unit can be a storage unit in the chip, such as a register, a cache, etc.
  • the storage unit can also be a storage unit located outside the chip, such as a read-only memory (read-only memory, ROM) or a random access memory (random access memory, RAM).
  • the character selection device includes: a processor, a radio frequency circuit and an antenna.
  • the processor is used to control the function of each circuit part and implement the method of the first aspect.
  • the radio frequency circuit can perform analog conversion, filtering, amplification, and up-conversion processing on the information to be sent generated by the processor, and then send it to other devices in the communication system via the antenna.
  • the device further includes a memory, which stores necessary program instructions and data for the character selection device.
  • the device includes a communication interface and a logic circuit, the logic circuit being configured to detect the original picture and generate a to-be-recognized picture containing straight text, the text content of the straight text corresponding one-to-one with the text content of the curved text; recognize, according to the to-be-recognized picture, the text content of the straight text and obtain a connectionist temporal classification sequence corresponding to it, the sequence including a plurality of characters; calculate the first coordinate of each of the plurality of characters in the to-be-recognized picture; determine the segmented area in which the first coordinate corresponding to each character is located; determine, according to the original picture and the to-be-recognized picture, the perspective transformation matrix corresponding to each segmented area when the original picture is transformed into the to-be-recognized picture; multiply the first coordinate of each character by the perspective transformation matrix to obtain the second coordinate of each character in the original picture; and detect a first operation performed by the user on the original picture, the first operation being used to select characters in the curved text on the original picture.
  • the processor mentioned in any of the above may be a general-purpose central processing unit (CPU), a processor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the program execution of the character selection method for curved text in the above aspects.
  • an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the character selection method for curved text described in the first aspect above is implemented.
  • an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor of a terminal device, the above-mentioned first aspect is implemented Character selection method for curved text.
  • an embodiment of the present application provides a computer program product that, when the computer program product runs on a terminal device, enables the terminal device to execute the character selection method for curved text described in the first aspect above.
  • First, a mapping relationship is established between the indices of the CTC sequence output by the detection and recognition model and the coordinates of the original image; the model used in the whole process is simple, and the amount of data that needs to be annotated is small. Compared with the prior art, which needs to provide the original image or the corrected image, or needs to traverse the image, some embodiments of the present application can use real-time detections or collected pictures; the calculation is fast, and the method can be quickly and easily adapted and deployed into existing OCR models.
  • Second, when determining the character boundary, some embodiments of the present application can set hyperparameters according to different languages to fine-tune the character boundary, ensuring the accuracy of character boundary recognition; this is not only applicable to most languages but also supports the selection of symbols.
  • Third, the coordinates of each character in the original image are obtained by recording the coordinates of the four vertices of each character in the image to be recognized and mapping the four vertex coordinates to the original image according to the segmented areas, which makes the granularity of the coordinate output finer.
  • Fourth, the character selection methods provided by some embodiments of the present application are also more robust for character localization in natural scenes.
  • Fifth, in some embodiments of the present application, by associating the recognized coordinates of each character with the coordinates of each character in the original picture (also referred to as the second coordinates in this document for convenience of distinction) and drawing the recognized text area according to the second coordinates, the user can be supported to make arbitrary selections and range adjustments within the recognized text area.
  • the above text recognition and selection scenarios can also be associated with functions such as translation and question scanning: after the user makes a selection, the selected text can be directly translated, or the identified question can be answered. Alternatively, by parsing the string selected by the user, if it contains text content in telephone or email format, a separate card can be extracted, making it convenient for the user to place a call or send an email directly. This provides strong ease of use and practicability.
  • Figure 1a to Figure 1b are scene schematic diagrams of a text click scene
  • FIGS. 2a to 2c are schematic diagrams of recognition results of a text-clicking scheme for curved text in the prior art
  • FIG. 3 is a schematic diagram of the correspondence between a CTC sequence in the prior art and a character recognition or speech recognition result
  • FIG. 4 is a schematic diagram of a process flow and a logical relationship between the OCR detection recognition model and the character selection model in the embodiment of the application;
  • FIG. 5 is a schematic diagram of a detection processing process for curved text in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a processing flow for converting curved text into straight text in an embodiment of the present application
  • FIG. 7 is a schematic diagram of an overall processing process for obtaining text coordinates on a picture to be recognized in an embodiment of the application
  • FIG. 8 is a schematic flowchart of an algorithm for obtaining text coordinates on a picture to be recognized in an embodiment of the present application
  • FIG. 9 is a schematic diagram of the hardware structure of a mobile phone to which the character selection method for curved text in an embodiment of the present application is applicable;
  • FIG. 10 is a schematic diagram of a software structure of a mobile phone to which the character selection method for curved text in the embodiment of the application is applicable;
  • FIG. 11 is a schematic diagram of an embodiment of a character selection method for curved text in an embodiment of the present application.
  • FIG. 12 is a schematic diagram of the display effect of the coordinates of each character on the to-be-recognized picture according to an embodiment of the present application
  • FIG. 13 is a schematic diagram of the effect of combining a coordinate display effect on a picture to be recognized and a rectangular segment in an embodiment of the present application;
  • FIGS. 14a to 14b are schematic diagrams of an embodiment of a character selection control provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of another embodiment of a character selection method for curved text in an embodiment of the present application.
  • FIGS. 16a to 16c are schematic diagrams showing an example effect when a character is selected by a character selection method for curved text according to an embodiment of the present application
  • FIG. 17 is a schematic diagram of outputting the selected target character after the user selects the target character in the embodiment of the present application.
  • FIG. 18 is a schematic diagram of a plurality of application selection interfaces in an embodiment of the present application.
  • FIG. 19 is an exemplary structural block diagram of a character selection device in an embodiment of the present application.
  • FIG. 20 is another exemplary structural block diagram of the character selection apparatus in the embodiment of the present application.
  • the naming or numbering of the steps in this application does not mean that the steps in the method flow must be executed in the temporal/logical order indicated by the naming or numbering; the execution order of named or numbered process steps can be changed according to the technical purpose to be achieved, as long as the same or similar technical effects can be achieved.
  • the division of units in this application is a logical division; in practical applications there may be other division methods, for example multiple units may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling or communication connections shown or discussed may be through some interfaces, and the indirect coupling or communication connections between units may be electrical or in other similar forms, none of which is restricted in this application.
  • the units or sub-units described as separate components may or may not be physically separated, may or may not be physical units, or may be distributed into multiple circuit units; some or all of the units may be selected according to actual needs to achieve the purpose of the scheme of this application.
  • the terminology used in this application is for the purpose of describing particular embodiments only, and is not intended to be limiting of the application.
  • the singular expressions "a", "an", "the", "above", "said" and "this" are intended to also include expressions such as "one or more", unless the context clearly dictates otherwise.
  • "one or more" refers to one, two or more; "and/or" describes an association relationship between associated objects and indicates that three kinds of relationships may exist; for example, "A and/or B" can mean that A exists alone, A and B exist simultaneously, or B exists alone, where A and B can be singular or plural.
  • the character “/” generally indicates that the associated objects are an "or" relationship.
  • the text selection scheme of the prior art is first introduced. In mobile phone smart lens scenes (such as Google Lens, Baidu smart lens and Huawei HiVision), the smart lens can complete multi-object detection and text detection and recognition (OCR) tasks in AR scenes and provide the positions of objects and text boxes. The user can click a text line on the phone screen, at which point the screen is frozen; the user can then click the detected and recognized text content, with an effect similar to dragging and selecting text character by character with the mouse in a text document. For the clicked content, the user can choose follow-up operations such as copying, translating or searching.
  • the text selection scene is divided into a straight text scene (as shown in Figure 1a) and a curved text scene (as shown in Figure 1b).
  • the text lines are all angled rectangular boxes, parallelogram boxes or quadrilateral boxes, and the text lines are generally described by four vertices.
  • the text box is generally composed of a polygon, which can be decomposed into many quadrilaterals, and these quadrilaterals are spliced into the polygon.
  • FIG. 2a to FIG. 2c show schematic diagrams of recognition results of a text selection scheme in the prior art.
  • This point-and-click solution has the following drawbacks:
  • first, the detection frames overlap seriously, because the detection frame is a splicing of straight text frames.
  • the selected character is located in multiple straight text boxes, the recognized text may be cut apart at the junction of semantically coherent texts, as shown in Figure 2b.
  • the identification result is "Technology Co-Mining Co., Ltd. Company Consumption”.
  • the highlighted area can also expand according to the selected characters: as shown in Figure 2c, the selected characters are "Company Consumer BG wisdom", and the highlighted area can be as large as the larger rectangle in Figure 2c, in which case the visual effect for the user is poor.
  • the embodiment of the present application provides a character selection method for curved text based on the CTC sequence output by the OCR recognition model.
  • the character selection method for curved text mainly uses the CTC sequence output after recognition by the OCR recognition model and, based on this output CTC sequence, obtains the coordinates of each character so that the user can manually select characters one by one on the screen.
  • Connectionist temporal classification (CTC) is a loss calculation method, mainly used in sequence recognition models such as text recognition and speech recognition.
  • Most sequence recognition models use the structure of CNN + Recurrent Neural Network (RNN) + CTC.
  • With CTC Loss in place of the logistic-regression-based loss calculation method (Softmax Loss), training samples do not need to be aligned.
  • FIG. 3 is a schematic diagram of the correspondence between a CTC sequence and a text recognition or speech recognition result in the prior art.
  • the symbol "-" stands for blank and serves as a placeholder for distinguishing repeated characters; that is, repeated characters between two blanks are merged into one character. For example, if the recognized text is "wwo-rrr-lld", then after merging the repeated characters between blanks and removing the blanks, the final output recognition result is the word "world".
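  • This merging rule can be sketched in a few lines of Python (using "-" as the blank placeholder, as in the example above):

```python
def ctc_collapse(raw, blank="-"):
    """Collapse a raw CTC output: merge repeated characters between
    blanks, then drop the blanks, e.g. "wwo-rrr-lld" -> "world"."""
    out = []
    prev = None
    for ch in raw:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

assert ctc_collapse("wwo-rrr-lld") == "world"
```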
  • FIG. 4 is a schematic diagram of the flow and logical relationship between the OCR detection and recognition model and the character selection model provided by an embodiment of the present application.
  • the OCR detection model can detect the input picture (that is, the original picture containing curved text), and obtain the polygonal curved text box.
  • the model transforms the curved text box into a to-be-recognized picture containing straight text (the straight text corresponds to the text content of the curved text in the original picture) through piecewise perspective transformation.
  • the above-mentioned input picture may be an image obtained by photographing or scanning through a mobile phone lens, a scanner and other equipment, and the picture to be recognized may be regarded as input data for being provided to the OCR recognition model for processing.
  • the OCR recognition model can output a CTC sequence containing the recognized characters.
  • a schematic diagram of the processing flow for detecting the curved text box may be as shown in FIG. 5.
  • a in Figure 5 is the curved text in the original picture; the center line is then fitted as shown in b in Figure 5; the text box and rotation angle are further fitted as shown in c in Figure 5; redundant boxes are removed by non-maximum suppression as shown in d in Figure 5; and smoothing is performed to obtain a curved text box as shown in e in Figure 5.
  • the specific process of generating the straight text in the image to be recognized from the curved text in the original image through perspective transformation may be as shown in FIG. 6.
  • a in Figure 6 is the curved text in the original picture, and then a curved text box as shown in b in Figure 6 is obtained through curved text detection;
  • c in Figure 6 indicates that the curved text box is composed of multiple quadrilaterals, with the dotted lines being the junctions between the quadrilaterals; through piecewise perspective transformation, a straight text box as shown in d in Figure 6 is obtained.
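  • As an illustration of this piecewise perspective transformation, the following OpenCV sketch (all names are illustrative; each quadrilateral is assumed to be given as four corner points ordered top-left, top-right, bottom-right, bottom-left) warps each piece to an upright strip and concatenates the strips into the straight text box:

```python
import cv2
import numpy as np

def rectify_piecewise(original, quads, strip_height=48):
    """Warp each quadrilateral piece of the curved text box to an upright
    rectangle and concatenate the pieces into one straight-text strip."""
    strips = []
    for q in quads:
        q = np.float32(q)  # corners: top-left, top-right, bottom-right, bottom-left
        w = int(max(np.linalg.norm(q[1] - q[0]), np.linalg.norm(q[2] - q[3])))
        dst = np.float32([[0, 0], [w, 0], [w, strip_height], [0, strip_height]])
        M = cv2.getPerspectiveTransform(q, dst)
        strips.append(cv2.warpPerspective(original, M, (w, strip_height)))
    return cv2.hconcat(strips)  # the picture to be recognized
```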
  • A typical sequence recognition process uses the CNN+RNN+CTC structure, where the CNN is used for feature extraction and RNN+CTC are used for sequence recognition.
  • in contrast, the recognition model corresponding to the character selection method provided in the embodiment of the present application uses only a CNN as the network structure and directly slides a window on the last-layer feature map after 8x downsampling, thereby obtaining the CTC sequence; the entire recognition model does not involve any RNN structure.
  • the CTC sequence output by the OCR detection and recognition model, together with the picture to be recognized obtained by detection, can be used as the input of the character selection model of the embodiment of the present application.
  • FIG. 7 and FIG. 8 are, respectively, a schematic diagram of the overall processing process and a schematic flowchart of the algorithm for acquiring text coordinates on the picture to be recognized according to an embodiment of the present application.
  • the OCR detection and recognition model can automatically detect the text area in the image; using a CNN as the network structure of the OCR recognition model, it can slide a window directly on the last-layer CNN feature map after 8x downsampling and output the CTC sequence of the detection and recognition model.
  • after the character selection model clips the above sequence, it can obtain a CTC sequence that meets the requirements of subsequent processing, that is, the sequence in Figure 7: [bl, bl, 成, bl, bl, bl, 本, bl, bl, bl, 会, bl, bl, bl, 计, bl, bl, bl, 人, bl, bl, bl, 员, bl, bl], the characters being those of "成本会计人员" ("cost accountant").
  • the character selection model can calculate the index coordinates of the different characters in the CTC sequence of Figure 7, that is, the index coordinates of each character of "成本会计人员" ("cost accountant") in the CTC sequence, namely {[2, 3], [6, 7], [10, 11], [14, 15], [18, 19], [22, 23]}.
  • the above index coordinates are then mapped back to the to-be-recognized picture (that is, the picture obtained by detection from the input picture), giving the coordinates of each character of "成本会计人员" in the to-be-recognized picture, that is, the left and right boundaries of each character.
  • the character selection model can obtain the coordinates of four points of each character in the image to be recognized.
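  • The index-coordinate step illustrated by Figure 7 can be sketched as follows (the function name is illustrative; "bl" denotes blank). Each [left, right) index pair, multiplied by the 8x downsampling multiple, gives the character's left and right pixel boundaries in the picture to be recognized:

```python
def char_index_ranges(ctc_seq, blank="bl"):
    """Return per-character [left, right) index pairs from a CTC sequence;
    the Figure 7 sequence yields [2, 3], [6, 7], ..., [22, 23]."""
    chars, ranges = [], []
    i = 0
    while i < len(ctc_seq):
        if ctc_seq[i] == blank:
            i += 1
            continue
        j = i
        while j + 1 < len(ctc_seq) and ctc_seq[j + 1] == ctc_seq[i]:
            j += 1  # merge a repeated non-blank run into one character
        chars.append(ctc_seq[i])
        ranges.append([i, j + 1])
        i = j + 1
    return chars, ranges
```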
  • each identified text box is drawn with the first color (such as a light highlight background color), and the terminal device draws the selected text with the second color (such as a dark highlight background color) after detecting a click event in the text box area.
  • the character selection model can listen for user click events on the phone screen.
  • the clicked entire line of text can be set as the selected state, and the line of text is drawn as a dark highlight background.
  • a draggable handle can appear at the beginning and end of the entire line of text, and the user can modify the selected text range by dragging the handle.
  • the terminal device calculates the current click position, selects the character closest to the click position and sets it as the selected state, draws the character with a dark highlight background, and then displays a draggable handle on the left and right borders of the character; the user can modify the selected text range by dragging the handle.
  • the terminal device monitors the user's moving area, calculates the starting position and ending position of the moving area, sets the area contained in the moving area to the selected state, and draws the area as a dark highlight background, At the same time, a draggable handle appears at the beginning and end of the area, and the user can modify the selected text range by dragging the handle.
  • the character selection model can obtain the user's click position from the click event generated by the user dragging the handle, and use the character closest to the position as the currently selected character. After repeating the above steps according to the situation of the user dragging the handle, the text area finally selected by the user can be obtained, and a dark background can be drawn.
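  • The nearest-character behaviour described above can be sketched as follows (names are illustrative; each box is a character's four-vertex frame in the original picture):

```python
def nearest_character(click_xy, char_boxes):
    """Pick the character whose box center is closest to the click position,
    as in the nearest-character selection behaviour described above."""
    cx, cy = click_xy

    def center_dist(box):
        xs = [p[0] for p in box]
        ys = [p[1] for p in box]
        return ((sum(xs) / 4 - cx) ** 2 + (sum(ys) / 4 - cy) ** 2) ** 0.5

    return min(range(len(char_boxes)), key=lambda i: center_dist(char_boxes[i]))
```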
  • buttons can also be provided on the text card.
  • the user can click on different buttons to instruct the application program corresponding to the button to perform corresponding operations according to the displayed text. For example, translate the displayed text, search, etc.
  • the character selection method for curved text provided by the embodiments of the present application can be applied to electronic devices such as mobile phones, tablet computers, wearable devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, and personal digital assistants (PDA); the embodiments of the present application do not impose any restrictions on the specific type of the electronic device.
  • electronic device 900 may have more or fewer components than shown in the figures, may combine two or more components, or may have different component configurations.
  • the various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
  • The electronic device 900 may include: a processor 910, an external memory interface 920, an internal memory 921, a universal serial bus (USB) interface 930, a charge management module 940, a power management module 941, a battery 942, an antenna 1, an antenna 2, and the like.
  • The sensor module 980 may include a pressure sensor 980A, a gyroscope sensor 980B, an air pressure sensor 980C, a magnetic sensor 980D, an acceleration sensor 980E, a distance sensor 980F, a proximity light sensor 980G, a fingerprint sensor 980H, a temperature sensor 980J, a touch sensor 980K, an ambient light sensor 980L, a bone conduction sensor 980M, and the like.
  • The structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the electronic device 900.
  • The electronic device 900 may include more or fewer components than shown, or combine some components, or split some components, or arrange the components differently.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • The processor 910 may include one or more processing units; for example, the processor 910 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
  • the controller may be the nerve center and command center of the electronic device 900 .
  • the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 910 for storing instructions and data.
  • The memory in the processor 910 is a cache memory. This memory may hold instructions or data that the processor 910 has just used or uses cyclically. If the processor 910 needs to use the instructions or data again, they can be called directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 910, thereby improving the efficiency of the system.
  • the processor 910 may include one or more interfaces.
  • The interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the electronic device 900 implements a display function through a GPU, a display screen 994, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 994 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 910 may include one or more GPUs that execute program instructions to generate or alter display information.
  • Display screen 994 is used to display images, videos, and the like.
  • Display screen 994 includes a display panel.
  • The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), and so on.
  • the electronic device 900 may include 1 or N display screens 994, where N is a positive integer greater than 1.
  • the electronic device 900 can realize the shooting function through the ISP, the camera 993, the video codec, the GPU, the display screen 994 and the application processor.
  • The ISP is used to process the data fed back by the camera 993. For example, when taking a photo, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing, which converts it into an image visible to the naked eye. The ISP can also perform algorithm optimization on the noise, brightness, and skin tone of the image, and can also optimize the exposure, color temperature, and other parameters of the shooting scene. In some embodiments, the ISP may be located in the camera 993.
  • the camera 993 is used to capture still images or video.
  • the object is projected through the lens to generate an optical image onto the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV.
  • the electronic device 900 may include 1 or N cameras 993 , where N is a positive integer greater than 1.
  • the NPU is a neural-network (NN) computing processor.
  • Applications such as intelligent cognition of the electronic device 900 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
  • FIG. 10 is a block diagram of a software structure of an electronic device 900 according to an embodiment of the present invention.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, an Android runtime (Android runtime) and a system library, and a kernel layer.
  • the application layer can include a series of application packages.
  • the application package may include applications (also referred to as applications) such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, and short message.
  • the application framework layer provides an application programming interface (API) and a programming framework for the applications of the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, a Local Profile Assistant (LPA), and the like.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
  • Content providers are used to store and retrieve data and make these data accessible to applications.
  • the data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications.
  • a display interface can consist of one or more views.
  • the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide the communication function of the electronic device 900 .
  • This includes the management of call status (connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
  • the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications from applications running in the background, and can also display notifications on the screen in the form of a dialog interface. For example, text information is prompted in the status bar, a prompt sound is issued, the electronic device vibrates, and the indicator light flashes.
  • the Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.
  • The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and the application framework layer run in virtual machines.
  • The virtual machine executes the Java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, safety and exception management, and garbage collection.
  • a system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.
  • the Surface Manager is used to manage the display subsystem and provides a fusion of two-dimensional (2-Dimensional, 2D) and three-dimensional (3-Dimensional, 3D) layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
  • 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display drivers, camera drivers, audio drivers, sensor drivers, and virtual card drivers.
  • the workflow of the software and hardware of the electronic device 900 is exemplarily described below in conjunction with capturing a photographing scene.
  • When a touch operation is received, a corresponding hardware interrupt is sent to the kernel layer.
  • the kernel layer processes touch operations into raw input events (including touch coordinates, timestamps of touch operations, etc.). Raw input events are stored at the kernel layer.
  • The application framework layer obtains the original input event from the kernel layer, and identifies the control corresponding to the input event. Taking the touch operation being a touch click operation and the control corresponding to the click operation being the control of the camera application icon as an example, the camera application calls the interface of the application framework layer to start the camera application, and then starts the camera driver by calling the kernel layer.
  • the camera 993 captures still images or video.
  • the following embodiments can be implemented on the mobile phone 900 having the above-mentioned hardware structure/software structure.
  • the following embodiments will take the mobile phone 900 as an example to describe the character selection method based on character recognition provided by the embodiments of the present application.
  • Referring to FIG. 11, a schematic flow chart of the steps of a method for selecting characters of curved text provided by an embodiment of the present application is shown.
  • The method can be applied to the above-mentioned mobile phone 900, and the method can specifically include the following steps:
  • The original picture contains curved text; after the original picture is detected, a to-be-recognized picture containing straight text is generated, and the text content of the straight text corresponds one-to-one with the text content of the curved text.
  • This embodiment introduces the character selection method for curved text from the angle of processing the CTC sequence output by the OCR detection and recognition model, so that prompt information allowing manual selection by the user is presented on the screen; that is, this embodiment describes the process in which the character selection model in FIG. 4 receives the input CTC sequence and processes it.
  • the picture to be recognized may be obtained by detecting the original picture, and the above-mentioned original picture may be an image displayed on the screen of the mobile phone in real time after the user turns on the camera of the mobile phone, and the image contains curved text.
  • the to-be-recognized picture is the picture after converting the curved text into straight text.
  • the CTC sequence corresponding to the text content may refer to the sequence after preliminary processing of the above-mentioned input CTC sequence.
  • the CTC sequence is the intermediate output result of the OCR recognition model. After the recognition model outputs the CTC sequence, the final recognition result can be obtained by analyzing the CTC sequence through certain rules.
  • Any CTC sequence may contain multiple CTC sequence elements.
  • Each element of the sequence to be recognized is obtained by sliding a 4*4 convolution kernel window over the feature map of the last layer of the CNN; an element may be a character or a blank, and the value of the element can be either the recognized character or the character's index.
  • the preliminary processing in this embodiment may include processing processes such as trimming and parameter initialization of the input CTC sequence.
  • the input CTC sequence can be cropped based on the CNN downsampling multiple in the subsequent processing process, so that the product of the cropped CTC sequence and the downsampling multiple is not greater than the image width of the image to be recognized.
  • the characters at the head of the input CTC sequence can be cut first, then the characters at the tail are cut, and then the characters at the head are cut again.
  • bl is a blank character (blank), and the size of the image to be recognized is 32*199 (height*width) pixels.
  • When the product of the length of the cropped sequence and the downsampling multiple no longer exceeds the picture width, the current CTC sequence is returned (see the cropping sketch below).
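A small sketch of the cropping rule as stated above, alternating head and tail cuts (head first) until the product of the sequence length and the downsampling multiple no longer exceeds the picture width; this is an illustrative assumption-based sketch, not the patent's code.

```python
def crop_ctc(seq, picture_width, downsample=8):
    """Alternately cut head and tail elements (head first) until
    len(seq) * downsample <= picture_width."""
    cut_head = True
    while len(seq) * downsample > picture_width:
        seq = seq[1:] if cut_head else seq[:-1]
        cut_head = not cut_head
    return seq

# With picture width 199 and downsample 8: a 26-element sequence gives
# 26*8 = 208 > 199, so one head element is cut; 25*8 = 200 is still > 199,
# so one tail element is cut; 24*8 = 192 <= 199, and the sequence is returned.
```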
  • Parameter initialization for the tailored CTC sequence can include:
  • Character boundaries may include left and right boundaries, ie the positions of the left and right ends of each element or character in the CTC sequence, and the character boundaries may be divided into boundaries in the CTC sequence and boundaries in the original image. What needs to be initialized in this step is the boundary in the CTC sequence.
  • the current state can refer to the type of each element. For example, numbers, Chinese, Japanese, Korean or English, etc.
  • The coordinate array can refer to an array that stores the index coordinates of all characters, and the content array is an array that stores the content of each character or word. During initialization, both the coordinate array and the content array can be set to empty.
  • The first character "cheng" can be processed first, and the right boundary of this character and the left boundary of the next character "ben" can be recorded. After processing all the characters in this way, each distinct character and its corresponding range of array indices can be obtained.
  • the boundaries of different types of characters can also be fine-tuned according to certain rules.
  • The average of the coordinates can first be calculated from the right boundary coordinate of the previous character and the left boundary coordinate of the next character. For example, in the CTC sequence of the above example, for the adjacent characters "cheng" and "ben", the average of the right boundary coordinate of the character "cheng" and the left boundary coordinate of the character "ben" can be calculated.
  • the new left and right boundaries of each character can be obtained.
  • For the previous character, the offset is subtracted from the calculated coordinate mean to obtain its new right boundary coordinate; for the next character, the offset is added to the coordinate mean to obtain its new left boundary coordinate.
  • The offset can be determined according to the following rules:
  • for Chinese characters, the offset can be 1/8; for Western (Latin) characters, the offset can be 1/2; for numbers, the offset can also be 1/2.
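Putting the boundary refinement rule above into a short sketch; the type names are hypothetical, and the offsets are the ones listed above.

```python
OFFSET = {"chinese": 1 / 8, "latin": 1 / 2, "digit": 1 / 2}

def refine_pair(prev_right, next_left, prev_type, next_type):
    """Refine the shared boundary between two adjacent characters:
    take the mean of the neighboring boundaries, then back off by each
    character type's offset."""
    mean = (prev_right + next_left) / 2
    return mean - OFFSET[prev_type], mean + OFFSET[next_type]

# For "cheng" (right boundary 3) and "ben" (left boundary 6), both Chinese:
print(refine_pair(3, 6, "chinese", "chinese"))   # (4.375, 4.625)
```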
  • OCR recognition takes an input of size 32*512 pixels, the output feature map size is 4*65 pixels, and the downsampling multiple is 8. When performing recognition, a 4*4 pixel convolution kernel is used as a sliding window; that is, one element in the CTC sequence corresponds to the range covered by an 8-pixel width of the original image.
  • the character boundary of each character calculated in the previous step is multiplied by the CNN downsampling multiple to obtain the coordinates of each character in the to-be-recognized picture.
  • The coordinates of each character in the image to be recognized can be obtained, and the unit of the coordinates is pixels.
  • If the display is based directly on the above coordinates, some pixels of the first and last characters of each line may fall outside the pixel range corresponding to the coordinates.
  • For example, for the first character "cheng", if the display is performed according to the left boundary coordinate 8, the character may have some pixels further to the left of coordinate 8; similarly, for the last character "yuan", if the display is performed according to the right boundary coordinate 184, the character may have some pixels further to the right of coordinate 184.
  • the coordinates of the first and last characters can also be fine-tuned.
  • The objects of the fine-tuning can be the left boundary coordinate of the first character in the picture to be recognized and the right boundary coordinate of the last character in the picture to be recognized.
  • The final coordinates of the left and right borders of each character in the image to be recognized, obtained after fine-tuning the coordinates of the first and last characters, can be determined as follows:
  • A certain value is subtracted from the left boundary coordinate of the first character, and a certain value is added to the right boundary coordinate of the last character.
  • The subtracted and added values may be the same or different, which is not limited in this embodiment.
  • The coordinates obtained above are only the left and right boundary coordinates of each character in the to-be-recognized picture. Therefore, in the subsequent processing, the coordinates of the four vertices of each character, that is, the coordinates of the upper-left, lower-left, upper-right, and lower-right vertices, can be obtained by combining the height of the to-be-recognized picture.
  • the coordinates of its four vertices can be expressed as [(2, 32), (2, 0), (35.0, 0), (35.0, 32)].
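A sketch of the index-to-picture mapping and four-vertex construction described above, assuming a downsampling multiple of 8 and a picture height of 32; the fine-tuning margin for the first and last characters is a placeholder value, since the text leaves the exact amount open.

```python
def char_quad(left, right, picture_height=32, downsample=8,
              first=False, last=False, margin=2.0):
    """Map CTC index boundaries to picture coordinates and build the four
    vertex coordinates. `margin` is a hypothetical fine-tune value for the
    first/last character of a line."""
    x0, x1 = left * downsample, right * downsample
    if first:
        x0 -= margin        # widen the first character's left boundary
    if last:
        x1 += margin        # widen the last character's right boundary
    # same vertex order as the example above:
    # (left, h), (left, 0), (right, 0), (right, h)
    return [(x0, picture_height), (x0, 0), (x1, 0), (x1, picture_height)]
```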
  • Referring to FIG. 12, it is a schematic diagram of the display effect of the coordinates of each character on the picture to be recognized.
  • each character has a corresponding rectangular segment.
  • The vertex coordinates of one character are not necessarily in the same rectangular segment. For example, the coordinates of the four vertices of "cheng" are all within the first rectangular segment, but the lower-left and upper-left vertices of "hui" are inside the third rectangular segment, while the lower-right and upper-right vertices are inside the fourth rectangular segment (a mapping sketch is given below).
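The per-segment perspective mapping that the later steps rely on can be sketched with OpenCV as follows. The segment quadrilaterals in the straightened picture and their counterparts in the original picture are assumed to be supplied by the detection stage; each vertex is mapped with the matrix of whichever segment it falls in, so the four vertices of one character may use different matrices, as noted above. All names are illustrative.

```python
import numpy as np
import cv2

def build_matrices(src_quads, dst_quads):
    """One perspective transformation matrix per segmented area: src_quads are
    4-point segments in the to-be-recognized (straightened) picture, dst_quads
    the matching quadrilaterals in the original picture."""
    return [cv2.getPerspectiveTransform(np.float32(s), np.float32(d))
            for s, d in zip(src_quads, dst_quads)]

def find_segment(x, segment_bounds):
    """segment_bounds: list of (x_start, x_end) of each rectangular segment."""
    for k, (s, e) in enumerate(segment_bounds):
        if s <= x <= e:
            return k
    return len(segment_bounds) - 1

def map_vertex(vertex, matrices, segment_bounds):
    """Map one character vertex into the original picture using the matrix of
    the segment the vertex falls in."""
    k = find_segment(vertex[0], segment_bounds)
    pt = np.float32([[vertex]])                  # shape (1, 1, 2)
    return cv2.perspectiveTransform(pt, matrices[k])[0, 0]
```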
  • a character selection control may be generated in the original image according to the second coordinate.
  • the above character selection control may draw the character lines in each text area as a highlighted background, so as to inform the user of the currently recognized character range.
  • The click event of the user on the mobile phone screen, that is, on the original picture presented by the mobile phone screen, can be monitored.
  • the clicked entire line of text can be set to the selected state, and the line of text can be drawn as a dark highlight background.
  • a draggable handle can appear at the beginning and end of the entire line of text, and the user can modify the selected text range by dragging the handle.
  • Referring to FIG. 14a and FIG. 14b, they are schematic diagrams of a character selection control according to an embodiment of the present application.
  • the background of each line of characters can be drawn in a certain color as shown in Figure 14a in units of character lines.
  • A draggable handle can be drawn at the beginning and end of the clicked line of characters, and the line of characters can be drawn in a manner different from that in Figure 14a, to inform the user that the characters on the current line are available for selection.
  • First, a mapping relationship is established between the CTC sequence index output by the detection and recognition model and the coordinates of the original picture. The model used in the whole process is simple, and the amount of data that needs to be annotated is small. Compared with the prior art, which needs to provide the original picture or the corrected picture, or needs to traverse all the pixels on the picture, the embodiment of the present application can use pictures detected in real time or collected pictures, the calculation speed is fast, and it can be quickly and easily adapted and deployed into an existing OCR model.
  • Second, when determining the character boundary, the embodiment of the present application can set the hyperparameters according to different languages to fine-tune the character boundary and ensure the accuracy of character boundary recognition, which is not only applicable to most languages, but also supports the selection of symbols.
  • Third, the embodiment of the present application obtains the coordinates of each character in the original picture by recording the coordinates of the four vertices of each character in the picture to be recognized, and mapping the coordinates of the four vertices to the original picture according to the segmented area.
  • The granularity of the coordinate output is therefore finer.
  • The recognition between characters, and between characters and symbols, is accurate, and the segmentation of adjacent characters can be realized quickly, which gives stronger universality in practical applications.
  • Fourth, the character selection method provided by the embodiments of the present application also has stronger robustness in locating characters in natural scenes.
  • Fifth, in the embodiment of the present application, by associating the recognized coordinates of each character with the second coordinates of each character in the original picture, and drawing the corresponding recognized text area according to the second coordinates, the user can be supported in making arbitrary selections and range adjustments in the recognized text area.
  • the above text recognition and selection scenarios can also be associated with functions such as translation and question scanning.
  • The selected text can be directly translated, or an identified question can be answered; or, by parsing the string selected by the user, if it contains text content in the format of a telephone number or an email address, a separate card can also be extracted, which is convenient for users to make calls or send emails directly; this has strong ease of use and practicability.
  • Referring to FIG. 15, a schematic flow chart of the steps of a method for selecting characters of curved text provided by another embodiment of the present application is shown, and the method may specifically include the following steps:
  • This embodiment introduces the character selection method from the perspective of the interaction between the user and the mobile phone; that is, this embodiment describes the interaction process between the user and the phone screen when the user uses the mobile phone lens to perform a text selection operation.
  • the terminal device in this embodiment may be a mobile phone, a tablet computer, a smart lens and other devices, and the specific type of the terminal device is not limited in this embodiment.
  • Take the terminal device being a mobile phone as an example. The user can turn on the camera of the mobile phone and aim the lens at a certain picture or object to shoot. Because the user's hand may shake while holding the mobile phone, the image presented on the screen may also shake; therefore, the user can fix the image, that is, fix the frame, by clicking a certain position on the screen.
  • The image captured by the camera in real time, that is, the image displayed on the screen at that moment, is the original image that needs to be input to the OCR detection and recognition model.
  • the original picture contains curved text content to be recognized, and the above-mentioned text area may be one or more lines, and each line of text area may contain different types of characters or symbols, which is not limited in this embodiment.
  • the OCR detection and recognition model can automatically identify each text region in the picture, and output a CTC sequence corresponding to each line of characters or symbols in each text region.
  • the above CTC sequence can be provided to the character selection model for processing to obtain the second coordinates of each character in the original picture.
  • The OCR recognition model is used to identify the characters in the text area and output the corresponding CTC sequence; then the character selection model provided in this embodiment is used to process the CTC sequence, calculating the coordinates of each character in the CTC sequence and mapping them to the second coordinates of each character in the original picture.
  • This process is similar to steps 1101 to 1108 in the foregoing embodiment; the two can be referred to each other, and details are not repeated in this embodiment.
  • a character selection control may be generated in the original image according to the second coordinate.
  • the above-mentioned character selection control may be to draw the character line in each text area as a highlighted background of the first color, so as to inform the user of the currently recognized character range.
  • A highlighted background of the first color covering all the characters in the entire line can be drawn according to the obtained vertex coordinates of each character in the line; alternatively, a highlighted background covering the character line can be drawn according to the vertex coordinates of the first and last characters of the line.
  • Character selection controls are drawn according to the second coordinates of each character in the original picture, as shown in FIG. 14a.
  • the selection event for the target character may refer to an event generated by the user clicking a certain text area on the screen of the mobile phone on which the character selection control has been drawn. For example, a click operation performed by the user in any text area drawn with a highlighted background in Figure 14a.
  • the position of the currently clicked row can be calculated first, and then the clicked text is set as the selected state, and the text background of the second color is drawn.
  • the terminal device monitors the user's moving area, calculates the starting position and ending position of the moving area, sets the area contained in the moving area to the selected state, and draws the area as a dark highlight background.
  • a draggable handle appears at the beginning and end of the area, and the user can modify the selected text range by dragging the handle.
  • the background color to be drawn may be determined according to actual needs, which is not limited in this embodiment.
  • a corresponding draggable handle symbol may also be generated at the beginning and end of the text selected by clicking, which is used to prompt the user to modify the selected range by dragging the handle.
  • the selected text range can be re-determined according to the position of the handle dragged by the user.
  • If the user drags the left handle, the range between the position of the dragged left handle and the right handle that has not been dragged can be identified as the re-determined selected text range; that is, the characters from the right side of the dragged left handle to the last character of the line constitute the selected text range.
  • Similarly, if the user drags the right handle, the range between the left handle that has not been dragged and the position of the dragged right handle can be identified as the re-determined selected text range; that is, the characters from the first character of the line to the character on the left of the dragged right handle constitute the selected text range.
  • the user can also drag the left handle and the right handle successively.
  • In this case, the re-determined text range consists of all characters between the first character to the right of the current position of the dragged left handle and the first character to the left of the current position of the dragged right handle (a range-selection sketch is given below).
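A compact sketch of re-determining the selected range from the two handle positions; the names are illustrative, not from the patent.

```python
def reselect(char_boxes, left_handle_x, right_handle_x):
    """char_boxes: per-character (left, right) x-coordinates in reading order.
    Returns the indices of the characters lying wholly between the handles."""
    return [i for i, (l, r) in enumerate(char_boxes)
            if l >= left_handle_x and r <= right_handle_x]

# Dragging only the right handle leftwards shrinks the selection from the end
# of the line; dragging only the left handle works symmetrically.
```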
  • a text card can pop up on the mobile phone interface to display the text content selected by the user.
  • To determine whether the user has completed the selection operation, the terminal device can wait for a certain period of time, such as 1 second, after the user drags a handle; if no further dragging of a handle is detected, it can be considered that the user has completed the selection operation.
  • Referring to FIG. 16a to FIG. 16c, they are schematic diagrams showing an example effect when characters are selected by the above-mentioned character selection method.
  • The selected characters are shown as "cheng ben hui" in Figure 16a.
  • The user can drag the right handle to the left of the character "hui" to produce the effect shown in Figure 16b, where the selected characters are "cheng ben" ("cost"). If the user drags the right handle to the right of the character "ji", the effect shown in Figure 16c is obtained, where the selected characters are "cheng ben hui ji" ("cost accounting").
  • buttons may also be provided on the text card.
  • the user may click on different buttons to instruct the application program corresponding to the button to perform corresponding operations according to the displayed text.
  • Referring to FIG. 17, it is a schematic diagram of outputting the selected target characters after the user selects them.
  • the above-mentioned selected target character may be obtained by dragging the left and right handles of the character selection control in Fig. 14b or Fig. 16a-Fig. 16c.
  • The user can drag the left handle to the left side of the character "ji", so as to obtain the selection effect shown in Figure 17; the currently selected target characters at this time are "cheng ben hui".
  • the user can click the corresponding button to perform translation, search and other processing on the above target characters.
  • When the mobile phone detects that the user clicks a button such as translation or search, it indicates that the user wishes the mobile phone to perform the corresponding processing on the selected target text. At this point, the mobile phone can call an application program that can perform the above operation.
  • If there is only one application in the mobile phone that can perform the operation, that application can be directly called for processing. For example, after the user clicks the "translate" button, if there is only a dictionary application in the mobile phone that can translate the above target characters, that program can be called directly. If there are multiple applications installed in the mobile phone that can perform the action the user wants, for example, if after the user clicks the "Search" button there are three applications (application A, application B, and application C) that can perform this operation, the interface shown in FIG. 18 may pop up on the mobile phone for the user to select an application to perform the operation. For a selected app, the user can also set it as the default app; when the same processing is performed subsequently, the default application can be called directly to perform the operation.
  • First, a mapping relationship is established between the CTC sequence index output by the detection and recognition model and the coordinates of the original picture. The model used in the whole process is simple, and the amount of data that needs to be annotated is small. Compared with the prior art, which needs to provide the original picture or the corrected picture, or needs to traverse all the pixels on the picture, the embodiment of the present application can use pictures detected in real time or collected pictures, the calculation speed is fast, and it can be quickly and easily adapted and deployed into an existing OCR model.
  • Second, when determining the character boundary, the embodiment of the present application can set the hyperparameters according to different languages to fine-tune the character boundary and ensure the accuracy of character boundary recognition, which is not only applicable to most languages, but also supports the selection of symbols.
  • Third, the embodiment of the present application obtains the coordinates of each character in the original picture by recording the coordinates of the four vertices of each character in the picture to be recognized, and mapping the coordinates of the four vertices to the original picture according to the segmented area.
  • The granularity of the coordinate output is therefore finer.
  • The recognition between characters, and between characters and symbols, is accurate, and the segmentation of adjacent characters can be realized quickly, which gives stronger universality in practical applications.
  • Fourth, the character selection method provided by the embodiments of the present application also has stronger robustness in locating characters in natural scenes.
  • Fifth, in the embodiment of the present application, by associating the recognized coordinates of each character with the second coordinates of each character in the original picture, and drawing the corresponding recognized text area according to the second coordinates, the user can be supported in making arbitrary selections and range adjustments in the recognized text area.
  • the above text recognition and selection scenarios can also be associated with functions such as translation and question scanning.
  • The selected text can be directly translated, or an identified question can be answered; or, by parsing the string selected by the user, if it contains text content in the format of a telephone number or an email address, a separate card can also be extracted, which is convenient for users to make calls or send emails directly; this has strong ease of use and practicability.
  • FIG. 19 shows a structural block diagram of a character selection apparatus according to an embodiment of the present application. For convenience of description, only the part related to the embodiment of the present application is shown.
  • the character selection device 1900 includes: a display module 1901, a processing module 1902, and a storage module 1903, wherein the display module 1901, the processing module 1902, and the storage module 1903 are connected through a bus.
  • the character selection apparatus 1900 may be the terminal device in the foregoing method embodiments, or may be configured as one or more chips in the terminal device.
  • the character selection apparatus 1900 may be used to execute part or all of the functions of the terminal device in the above method embodiments.
  • the display module 1901 is configured to display an original picture in the display interface of the terminal device, where the original picture contains curved text;
  • The processing module 1902 is configured to: detect the original picture and generate a picture to be recognized that contains straight text, where the text content of the straight text corresponds one-to-one with the text content of the curved text; recognize the text content of the straight text to obtain a connectionist temporal classification (CTC) sequence corresponding to the text content of the straight text, where the CTC sequence includes a plurality of characters; calculate the first coordinate of each of the plurality of characters in the to-be-recognized picture; determine the segmented area where the first coordinate corresponding to each of the plurality of characters is located; determine, according to the original picture and the to-be-recognized picture, the perspective transformation matrix corresponding to each segmented area when the original picture is transformed into the to-be-recognized picture; multiply the first coordinate of each of the plurality of characters by the perspective transformation matrix to obtain the second coordinate of each of the plurality of characters in the original picture; and detect a first operation of the user on the original picture, where the first operation is used to select characters in the curved text on the original picture.
  • the display module 1901 is configured to highlight the selected character according to the second coordinate of each character in the plurality of characters in the original picture.
  • the processing module 1902 is further configured to generate first prompt information in the original picture according to the second coordinate corresponding to each character in the plurality of characters, and the first prompt information is used for Indicates that the user can select characters in the original picture.
  • The processing module 1902 is specifically configured to: recognize the text content of the straight text according to the to-be-recognized picture to obtain an initial CTC sequence; determine the length of the initial CTC sequence and the picture width of the picture to be recognized; and, if the product of the length of the initial CTC sequence and a preset downsampling multiple is greater than the picture width of the to-be-recognized picture, crop the initial CTC sequence to obtain the CTC sequence corresponding to the text content of the straight text, where the product of the length of the cropped CTC sequence and the preset downsampling multiple is less than or equal to the picture width of the to-be-recognized picture.
  • The processing module 1902 is specifically configured to: cut the head element or tail element of the initial CTC sequence in turn; after cutting any head element or tail element, check whether the product of the length of the cropped initial CTC sequence and the preset downsampling multiple is less than or equal to the picture width of the picture to be recognized; and, if so, stop the cropping and output the CTC sequence corresponding to the text content of the straight text.
  • The processing module 1902 is specifically configured to: determine the character boundary coordinates of each of the plurality of characters in the CTC sequence, where the character boundary coordinates include left boundary coordinates and right boundary coordinates; and calculate, according to the character boundary coordinates of each of the plurality of characters, the first coordinates of each of the plurality of characters in the to-be-recognized picture.
  • The processing module 1902 is specifically configured to: obtain, for any character in the CTC sequence, the original right boundary coordinate of the character and the original left boundary coordinate of the next character; calculate the average of the original right boundary coordinate and the original left boundary coordinate; and determine, based on the average, the right boundary coordinate of the character and the left boundary coordinate of the next character.
  • The processing module 1902 is specifically configured to: respectively determine the first character type of the character and the second character type of the next character, where the first character type and the second character type have corresponding offsets; subtract the offset corresponding to the first character type from the average to obtain a first difference, and use the first difference as the right boundary coordinate of the character; and add the offset corresponding to the second character type to the average to obtain a second sum, and use the second sum as the left boundary coordinate of the next character.
  • The first coordinates of each character in the to-be-recognized picture include first vertex coordinates, second vertex coordinates, third vertex coordinates, and fourth vertex coordinates.
  • The processing module 1902 is specifically configured to: multiply the left boundary coordinate and the right boundary coordinate of each of the plurality of characters by a preset downsampling multiple to obtain the first vertex coordinates and the second vertex coordinates of each of the plurality of characters in the to-be-recognized picture; and determine, according to the first vertex coordinates, the second vertex coordinates, and the picture height of the picture to be recognized, the third vertex coordinates and the fourth vertex coordinates of each of the plurality of characters in the picture to be recognized.
  • The processing module 1902 is specifically configured to traverse the first coordinates of each character and the segmented areas of the to-be-recognized picture, and respectively determine the first segmented area where the first vertex coordinates are located, the second segmented area where the second vertex coordinates are located, the third segmented area where the third vertex coordinates are located, and the fourth segmented area where the fourth vertex coordinates are located.
  • The processing module 1902 is specifically configured to: multiply the first vertex coordinates by the perspective transformation matrix of the first segmented area to obtain third coordinates; multiply the second vertex coordinates by the perspective transformation matrix of the second segmented area to obtain fourth coordinates; multiply the third vertex coordinates by the perspective transformation matrix of the third segmented area to obtain fifth coordinates; and multiply the fourth vertex coordinates by the perspective transformation matrix of the fourth segmented area to obtain sixth coordinates; where the third coordinates, the fourth coordinates, the fifth coordinates, and the sixth coordinates serve as the second coordinates of the character in the original picture.
  • The processing module 1902 is specifically configured to: draw, according to the second coordinate corresponding to each of the plurality of characters, the character area of each of the plurality of characters in the original picture in a first color; or draw, according to the second coordinate corresponding to each of the plurality of characters, a text box for the character area of each of the plurality of characters in the original picture.
  • The processing module 1902 is further configured to: when a click event of the user in the character area is detected, draw the entire line of character areas including the clicked character area in a second color; monitor the user's drag event on the character area corresponding to the second color, and adjust the character area covered by the second color according to the drag event; and identify and display each character in the character area covered by the second color.
  • The processing module 1902 is further configured to: when a click event of the user in the character area is detected, draw the character area closest to the click position indicated by the click event in a second color; monitor the user's drag event on the character area corresponding to the second color, and adjust the character area covered by the second color according to the drag event; and identify and display each character in the character area covered by the second color.
  • The processing module 1902 is further configured to: when a sliding event of the user in the character area is detected, draw the character area indicated by the sliding event in a second color; monitor the user's drag event on the character area corresponding to the second color, and adjust the character area covered by the second color according to the drag event; and identify and display each character in the character area covered by the second color.
  • the storage module 1903 is used for storing necessary program instructions and data of the character selection device.
  • FIG. 20 shows a schematic diagram of a possible structure of a character selection apparatus 2000 in the above embodiment, and the character selection apparatus 2000 may be configured as the aforementioned user equipment.
  • The character selection apparatus 2000 may include: a processor 2002, a computer-readable storage medium/memory 2003, a transceiver 2004, an input device 2005, an output device 2006, and a bus 2001, where the processor, the transceiver, the computer-readable storage medium, and the like are connected through the bus.
  • the embodiments of the present application do not limit the specific connection medium between the above components.
  • The output device 2006 displays an original picture, where the original picture contains curved text; the processor 2002 detects the original picture and generates a to-be-recognized picture containing straight text, where the text content of the straight text corresponds one-to-one with the text content of the curved text;
  • the text content of the straight text is recognized to obtain a connectionist temporal classification (CTC) sequence corresponding to the text content of the straight text, where the CTC sequence includes a plurality of characters; the first coordinate of each of the plurality of characters in the to-be-recognized picture is calculated; the segmented area where the first coordinate corresponding to each of the plurality of characters is located is determined; according to the original picture and the to-be-recognized picture, the perspective transformation matrix corresponding to each segmented area when the original picture is transformed into the to-be-recognized picture is determined; the first coordinate of each of the plurality of characters is multiplied by the perspective transformation matrix to obtain the second coordinate of each of the plurality of characters in the original picture; a first operation of the user on the original picture is detected, where the first operation is used to select characters in the curved text on the picture; and the output device 2006 highlights the selected characters according to the second coordinate of each of the plurality of characters in the original picture.
  • the processor 2002 may include baseband circuitry.
  • Transceiver 2004 may include radio frequency circuitry.
  • the processor 2002 may run an operating system to control functions between various devices and devices.
  • Transceiver 2004 may include baseband circuitry and radio frequency circuitry.
  • the transceiver 2004 and the processor 2002 can implement the corresponding steps in any of the foregoing embodiments in FIG. 3 to FIG. 18 , and details are not repeated here.
  • FIG. 20 only shows a simplified design of the character selection device.
  • In actual applications, the character selection device can include any number of transceivers, processors, memories, etc., and all character selection devices that can implement the present application are within the scope of protection of the present application.
  • The processor 2002 involved in the above apparatus 2000 may be a general-purpose processor, such as a CPU, a network processor (NP), or a microprocessor, or may be an ASIC or one or more integrated circuits used to control the execution of the programs of the solution of the present application. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a controller/processor may also be a combination that implements computing functions, such as a combination comprising one or more microprocessors, a combination of a DSP and a microprocessor, and the like. Processors typically perform logical and arithmetic operations based on program instructions stored in memory.
  • the above-mentioned bus 2001 may be a peripheral component interconnect (PCI for short) bus or an extended industry standard architecture (EISA for short) bus or the like.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is shown in FIG. 20, but it does not mean that there is only one bus or one type of bus.
  • the computer-readable storage medium/memory 2003 mentioned above may also store an operating system and other application programs.
  • the program may include program code, and the program code includes computer operation instructions.
  • the above-mentioned memory may be ROM, other types of static storage devices that can store static information and instructions, RAM, other types of dynamic storage devices that can store information and instructions, disk storage, and the like.
  • the memory 2003 may be a combination of the above storage types.
  • the above-mentioned computer-readable storage medium/memory may be in the processor, outside the processor, or distributed over multiple entities including the processor or processing circuit.
  • the computer-readable storage medium/memory described above may be embodied in a computer program product.
  • a computer program product may include a computer-readable medium in packaging materials.
  • The embodiments of the present application further provide a general-purpose processing system, for example, commonly referred to as a chip, where the general-purpose processing system includes: one or more microprocessors that provide processor functions, and an external memory that provides at least a part of a storage medium, all of which are connected together with other support circuits through an external bus architecture.
  • When the instructions are executed, the processor is caused to execute some or all of the steps of the method of the character selection apparatus in the embodiments of FIG. 3 to FIG. 18, and/or other processes of the technology described in this application.
  • the steps of the methods or algorithms described in conjunction with the disclosure of the present application may be implemented in a hardware manner, or may be implemented in a manner in which a processor executes software instructions.
  • The software instructions can be composed of corresponding software modules, and the software modules can be stored in a RAM, a flash memory, a ROM, an EPROM, an EEPROM, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art.
  • An exemplary storage medium is coupled to the processor, such that the processor can read information from, and write information to, the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and storage medium may reside in an ASIC. Alternatively, the ASIC may be located in the terminal.
  • the processor and storage medium may also exist in the character selection device as discrete components.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • The division of the units is only a logical function division; in actual implementation, there may be other division manners.
  • For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • The shown or discussed mutual coupling or direct coupling or communication connection may be implemented through some interfaces; the indirect coupling or communication connection between devices or units may be in electrical, mechanical, or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Character Input (AREA)

Abstract

An embodiment of this application, applicable to the field of artificial-intelligence technology, provides a character selection method and apparatus for curved text. The method includes: displaying and detecting an original picture, and generating a to-be-recognized picture containing straight-line text; obtaining a connectionist temporal classification (CTC) sequence corresponding to the text content, and computing the first coordinates, in the to-be-recognized picture, of every character in the CTC sequence; determining the segmented region of the to-be-recognized picture in which the first coordinates fall; multiplying the first coordinates of every character by the piecewise perspective transformation matrix between the original picture and the to-be-recognized picture to obtain the second coordinates of every character in the original picture; and, upon detecting the user's first operation on the original picture, highlighting the selected characters according to the second coordinates. With this method, when the user manually selects characters in curved text, the positioning accuracy of character locations and the efficiency and accuracy of manual character selection are improved.

Description

一种弯曲文本的字符选择方法、装置和终端设备
本申请要求于2020年10月31日提交中国国家知识产权局、申请号为202011199028.4、发明名称为“一种弯曲文本的字符选择方法、装置和终端设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请属于人工智能技术领域,尤其涉及一种弯曲文本的字符选择方法、装置和终端设备。
背景技术
光学字符识别(Optical Character Recognition,OCR)是一种通过手机、扫描仪或数码相机等电子设备检查纸上的字符,并基于字符的暗、亮模式确定其形状,然后用字符识别方法将形状翻译成计算机文字的过程。OCR是计算机视觉(Computer Vision,CV)领域的一个重要应用场景,也为增强现实(Augmented Reality,AR)技术在翻译、图像语义理解等众多领域的应用提供了基础能力。通常,OCR可以包括两个步骤,即文本区域检测和文字内容识别,前一步骤可以检测图像中何处是文本区域,后一步骤则可以识别出文本区域中的文本具体是什么内容。而在自然场景的文本中,文本的形状在很多情况下往往不是水平的,其形状可能是圆弧形的弯曲文本,也可能是波浪形的弯曲文本。而弯曲文本的场景对OCR的检测和识别都带来了很大的挑战。
在手机智能镜头识物场景中,智能镜头可以完成多物体检测和文本检测识别任务,并且会提供物体和文本框位置,用户可以点击手机屏幕中的文本行,此时屏幕会定帧,用户可以点选检测和识别到的文字内容,其效果类似于文本文档里的用鼠标对文本内容逐字符拖选的功能。针对点选到的内容,用户可以选择复制、翻译或搜索等后续操作。
文字点选场景分为直线文本场景和弯曲文本场景。直线文本场景中,文本行都为带角度的矩形框、平行四边形框或者四边形框,文本行一般由四个顶点描述。在弯曲文本场景中,为了描述弯曲的走向,文本框一般是由一个多边形组成,该多边形又可以分解为诸多四边形,由这些四边形拼接成该多边形。在直线文本场景中,由于原图和待识别图之间只是简单的透视变换映射关系,如果获取了待识别图上各个文字的坐标,可以很容易获取在原图上直线文本中每个文本的坐标。然而,在弯曲文本场景中,由于较强的不连续性,以及弯曲文本本身较为复杂的描述方式,获取各个文字的坐标要比直线场景中要困难得多。
发明内容
本申请实施例提供了一种弯曲文本的字符选择方法、装置和终端设备,用于实现准确获取弯曲文本中各个文字的坐标,从而实现字符选择更精确。
第一方面,本申请实施例提供了一种弯曲文本的字符选择方法,应用于终端设备,所述方法包括:
The display interface of a terminal device displays an original picture containing curved text; the terminal device then detects the original picture and generates a to-be-recognized picture containing straight-line text, where the text content of the straight-line text corresponds one-to-one to the text content of the curved text; the terminal device then recognizes, from the to-be-recognized picture, a connectionist temporal classification (CTC) sequence corresponding to the text content of the straight-line text, where the CTC sequence includes multiple characters; the terminal device then computes the first coordinates, in the to-be-recognized picture, of each of the multiple characters in the CTC sequence; the terminal device then determines the segmented region of the to-be-recognized picture in which the first coordinates of each character fall; the terminal device then determines, from the original picture and the to-be-recognized picture, the perspective transformation matrix corresponding to each segmented region when the original picture is transformed into the to-be-recognized picture; the first coordinates of each character are then multiplied by the corresponding perspective transformation matrix to obtain the second coordinates of each character in the original picture; finally, the terminal device detects the user's first operation on the original picture and highlights the selected characters according to the second coordinates of each character in the original picture, where the first operation is used to select characters in the curved text on the original picture.
本申请实施例提供的弯曲文本的字符选择方法通过计算文本内容中每个字符在与该文本内容相对应的CTC序列中的坐标,然后根据CTC序列索引与待识别图片的文本坐标之间的对应关系得到每个字符在待识别图片中的第一坐标,然后根据该待识别图片与原始图片之间的分段透视变换关系,对于第一坐标进行一一对应计算得到每个字符在原始图片中的第二坐标,提高了第二坐标计算的准确度,同时也提高了根据第二坐标绘制得到的字符选择控件的精准度,使得用户在对原始图片中的字符进行点选时,终端设备能够准确地对字符进行定位,输出已选中的字符,提高字符选择以及识别的效率和准确率。同时,配置于终端设备的OCR检测模型可以对原始图片进行检测,生成包含有原始图片中弯曲文本的文本内容的待识别图片,上述待识别图片可以作为OCR识别模型的输入数据,通过OCR识别模型对上述文本内容进行识别,可以得到与该文本内容相对应的CTC序列。本申请实施例提供的弯曲文本的字符选择方法可以直接采用终端设备中已配置的具备弯曲文本检测能力的OCR检测模型和OCR识别模型,有助于扩大本方法的应用范围,降低终端设备采用本方法的技术难度。
可选的,该终端设备在得到该多个字符中每个字符的第二坐标之后,检测该第一操作之前,该终端设备还可以根据所述多个字符中每个字符对应的所述第二坐标,在所述原始图片中生成第一提示信息,所述第一提示信息用于指示用户可对所述原始图片中的字符进行选择。这样可以更方便用户进行字符点选。
可选的,与文本内容相对应的CTC序列可以是指对于OCR识别模型输出的初始CTC序列进行处理后的序列,对初始CTC序列进行处理可以通过本申请实施例提供的字符选择模型实现。通过OCR识别模型对上述文本内容进行识别,可以输出初始CTC序列。然后,字符选择模型可以确定初始CTC序列的长度以及待识别图片的图片宽度,判断初始CTC序列的长度与字符选择模型的下采样倍数之间的乘积是否大于待识别图片的图片宽度。如果上述乘积大于待识别图片的图片宽度,则需要对初始CTC序列进行一定的裁剪,使得裁剪后的CTC序列的长度与模型的下采样倍数之间的乘积小于或等于待识别图片的图片宽度。
可选的,对CTC序列进行裁剪可以按照依次裁剪初始CTC序列的头部元素或尾部元素来进行。对于需要裁剪的初始CTC序列,可以首先裁剪该序列的一个头部元素,然后再裁剪一个尾部元素。在每裁剪一个头部元素或一个尾部元素后,可以再次计算裁剪后的序列的长度与预设的下采样倍数的乘积是否已经小于或等于待识别图片的图片宽度。如果在某一次的裁剪后,得到的序列的长度与预设的下采样倍数的乘积已经小于或等于待识别图片 的图片宽度,则可以停止剪裁,将当前获得的CTC序列进行输出,即为与文本内容相对应的CTC序列。如果某一次裁剪后得到的序列的长度与预设的下采样倍数的乘积仍然大于待识别图片的图片宽度,则需要按照上述顺序继续进行裁剪,直到序列长度与下采样倍数的乘积小于或等于待识别图片的图片宽度。
本申请实施例通过对CTC序列进行部分裁剪,可以减少后续处理的数据量,提高处理效率。
Optionally, after the cropping of the CTC sequence is completed, parameters may be initialized. The parameters to be initialized may include the left boundary, the right boundary, the current state, the previous state, a coordinate array, a content array, and so on.
可选的,计算CTC序列中每个字符的第一坐标在该待识别图片中的第一坐标可以通过确定CTC序列中每个字符的左边界坐标和右边界坐标来完成。结合每个字符在CTC序列中的左边界坐标和右边界坐标,可以得到该字符在待识别图片中的第一坐标。
可选的,每个字符在CTC序列中的字符边界可以通过确定该字符的右边界以及相邻的下一字符的左边界的方式获得。针对CTC序列中的任一字符,可以首先获取该字符的原始右边界坐标,以及下一字符的原始左边界坐标,再计算上述原始右边界坐标与原始左边界坐标的平均值,通过对平均值进行微调,可以得到该字符的右边界坐标,以及下一字符的左边界坐标。
可选的,对各个字符的边界进行微调可以根据不同的字符类型来进行。因此,在对字符边界进行微调时,可以分别确定该字符的第一字符类型和下一字符的第二字符类型,如果第一字符类型和第二字符类型不同,则二者分别具有不同的偏移量,如果二者字符类型相同,则偏移量也相同。对于当前字符,可以通过计算上述平均值减去当前字符的第一字符类型所述对应的偏移量,得到第一差值,然后将第一差值作为当前字符微调后的右边界坐标;对于相邻的下一字符,可以通过计算平均值加上第二字符类型对应的偏移量,得到第二和值,然后将第二和值作为下一字符微调后的左边界坐标。
由于各个字符在CTC序列中的第一坐标与该字符在待识别图片中的坐标具有对应关系,本申请实施例通过对各个字符在CTC序列中的边界进行微调,可以防止在后续的处理过程中,点选某个字符时,根据第二坐标生成的字符选择空间无法完全覆盖该字符的问题,保证了字符定位位置的准确性。
可选的,由于CTC序列中的每个元素是由待识别图片经过OCR识别模型中的卷积神经网络按照一定倍数下采样并在最后一层特征图上通过4×4卷积核进行滑窗得到,因此,CTC序列中一个元素对应待识别图片中下采样倍数个像素宽度覆盖的范围,进而,在根据每个字符的字符边界坐标,计算每个字符在待识别图片中的第一坐标时,可以将每个字符的左边界坐标和右边界坐标分别与预设的下采样倍数相乘,得到每个字符在待识别图片中的左边界位置和右边界位置,然后根据该左边界位置和该右边界位置得到该字符在待识别图片中的第一顶点坐标和第二顶点坐标。在此基础上,结合待识别图片的图片高度,可以确定出每个字符在待识别图片中的第三顶点坐标和第四顶点坐标,即得到每个字符在该待识别图片中的四个顶点坐标。每个字符在该待识别图片中按照矩形框进行划分,每个字符占据一个矩形框,该四个顶点坐标用于描述每个字符在该待识别图片中所处矩形框。可选 的,在得到每个字符的第一顶点坐标、第二顶点坐标、第三顶点坐标以及第四顶点坐标之后,该终端设备可以遍历该字符,然后分别确定所述第一顶点坐标所处第一分段区域,所述第二顶点坐标所处第二分段区域,所述第三顶点坐标所处第三分段区域和所述第四顶点坐标所处第四分段区域。其中,该分段区域用于指示该待识别图片与原图片之间对应的分段透视变换区域。比如某个字符的第一顶点坐标位于该待识别图片中的A矩形框,而该A矩形框对应于原图片中的A区域,而第二顶点坐标位于该待识别图片中的B矩形框,则该B矩形框对应于原片中的B区域,因此该字符在待识别图片中的坐标在映射到原图片中时,第一顶点坐标与A矩形框与A区域之间透视变换矩阵相乘得到该第一顶点坐标在原图片中的坐标,而该第二顶点坐标与B矩形框与B区域之间透视变换矩阵相乘得到该第二顶点坐标在原图片中的坐标。
可选的,在得到各个顶点坐标所处的分段区域之后,该终端设备根据将所述第一顶点坐标与所述第一分段区域的透视变换矩阵相乘得到第三坐标,将所述第二顶点坐标与所述第二分段区域的透视变换矩阵相乘得到第四坐标,将所述第三顶点坐标与所述第三分段区域的透视变换矩阵相乘得到第五坐标,将所述第四顶点坐标与所述第四分段区域的透视变换矩阵相乘得到第六坐标;其中,所述第三坐标、所述第四坐标、所述第五坐标和所述第六坐标作为字符在所述原始图片中的第二坐标。这样确定每个字符的四个顶点坐标所处的分段区域,并根据相应的分段区域的透视变换矩阵得到第二坐标,提高了第一坐标的计算准确性,保证了在点选某个字符时,字符选择控件可以完整覆盖该字符所在的区域。
可选的,字符选择控件可以通过不同的背景颜色进行展示。在原始图片中生成字符选择控件时,可以根据第二坐标,将每个字符的字符区域绘制为第一颜色。第一颜色可以是任意一种颜色,但需要注意的是,对于将字符区域绘制为第一颜色后,应当避免由于第一颜色而给用户识别字符造成不便。本申请实施例通过在原始图片中生成字符选择控件,可以提示用户哪些区域是已识别到的文本区域,方便用户在该区域中对字符进行点选。
可选的,当终端设备在原始图片中生成字符选择控件后,可以实时地对用户在字符区域的点击事件进行监听,并判断用户的点击是否在文本框内。当监听到用户点击某一字符区域时(即用户点击区域在文本框内),终端设备可以重新绘制该区域的背景颜色,用于区别未被点击到的其他区域。当监听到用户的点击区域不在任何一个文本框内时,则不做响应。
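The check "is the tap inside a text box" described above reduces to a point-in-polygon test of the tap position against each character's second coordinates. The following minimal Python sketch is illustrative only and not part of the patent disclosure; it assumes each character box is a convex quadrilateral, consistent with the four-vertex description used in this application, and all names are hypothetical:

```python
def point_in_quad(px, py, quad):
    """True if (px, py) lies inside a convex quadrilateral given as four
    (x, y) vertices in order: the point is inside when the cross products
    against all four edges share one sign."""
    prev = 0.0
    for i in range(4):
        x1, y1 = quad[i]
        x2, y2 = quad[(i + 1) % 4]
        cross = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
        if cross != 0:
            if prev != 0 and (cross > 0) != (prev > 0):
                return False  # sign flip: the point is outside this edge
            prev = cross
    return True

def hit_test(px, py, char_quads):
    """Index of the tapped character box, or None when the tap misses every
    box -- in which case, per the text, the tap produces no response."""
    for idx, quad in enumerate(char_quads):
        if point_in_quad(px, py, quad):
            return idx
    return None
```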
一种可能实现方式中,可以将包含点击到的字符区域的整行字符区域绘制为与第一颜色不同的第二颜色。然后继续监听在第二颜色对应的字符区域内是否存在拖动事件。如果存在,则可以根据拖动事件调整第二颜色所覆盖的字符区域,调整后的字符区域也就是用户期望点选的字符区域。对于调整后的第二颜色所覆盖的字符区域内的各个字符,可以输出OCR识别模型识别得到的结果,并将结果在终端设备的显示界面中进行展示。
另一种可能实现方式中,可以将包含点击到的字符区域(即离点击位置最近的字符所处区域)绘制为与第一颜色不同的第二颜色。然后继续监听在第二颜色对应的字符区域内是否存在拖动事件。如果存在,则可以根据拖动事件调整第二颜色所覆盖的字符区域,调整后的字符区域也就是用户期望点选的字符区域。对于调整后的第二颜色所覆盖的字符区 域内的各个字符,可以输出OCR识别模型识别得到的结果,并将结果在终端设备的显示界面中进行展示。
另一种可能实现方式中,可以将用户在屏幕上滑动点击到的字符区域(如用户手指在屏幕上的滑动起始位置到终止位置的区域)绘制为与第一颜色不同的第二颜色。然后继续监听在第二颜色对应的字符区域内是否存在拖动事件。如果存在,则可以根据拖动事件调整第二颜色所覆盖的字符区域,调整后的字符区域也就是用户期望点选的字符区域。对于调整后的第二颜色所覆盖的字符区域内的各个字符,可以输出OCR识别模型识别得到的结果,并将结果在终端设备的显示界面中进行展示。本实施例中,该拖动事件可以理解为该终端设备为选中的文本框绘制出第二颜色的背景颜色之后,该文本框内的首尾将出现可拖动的左右控件,而用户可以通过拖动两个控件,从而修改文字选中范围。
本申请实施例通过在原始图片中绘制出第一颜色的提示信息,可以提示用户哪些区域内的字符是可供点选的。当监听用户的点击事件时,可以根据点击位置的不同,将对应区域内的整行文字区域绘制为第二颜色,以告知用户当前已选中的字符内容是哪些。同时,根据实际需要,用户可以通过在绘制有第二颜色的字符区域中进行拖动,调整选中的字符内容。本申请实施例通过在终端界面上实现与用户之间的交互,提高了字符选择的操作便捷性。
第二方面,本申请实施例提供了一种字符选择装置,具有实现上述第一方面中终端设备行为的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块。
在一个可能的实现方式中,该装置包括用于执行以上第一方面各个步骤的单元或模块。例如,该装置包括:显示模块,用于在终端设备的显示界面中显示原始图片,所述原始图片中包含弯曲文本;处理模块,用于检测所述原始图片,生成包含有直线文本的待识别图片,所述直线文本的文本内容与所述弯曲文本的文本内容一一对应;根据所述待识别图片,识别所述直线文本的所述文本内容,获得与所述直线文本的所述文本内容相对应的连接时序分类序列,所述连接时序分类序列包括了多个字符;计算所述多个字符中每个字符在所述待识别图片中的第一坐标;确定所述多个字符中每个字符对应的所述第一坐标所处的分段区域;根据所述原始图片和所述待识别图片,确定将所述原始图片变换为所述待识别图片时各个分段区域对应的透视变换矩阵;将所述多个字符中每个字符的第一坐标与所述透视变换矩阵相乘,得到所述多个字符中每个字符在所述原始图片中的第二坐标;检测到用户在所述原始图片上的第一操作;然后该显示模块,用于根据所述多个字符中每个字符在所述原始图片中的第二坐标突出显示被选中的字符,所述第一操作用于对所述原始图片上的弯曲文本中的字符进行选择。
可选的,该处理模块,还用于根据所述多个字符中每个字符对应的所述第二坐标,在所述原始图片中生成第一提示信息,所述第一提示信息用于指示用户可对所述原始图片中的字符进行选择。
可选的,该处理模块,具体用于根据所述待识别图片,识别所述直线文本的所述文本内容,获得初始连接时序分类序列;确定所述初始连接时序分类序列的长度,以及确定所 述待识别图片的图片宽度;若所述初始连接时序分类序列的长度与预设的下采样倍数的乘积大于所述待识别图片的图片宽度,则对所述初始连接时序分类序列进行裁剪,获得与所述直线文本的所述文本内容相对应的连接时序分类序列;其中,裁剪后获得的所述连接时序分类序列的长度与预设的下采样倍数的乘积小于或等于所述待识别图片的图片宽度。
可选的,该处理模块,具体用于依次裁剪所述初始连接时序分类序列的头部元素或尾部元素;当裁剪任一头部元素或尾部元素后,计算裁剪后的初始连接时序分类序列的长度与预设的下采样倍数的乘积是否小于或等于所述待识别图片的图片宽度;若裁剪后的初始连接时序分类序列的长度与预设的下采样倍数的乘积小于或等于所述待识别图片的图片宽度,则停止剪裁,输出与所述直线文本的所述文本内容相对应的连接时序分类序列。
可选的,该处理模块,具体用于确定所述连接时序分类序列中所述多个字符中每个字符的字符边界坐标,所述字符边界坐标包括左边界坐标和右边界坐标;根据所述多个字符中每个字符的字符边界坐标,计算所述多个字符中每个字符在所述待识别图片中的第一坐标。
可选的,该处理模块,具体用于针对所述连接时序分类序列中任一字符,获取所述字符的原始右边界坐标,以及下一字符的原始左边界坐标;计算所述原始右边界坐标与所述原始左边界坐标的平均值;基于所述平均值,确定所述字符的右边界坐标,以及下一字符的左边界坐标。
可选的,该处理模块,具体用于分别确定所述字符的第一字符类型和所述下一字符的第二字符类型,所述第一字符类型和所述第二字符类型分别具有相应的偏移量;计算所述平均值减去所述第一字符类型对应的偏移量,得到第一差值,将所述第一差值作为所述字符的右边界坐标;计算所述平均值加上所述第二字符类型对应的偏移量,得到第二和值,将所述第二和值作为所述下一字符的左边界坐标。
可选的,所述每个字符在所述待识别图片中的第一坐标包括第一顶点坐标、第二顶点坐标、第三顶点坐标以及第四顶点坐标;该处理模块,具体用于将所述多个字符中每个字符的左边界坐标和右边界坐标分别与预设的下采样倍数相乘,得到所述多个字符中每个字符在所述待识别图片中的第一顶点坐标和第二顶点坐标;根据所述第一顶点坐标、第二顶点坐标以及所述待识别图片的图片高度,确定所述多个字符中每个字符在所述待识别图片中的第三顶点坐标和第四顶点坐标。
可选的,该处理模块,具体用于遍历所述每个字符的第一坐标以及所述待识别图片的分段区域,分别确定所述第一顶点坐标所处第一分段区域,所述第二顶点坐标所处第二分段区域,所述第三顶点坐标所处第三分段区域和所述第四顶点坐标所处第四分段区域。
可选的,该处理模块,具体用于将所述第一顶点坐标与所述第一分段区域的透视变换矩阵相乘得到第三坐标,将所述第二顶点坐标与所述第二分段区域的透视变换矩阵相乘得到第四坐标,将所述第三顶点坐标与所述第三分段区域的透视变换矩阵相乘得到第五坐标,将所述第四顶点坐标与所述第四分段区域的透视变换矩阵相乘得到第六坐标;其中,所述第三坐标、所述第四坐标、所述第五坐标和所述第六坐标作为字符在所述原始图片中的第二坐标。
可选的,该处理模块,具体用于根据所述多个字符中每个字符对应的第二坐标,在所述原始图片中将所述多个字符中每个字符的字符区域绘制为第一颜色;或,根据所述多个字符中每个字符对应的第二坐标,在所述原始图片中为所述多个字符中每个字符的字符区域绘制文本框。
可选的,该处理模块,还用于当监听到所述用户在字符区域的点击事件时,将包含点击到的字符区域的整行字符区域绘制为第二颜色;监听所述用户在所述第二颜色对应的字符区域的拖动事件,根据所述拖动事件调整所述第二颜色所覆盖的字符区域;
识别并展示所述第二颜色所覆盖的字符区域内的各个字符。
可选的,该处理模块,还用于当监听到所述用户在字符区域的点击事件时,将距离所述点击事件指示的点击位置最近的字符区域绘制为第二颜色;监听所述用户在所述第二颜色对应的字符区域的拖动事件,根据所述拖动事件调整所述第二颜色所覆盖的字符区域;识别并展示所述第二颜色所覆盖的字符区域内的各个字符。
可选的,该处理模块,还用于当监听到所述用户在字符区域的滑动事件时,将包含滑动事件指示的字符区域绘制为第二颜色;监听所述用户在所述第二颜色对应的字符区域的拖动事件,根据所述拖动事件调整所述第二颜色所覆盖的字符区域;识别并展示所述第二颜色所覆盖的字符区域内的各个字符。
可选的,还包括存储模块,用于保存字符选择装置必要的程序指令和数据。
可选的,还包括获取模块,用于获取该原始图片。
在一种可能的实现方式中,该装置包括:处理器和收发器,该处理器被配置为支持字符选择装置执行上述第一方面提供的方法中相应的功能。收发器用于指示字符选择装置与通信系统中其他设备之间的通信,比如将选中的字符发送给其他设备。可选的,此装置还可以包括存储器,该存储器用于与处理器耦合,其保存字符选择装置必要的程序指令和数据。
在一种可能的实现方式中,当该字符选择装置为字符选择装置内的芯片时,该芯片包括:处理模块和收发模块。该收发模块例如可以是该芯片上的输入/输出接口、管脚或电路等。该处理模块例如可以是处理器,此处理器用于检测所述原始图片,生成包含有直线文本的待识别图片,所述直线文本的文本内容与所述弯曲文本的文本内容一一对应;根据所述待识别图片,识别所述直线文本的所述文本内容,获得与所述直线文本的所述文本内容相对应的连接时序分类序列,所述连接时序分类序列包括了多个字符;计算所述多个字符中每个字符在所述待识别图片中的第一坐标;确定所述多个字符中每个字符对应的所述第一坐标所处的分段区域;根据所述原始图片和所述待识别图片,确定将所述原始图片变换为所述待识别图片时各个分段区域对应的透视变换矩阵;将所述多个字符中每个字符的第一坐标与所述透视变换矩阵相乘,得到所述多个字符中每个字符在所述原始图片中的第二坐标;检测到用户在所述原始图片上的第一操作,所述第一操作用于对所述原始图片上的弯曲文本中的字符进行选择。该处理模块可执行存储单元存储的计算机执行指令,以支持字符选择装置执行上述第一方面提供的方法。可选地,该存储单元可以为该芯片内的存储单元,如寄存器、缓存等,该存储单元还可以是位于该芯片外部的存储单元,如只读存 储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
在一种可能的实现方式中,该字符选择装置包括:处理器,射频电路和天线。其中处理器用于实现对各个电路部分功能的控制并实现上述第一方面的方法。而射频电路可以对该处理器生成的待发送信息进行模拟转换、滤波、放大和上变频等处理后,再经由天线发送给通信系统中的其他设备。可选的,该装置还包括存储器,其保存字符选择装置必要的程序指令和数据。
在一种可能实现方式中,该装置包括通信接口和逻辑电路,该逻辑电路,用于检测所述原始图片,生成包含有直线文本的待识别图片,所述直线文本的文本内容与所述弯曲文本的文本内容一一对应;根据所述待识别图片,识别所述直线文本的所述文本内容,获得与所述直线文本的所述文本内容相对应的连接时序分类序列,所述连接时序分类序列包括了多个字符;计算所述多个字符中每个字符在所述待识别图片中的第一坐标;确定所述多个字符中每个字符对应的所述第一坐标所处的分段区域;根据所述原始图片和所述待识别图片,确定将所述原始图片变换为所述待识别图片时各个分段区域对应的透视变换矩阵;将所述多个字符中每个字符的第一坐标与所述透视变换矩阵相乘,得到所述多个字符中每个字符在所述原始图片中的第二坐标;检测到用户在所述原始图片上的第一操作,所述第一操作用于对所述原始图片上的弯曲文本中的字符进行选择。
其中,上述任一处提到的处理器,可以是一个通用中央处理器(central processing unit,CPU),处理器,特定应用集成电路(application-specific integrated circuit,ASIC),或一个或多个用于控制上述各方面弯曲文本的字符选择方法的程序执行的集成电路。
第三方面,本申请实施例提供了一种终端设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现上述第一方面所述的弯曲文本的字符选择方法。
第四方面,本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被终端设备的处理器执行时实现上述第一方面所述的弯曲文本的字符选择方法。
第五方面,本申请实施例提供了一种计算机程序产品,当所述计算机程序产品在终端设备上运行时,使得终端设备执行上述第一方面所述的弯曲文本的字符选择方法。
本申请的一些实施例可以包括以下有益效果中的一项或多项:
本申请的一些实施例,通过利用OCR检测识别模型中现有的卷积神经网络(Convolutional Neural Network,CNN)结构,根据CNN下采样(8倍)的特点,建立了检测识别模型输出的CTC序列索引到原始图片坐标之间的映射关系,整个过程使用的模型简单,需要标注的数据量少;在处理时,相对于现有技术中需要提供原图或者矫正后的图,或者需要遍历图上所有像素的方案,本申请的一些实施例可以使用实时检测或采集到的图片,计算速度快,可快速、方便地适配和部署到现有的OCR模型中。其次,本申请的一些实施例在确定字符边界时,可以根据不同语种设置超参对字符边界进行微调,保证字 符边界识别的准确性,不仅适用于大部分语种,同时也支持针对符号的选择。第三,本申请的一些实施例通过记录每个字符在待识别图片中的四个顶点坐标,并将四个顶点坐标按照分段区域分别映射至原始图片得到每个字符在原始图片中的坐标,使得坐标输出的颗粒度更细,在执行手动选择操作时,对于文字与文字、文字与符号之间的识别准确,可以快速地实现对相邻字符的分割,在实际应用中具有更强的普适性。第四,得益于OCR检测识别模型的鲁棒性,本申请的一些实施例所提供的字符选择方法在自然场景下对字符的定位也具有更强的鲁棒性。第五,本申请的一些实施例通过将识别得到的各个字符的坐标与原始图片中各个字符的坐标(为便于区分,本文中也称为第二坐标)关联,并根据第二坐标绘制出相应的已识别出的文本区域,可以支持用户在上述已识别的文本区域中进行任意的选择和范围调整。第六,上述文字识别和选择场景还可以与翻译、扫题等功能相关联,待用户选定后,可以直接翻译选中的文字,或者解答识别到的题目;或者,通过解析用户选中的字符串,如果包含电话、邮箱等格式的文本内容,还可以提取出单独的卡片,方便用户直接拨打电话或发送邮件,具有较强的易用性和实用性。
附图说明
图1a至图1b是文字点选场景的场景示意图;
图2a至图2c是现有技术中的一种弯曲文本的文字点选方案的识别结果的示意图;
图3是现有技术中的一种CTC序列与文字识别或语音识别结果的对应关系示意图;
图4为本申请实施例中的OCR检测识别模型与字符选择模型之间的一个流程与逻辑关系示意图;
图5为本申请实施例中的弯曲文本的一个检测处理过程示意图;
图6为本申请实施例中的弯曲文本变换为直线文本的一个处理流程示意图;
图7为本申请实施例中获取待识别图片上的文字坐标的一个整体处理过程示意图;
图8为本申请实施例中获取待识别图片上的文字坐标的一个算法流程示意图;
图9为本申请实施例中弯曲文本的字符选择方法所适用于的手机的硬件结构示意图;
图10为本申请实施例中弯曲文本的字符选择方法所适用于的手机的软件结构示意图;
图11为本申请实施例中弯曲文本的字符选择方法的一个实施例示意图;
图12为本申请实施例在待识别图片上各个字符的坐标的展示效果示意图;
图13为本申请实施例中待识别图片上坐标显示效果与矩形片段结合的效果示意图;
图14a至图14b为本申请实施例提供的字符选择控件的一个实施例示意图;
图15为本申请实施例中弯曲文本的字符选择方法的另一个实施例示意图;
图16a至图16c为本申请实施例中弯曲文本的字符选择方法选中字符时的一个实例效果展示示意图;
图17为本申请实施例中在用户选中目标字符后,输出已选中的目标字符的示意图;
图18为本申请实施例中多个应用程序选择界面示意图;
图19为本申请实施例中字符选择装置的一个示例性结构框图;
图20为本申请实施例中字符选择装置的另一个示例性结构框图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,下面结合附图,对本申请的实施例进行描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。本领域普通技术人员可知,随着新应用场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或模块的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或模块,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或模块。在本申请中出现的对步骤进行的命名或者编号,并不意味着必须按照命名或者编号所指示的时间/逻辑先后顺序执行方法流程中的步骤,已经命名或者编号的流程步骤可以根据要实现的技术目的变更执行次序,只要能达到相同或者相类似的技术效果即可。本申请中所出现的单元的划分,是一种逻辑上的划分,实际应用中实现时可以有另外的划分方式,例如多个单元可以结合成或集成在另一个系统中,或一些特征可以忽略,或不执行,另外,所显示的或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,单元之间的间接耦合或通信连接可以是电性或其他类似的形式,本申请中均不作限定。并且,作为分离部件说明的单元或子单元可以是也可以不是物理上的分离,可以是也可以不是物理单元,或者可以分布到多个电路单元中,可以根据实际的需要选择其中的部分或全部单元来实现本申请方案的目的。本申请中所使用的术语只是为了描述特定实施例的目的,而并非旨在作为对本申请的限制。如在本申请的说明书和所附权利要求书中所使用的那样,单数表达形式“一个”、“一种”、“所述”、“上述”、“该”和“这一”旨在也包括例如“一个或多个”这种表达形式,除非其上下文中明确地有相反指示。还应当理解,在本申请实施例中,“一个或多个”是指一个、两个或两个以上;“和/或”,描述关联对象的关联关系,表示可以存在三种关系;例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A、B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。
为了便于理解,首先对现有技术的文字点选方案进行介绍:在手机智能镜头(如Google Lens、百度智能镜头和华为HiVsion)识物场景中,只有镜头在AR场景可以完成多物体检测和文本检测识别(OCR)任务,并且会提供物体和文本框位置,用户可以点击手机屏幕中的文本行,此时屏幕会定帧,用户可以点选检测和识别到的文字内容,其效果类似于文本文档里的用鼠标对文本内容逐字符拖选的功能。针对点选到的内容,用户可以选择复制、翻译或搜索等后续操作。文字点选场景分为直线文本场景(如图1a所示)和弯曲文本场景(如图1b所示)。直线文本场景中,文本行都为带角度的矩形框、平行四边形框或者四边形框,文本行一般由四个顶点描述。在弯曲文本场景中,为了描述弯曲的走向,文本框一般是由一个多边形组成,该多边形又可以分解为诸多四边形,由这些四边形拼接成该多边 形。在直线文本场景中,由于原图和待识别图之间只是简单的透视变换映射关系,如果获取了待识别图上各个文字的坐标,可以很容易获取在原图上直线文本中每个文本的坐标。然而,在弯曲文本场景中,由于较强的不连续性,以及弯曲文本本身较为复杂的描述方式,获取各个文字的坐标要比直线场景中要困难得多。
As shown in FIG. 2a to FIG. 2c, which illustrate the recognition results of a text tap-to-select solution in the prior art, that solution has the following defects:
1) the detection boxes overlap severely;
2) the detection boxes are all stitched together from straight-line text boxes;
3) in some scenes the highlighted region is wrong, which misleads the user;
4) character segmentation is inaccurate;
5) the visual effect for the user is poor;
6) the tap-to-select experience is poor.
The first three points show that, when curved text is handled, the logic of straight-line text is essentially being used. This approach cannot describe curved text well in the first place: severe jaggedness and discontinuity inevitably appear where the straight-line text boxes meet; moreover, splitting semantically continuous text at the box boundaries seriously harms the user experience. For example, in the result shown in FIG. 2a, because the curved text box is described by straight-line-box logic, the displayed box is heavily jagged. If the tapped characters lie in a non-overlapping region of a single straight-line box, the recognized result is exactly what was tapped: tapping the two characters “技术” in FIG. 2a yields “技术”. If, however, the tapped characters span several straight-line boxes, semantically continuous text may be split at the box boundaries: tapping “技术有限公司消费” in FIG. 2b yields “技术合采有限公司公司消费”. To recognize the result accurately, the highlighted region can instead be enlarged around the tapped characters, as in FIG. 2c, where the tapped characters “公司消费者BG智慧” are covered by the large rectangle, at the cost of a poor visual effect for the user.
因此,为了解决上述各种点选方案中存在的问题,提高文字选择的效率和准确率,本申请实施例提供了一种基于OCR识别模型输出的CTC序列的弯曲文本的字符选择方法。
本申请实施例提供的弯曲文本的字符选择方法主要使用到经OCR识别模型识别后,输出的CTC序列,基于输出的CTC序列,获取各个字符供用户在屏幕中逐字符地进行手动选择。
连接时序分类是一种损失度(Loss)计算方法,主要应用在文字识别和语音识别等序列识别模型中,大多数序列识别模型采用CNN+循环神经网络(Recurrent Neural Network,RNN)+CTC的结构。使用CTC Loss代替基于逻辑回归的损失度计算方法(Softmax Loss),训练样本无需对齐,CTC Loss一般具有以下两个特点:
1)引入空白(blank)字符,解决某些位置没有字符的问题;
2)通过递推算法,快速计算梯度。
如图3所示,是现有技术中的一种CTC序列与文字识别或语音识别结果的对应关系示意图。在图3中,符号“∈”代表blank,作为区分重复字符的占位符。即在两个blank之间的重复字符将会被合并成一个字符。例如,对于图3中输入的语音序列,经识别出的对 应文本为“wwo∈rrr∈lld”,在对两个blank之间重复的“w”和“r”进行合并后,可以输出最终的识别结果,即单词“world”。
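The merging rule illustrated by FIG. 3 can be captured in a few lines. The Python sketch below is a hedged illustration rather than the patent's implementation; the function name `ctc_collapse` and the use of “∈” as the blank placeholder simply follow the figure:

```python
def ctc_collapse(seq, blank="∈"):
    """Greedy CTC decoding: merge repeated symbols that are not separated
    by a blank, then drop all blanks."""
    out = []
    prev = blank
    for ch in seq:
        if ch != blank and ch != prev:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("wwo∈rrr∈lld"))  # prints "world"
```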
如图4所示,是本申请一实施例提供的OCR检测识别模型与字符选择模型之间的流程与逻辑关系示意图。其中,OCR检测模型可以对输入图片(即包含弯曲文本的原始图片)进行检测,得到多边形弯曲文本框。在矫正流程中,模型通过分段透视变换将弯曲文本框转化为包含直线文本(该直线文本与原始图片中的弯曲文本的文本内容一一对应)的待识别图片。上述输入图片可以是通过手机镜头、扫描仪等设备拍摄或扫描得到的图像,待识别图片可以看作是用于提供给OCR识别模型进行处理的输入数据。在采用CNN网络对待识别图片进行特征提取后,OCR识别模型可以输出包含识别出的字符的CTC序列。
Optionally, in the process of text-box detection, the processing pipeline for curved text boxes may be as shown in FIG. 5: (a) in FIG. 5 is the curved text in the original picture; a centre line is fitted as shown in (b) of FIG. 5; the text box and the rotation angle are then fitted as shown in (c) of FIG. 5; after the text box is fitted, redundant boxes are removed by non-maximum suppression as shown in (d) of FIG. 5, and smoothing is performed to obtain the curved text box shown in (e) of FIG. 5.
而在通过透视变换将原始图片中的弯曲文本生成待识别图片中的直线文本的具体过程可以如图6所示。其中,图6中的a为原始图片中的弯曲文本,然后通过弯曲文本检测得到如图6中的b所示的弯曲文本框;图6中的c用于指示该弯曲文本框由多个四边形组成,而虚线为四边形之间的交接处;通过透视变换得到如图6中的d所示的直线文本框。
需要说明的是,在现有技术中,基于OCR的检测识别模型大部分均使用CNN+RNN+CTC进行序列识别流程。其中,CNN用于特征提取,RNN+CTC用于序列识别。而本申请实施例提供的字符选择方法所对应的识别模型,是仅采用CNN作为网络结构,下采样8倍后直接在最后一层特征图上滑窗,从而得到CTC序列,整个识别模型并未涉及到任何RNN结构。
OCR检测识别模型输出的CTC序列可以作为本申请实施例的字符选择模型的输入序列,通过确定不同字符的在CTC序列中的索引坐标,然后将该索引坐标映射至待识别图片中得到待识别图的坐标,然后再通过分段逆透视变换将该待识别图片上的坐标映射至原图片上,可以得到每个字符在原图片中的坐标,供用户逐字符选择。
以用户使用手机镜头对图片进行检测,并在实时呈现的手机屏幕上对识别出的文字进行手动选择为例。用户可以通过打开手机相机,控制镜头对准某一图片或物体。由于用户手持手机的过程中可能出现晃动,导致屏幕中呈现的图像也可能晃动,因此用户可以通过点击屏幕中的某一位置,将图像固定住。在用户点击屏幕的那一刻,相机实时地采集到的图像,即该时刻呈现在屏幕中的图像,便是需要输入至OCR检测识别模型的输入图片。
参见图7和图8,分别是本申请实施例提供的获取待识别图片上的文字坐标的整体处理过程示意图和算法流程示意图。针对上述输入的待识别图片,OCR检测识别模型可以自动对图片中的文本区域进行检测,并采用CNN网络作为OCR识别模型的网络结构,可以下采样8倍后直接在CNN最后一层特征图上滑窗,输出检测识别模型的CTC序列,字符选择模型在对上述序列进行裁剪后,可以得到满足后续处理要求的CTC序列,即图7中的序列[bl,bl,成,bl,bl,bl,本,bl,bl,bl,会,bl,bl,bl,计,bl,bl,bl,人,bl,bl,bl,员,bl,bl]。
按照图8所示的算法流程,字符选择模型可以针对图7中的CTC序列,计算出不同字符的索引坐标,即“成本会计人员”各个字符在CTC序列中的索引坐标,即{[2,3],[6,7],[10,11],[14,15],[18,19],[22,23]}。上述索引坐标将被映射回待识别图片(即输入图片检测得到的待识别图片),得到“成本会计人员”各个字符在待识别图片中的坐标,即各个字符的左右边界。其由于字符选择模型已计算得到待识别图片中每个字符的左右边界,再结合待识别图片的固定高度,字符选择模型可以得到待识别图片中每个字符上下左右四个点的坐标。另一方面,在通过上述方式得到字符的坐标之后(即也可以理解为在图像显示界面处于定帧状态时),将识别出来的各个文本框绘制第一颜色(比如浅色高亮背景颜色),然后终端设备在监听到用户在文本框区域的点击的点击事件之后,将选中的文本绘制第二颜色(如深色高亮背景颜色)。字符选择模型可以监听用户在手机屏幕上的点击事件。一种可能实现方式中,当监听到用户点击字符区域时,通过计算当前点击的行位置,可以将点击到的整行文字设置为选中状态,将这行文字绘制为深色高亮背景。同时,可以在整行文字的首尾出现可拖动的把手,用户可以通过拖动把手修改选中的文字范围。另一种可能实现方式中,当监听到用户点击字符区域时,通过计算当前点击的位置,选择距离该点击位置最近的字符设置为选中状态,将该字符绘制为深色高亮背景,然后在该字符的左右边界出现可拖动的把手,用户可以通过拖动把手修改选中的文字范围。另一种可能实现方式中,终端设备监听用户的移动区域,通过计算移动区域的起启位置和终结位置,将其包含的区域设置为选中状态,并将该区域绘制为深色高亮背景,同时在区域的首尾出现可拖动的把手,用户可以通过拖动把手修改选中的文字范围。
在拖动把手的过程中,若用户拖动其中一个把手,字符选择模型可以从用户拖动把手产生的点击事件中获取用户的点击位置,将距离该位置最近的字符作为当前选中的字符。在根据用户拖动把手的情况,重复上述步骤后,可以得到用户最终选中的文字区域,并绘制出深色背景。
在用户停止拖动把手后,例如,可以等待1秒,若没有监听到用户的点击事件,可以认为用户当前点选的文字已经满足预期,字符选择模型可以在手机屏幕上弹出文字卡片,将用户选中的文字展示出来。同时,文字卡片上还可以提供其他操作按钮,对于展示出来的文字,用户可以通过点击不同的按钮,指示与该按钮对应的应用程序根据所展示的文字,执行相应的操作。例如,对展示的文字进行翻译、搜索等等。
下面结合具体的实施例对本申请的弯曲文本的字符选择方法进行介绍。
本申请实施例提供的弯曲文本的字符选择方法可以应用于手机、平板电脑、可穿戴设备、车载设备、增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本、个人数字助理(personal digital assistant,PDA)等电子设备上,本申请实施例对电子设备的具体类型不作任何限制。
The electronic device being a mobile phone is taken as an example. It should be understood that the electronic device 900 may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration. The various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software including one or more signal-processing and/or application-specific integrated circuits.
电子设备900可以包括:处理器910,外部存储器接口920,内部存储器921,通用串行总线(universal serial bus,USB)接口930,充电管理模块940,电源管理模块941,电池942,天线1,天线2,移动通信模块950,无线通信模块960,音频模块970,扬声器970A,受话器970B,麦克风970C,耳机接口970D,传感器模块980,按键990,马达991,指示器992,摄像头993,显示屏994以及用户标识模块(subscriber identification module,SIM)卡接口995等。其中传感器模块980可以包括压力传感器980A,陀螺仪传感器980B,气压传感器980C,磁传感器980D,加速度传感器980E,距离传感器980F,接近光传感器980G,指纹传感器980H,温度传感器980J,触摸传感器980K,环境光传感器980L,骨传导传感器980M等。
可以理解的是,本发明实施例示意的结构并不构成对电子设备900的具体限定。在本申请另一些实施例中,电子设备900可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
The processor 910 may include one or more processing units. For example, the processor 910 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices or may be integrated into one or more processors.
其中,控制器可以是电子设备900的神经中枢和指挥中心。控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器910中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器910中的存储器为高速缓冲存储器。该存储器可以保存处理器910刚用过或循环使用的指令或数据。如果处理器910需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器910的等待时间,因而提高了系统的效率。
在一些实施例中,处理器910可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。
电子设备900通过GPU,显示屏994,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏994和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器910可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏994用于显示图像,视频等。显示屏994包括显示面板。显示面板可以采用液 晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,电子设备900可以包括1个或N个显示屏994,N为大于1的正整数。
电子设备900可以通过ISP,摄像头993,视频编解码器,GPU,显示屏994以及应用处理器等实现拍摄功能。
ISP用于处理摄像头993反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头993中。
摄像头993用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,电子设备900可以包括1个或N个摄像头993,N为大于1的正整数。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现电子设备900的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
图10是本发明实施例的电子设备900的软件结构框图。
分层架构将软件分成若干个层,每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中,将Android系统分为四层,从上至下分别为应用程序层,应用程序框架层,安卓运行时(Android runtime)和系统库,以及内核层。
应用程序层可以包括一系列应用程序包。
如图10所示,应用程序包可以包括相机,图库,日历,通话,地图,导航,WLAN,蓝牙,音乐,视频,短信息等应用程序(也可以称为应用)。
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层包括一些预先定义的函数。
如图10所示,应用程序框架层可以包括窗口管理器,内容提供器,视图系统,电话管理器,资源管理器,通知管理器,本地Profile管理助手(Local Profile Assistant,LPA)等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。
视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。
电话管理器用于提供电子设备900的通信功能。例如通话状态的管理(包括接通,挂断等)。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话界面形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,电子设备振动,指示灯闪烁等。
安卓运行时(Android Runtime)包括核心库和虚拟机。Android runtime负责安卓系统的调度和管理。
核心库包含两部分:一部分是java语言需要调用的功能函数,另一部分是安卓的核心库。
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。
系统库可以包括多个功能模块。例如:表面管理器(surface manager),媒体库(Media Libraries),三维图形处理库(例如:OpenGL ES),二维图形引擎(例如:SGL)等。
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了二维(2-Dimensional,2D)和三维(3-Dimensional,3D)图层的融合。
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:MPEG4,H.264,MP3,AAC,AMR,JPG,PNG等。
三维图形处理库用于实现3D图形绘图,图像渲染,合成,和图层处理等。
2D图形引擎是2D绘图的绘图引擎。
内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动,虚拟卡驱动。
下面结合捕获拍照场景,示例性说明电子设备900软件以及硬件的工作流程。
当触摸传感器980K接收到触摸操作,相应的硬件中断被发给内核层。内核层将触摸操作加工成原始输入事件(包括触摸坐标,触摸操作的时间戳等信息)。原始输入事件被存储在内核层。应用程序框架层从内核层获取原始输入事件,识别该输入事件所对应的控件。以该触摸操作是触摸单击操作,该单击操作所对应的控件为相机应用图标的控件为例,相机应用调用应用框架层的接口,启动相机应用,进而通过调用内核层启动摄像头驱动,通 过摄像头993捕获静态图像或视频。
以下实施例可以在具有上述硬件结构/软件结构的手机900上实现。以下实施例将以手机900为例,对本申请实施例提供的基于字符识别的字符选择方法进行说明。
参照图11,示出了本申请一实施例提供的弯曲文本的字符选择方法的示意性步骤流程图,作为示例而非限定,该方法可以应用于上述手机900中,该方法具体可以包括如下步骤:
1101、Detect the original picture and generate a to-be-recognized picture containing straight-line text, where the original picture contains curved text and the text content of the straight-line text corresponds one-to-one to the text content of the curved text.
需要说明的是,本实施例是针对OCR检测识别模型输出后的CTC序列进行处理,从而在屏幕中呈现出可供用户进行手动选择的提示信息的角度,来对本申请实施例的弯曲文本的字符选择方法进行的介绍,即本实施例描述的是图4中的字符选择模型接收到输入CTC序列,并对输入CTC序列进行处理的过程。
在本申请实施例中,待识别图片可以是通过对原始图片进行检测得到的,上述原始图片可以是在用户打开手机相机后,实时呈现在手机屏幕上的图像,该图像中包含有弯曲文本的待识别文本内容。而该待识别图片为将该弯曲文本变换为直线文本之后的图片。
1102、识别所述文本内容,获得与所述文本内容相对应的CTC序列。
在本申请实施例中,与文本内容相对应的CTC序列可以是指对于上述输入CTC序列作初步处理后的序列。CTC序列是OCR识别模型的中间输出结果,识别模型输出CTC序列后,通过一定的规则解析CTC序列可获得最终的识别结果。任一CTC序列均可以包含有多个CTC序列元素,在本申请中,即待识别序列通过CNN最后一层特征图通过4×4卷积核滑窗得到的元素,其可能为待识别序列的一个字符,或者为blank;元素的值可以是识别到的字符,也可以是字符的索引。
本实施例中的初步处理可以包括对输入CTC序列进行裁剪和参数初始化等处理过程。
在具体实现中,对输入CTC序列进行裁剪,可以基于后续处理过程中的CNN下采样倍数来进行,使裁剪后的CTC序列与下采样倍数的乘积不大于待识别图片的图片宽度。在进行裁剪时,可以按先剪裁输入CTC序列头部的字符,然后剪裁尾部的字符,再次剪裁头部的字符的顺序进行。
在每裁剪完一个字符后,可以计算一次剩余的CTC序列长度与CNN下采样倍数的乘积是否小于或等于待识别图片的图片宽度。若乘积已经小于或等于待识别图片的图片宽度,则停止裁剪,执行下一步骤;否则继续按上述步骤裁剪下一个字符。
例如,假设待识别图片中的输入CTC序列为:
[bl,bl,bl,成,bl,bl,bl,本,bl,bl,bl,会,bl,bl,bl,计,bl,bl,bl,人,bl,bl,bl,员,bl,bl]
其中,bl为空白字符(blank),待识别图片的尺寸为32*199(高*宽)像素。
1.对输入的CTC序列进行裁剪
a)规则:CTC序列长度(26)乘以CNN下采样比例(8)不大于图片宽度(199);
b)具体执行步骤:按先裁头,后裁尾,再裁头的顺序裁剪;
c)上述步骤中有任何一个步骤满足a)中的条件则返回。
对于示例中的输入CTC序列,可得到如下裁剪过程:
[bl,bl,bl,成,bl,bl,bl,本,bl,bl,bl,会,bl,bl,bl,计,bl,bl,bl,人,bl,bl,bl,员,bl,bl]
↓(裁头)
[bl,bl,成,bl,bl,bl,本,bl,bl,bl,会,bl,bl,bl,计,bl,bl,bl,人,bl,bl,bl,员,bl,bl]
↓(裁尾)
[bl,bl,成,bl,bl,bl,本,bl,bl,bl,会,bl,bl,bl,计,bl,bl,bl,人,bl,bl,bl,员,bl]
裁掉头尾后满足条件(剩余序列长度24*下采样倍数8=192<图片宽度199),则返回当前CTC序列。
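The walkthrough above can be written as a short loop. The Python sketch below reproduces the example exactly (26 elements, width 199, downsampling factor 8, two elements cropped); the strict head–tail alternation is read off step b) and should be treated as an assumption about the general rule:

```python
def crop_ctc(seq, img_width, downsample=8):
    """Alternately drop a head element and a tail element until
    len(seq) * downsample <= img_width."""
    crop_head = True
    while len(seq) * downsample > img_width:
        seq = seq[1:] if crop_head else seq[:-1]
        crop_head = not crop_head
    return seq

bl = "bl"  # blank placeholder
seq = ([bl] * 3 + ["成"] + [bl] * 3 + ["本"] + [bl] * 3 + ["会"]
       + [bl] * 3 + ["计"] + [bl] * 3 + ["人"] + [bl] * 3 + ["员"] + [bl] * 2)
out = crop_ctc(seq, img_width=199)        # 26 -> 25 (head) -> 24 (tail)
assert len(out) == 24 and out[0] == bl and out[-1] == bl
assert len(out) * 8 == 192                # 192 <= 199, matching the walkthrough
```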
对剪裁后的CTC序列进行参数初始化可以包括:
a)初始化左边界(-1)、右边界(-1)、当前状态(0)、之前状态(blank);
b)初始化坐标数组和内容数组,初始均为空。
字符边界可以包括左边界和右边界,即CTC序列中每个元素或字符的左端和右端的位置,字符边界可分为在CTC序列中的边界和在原图中的边界。本步骤需要初始化的是在CTC序列中的边界。
当前状态可以是指每个元素的类型。例如,数字、中文、日文、韩文或英文等等。
坐标数组可以是指存放所有字符索引坐标的数组,内容数组则是存放每个字符或单词内容的数组。在初始化时,可以将坐标数组和内容数组均置为空。
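One way to picture this initialization is a plain dictionary; the field names below are hypothetical and simply mirror the parameters listed in a) and b):

```python
state = {
    "left": -1, "right": -1,        # character boundaries within the CTC sequence
    "cur_state": 0,                 # type of the current element (digit, CJK, ...)
    "prev_state": "blank",          # type/value of the previous element
    "coords": [],                   # index coordinates of every character
    "content": [],                  # recognized character or word contents
}
```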
1103、确定所述CTC序列中的每个字符的字符边界坐标。
In this embodiment of the application, the left and right character boundaries of each element may be obtained and recorded by traversing every element in the CTC sequence.
例如,对于如下的CTC序列:
[bl,bl,成,bl,bl,bl,本,bl,bl,bl,会,bl,bl,bl,计,bl,bl,bl,人,bl,bl,bl,员,bl]
The first character “成” may be processed first: its right boundary and the left boundary of the next character “本” are recorded. After all characters are processed in this way, each distinct character and its corresponding array index range are obtained.
在初步确定了字符或单词的左右边界之后,由于检测识别模型对不同语种输出的CTC序列的处理规律不一致,可能需要针对不同语种对左右边界进行微调。因此,在本申请实施例中,对于不同类型的字符,还可以按照一定的规则对其边界进行微调。
在对字符边界进行微调时,可以按照如下步骤进行:
a)首先求相邻字符的左右边界的坐标的均值;
b)均值分别减掉或者加上一定的偏移量(offset),可得到各个字符新的左右边界。
In a specific implementation, for two adjacent characters, the mean of the boundary coordinates may first be computed from the right-boundary coordinate of the preceding character and the left-boundary coordinate of the following character. For example, in the CTC sequence of the example above, for the adjacent characters “成” and “本”, the mean of the two coordinates may be computed from the right-boundary coordinate of “成” and the left-boundary coordinate of “本”.
然后,根据上述坐标均值,减掉或加上一定的偏移量,便可以得到各个字符新的左右边界。其中,对于相邻的两个字符中的前一个字符,可以使用计算得到的坐标均值减去上述偏移量,得到前一字符的新的右边界坐标;而对于后一字符,则可以使用坐标均值加上上述偏移量,得到后一字符的新的左边界坐标。
在本申请实施例中,对于不同语种或类型的字符,其偏移量offset可以按照如下规则确定:
对于中文、日文或韩文,偏移量offset可以为1/8;对于西文(拉丁文),偏移量offset可以为1/2;对于数字,偏移量offset也可以为1/2。
因此,在按照语种对各个字符的左右边界进行微调后,对于上述示例的CTC序列,可以得到如下的坐标结果:
[(1.0,4.375),(4.625,8.375),(8.625,12.375),(12.625,16.375),(16.625,20.375),(20.625,23)]
上述坐标结果即是“成本会计人员”六个字符在CTC序列中的坐标。
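The boundary computation of this step can be sketched as follows. The code reproduces the coordinate list above when applied to the cropped sequence `out` from the earlier cropping sketch; the raw-span convention (a character at index i initially spans [i, i+1)) and the treatment of the first and last characters are reconstructed from this worked example rather than stated in the text, so treat them as assumptions. For simplicity one offset is used on both sides of a boundary; with mixed scripts, each side would use the offset of its own character type (1/8 for Chinese/Japanese/Korean, 1/2 for Latin script and digits):

```python
OFFSET = {"cjk": 0.125, "latin": 0.5, "digit": 0.5}  # per-script hyper-parameters

def char_boundaries(seq, blank="bl", offset=OFFSET["cjk"]):
    """Left/right index boundaries of every non-blank element in a cropped
    CTC sequence: neighbouring characters split the gap between them at the
    mean of the previous raw right edge and the next raw left edge, nudged
    by the per-script offset."""
    idxs = [i for i, ch in enumerate(seq) if ch != blank]
    bounds = []
    for k, i in enumerate(idxs):
        left = i - 1.0 if k == 0 else (idxs[k - 1] + 1 + i) / 2 + offset
        right = i + 1.0 if k == len(idxs) - 1 else (i + 1 + idxs[k + 1]) / 2 - offset
        bounds.append((left, right))
    return bounds

print(char_boundaries(out))
# [(1.0, 4.375), (4.625, 8.375), (8.625, 12.375),
#  (12.625, 16.375), (16.625, 20.375), (20.625, 23.0)]
```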
1104、根据所述每个字符的字符边界坐标,计算所述每个字符在所述待识别图片中的第一坐标。
通常,OCR识别以32*512像素大小的尺寸作为输入,输出的特征图大小为4*65像素,下采样倍数为8,最后在进行识别时,采用4*4像素的卷积核进行滑窗,即CTC序列中一个元素对应原图8个像素宽度覆盖的范围。
因此,在本申请实施例中,将前一步骤计算得到的每个字符的字符边界乘以CNN下采样倍数,可以得到每个字符在待识别图片中的坐标。
例如,对于如下坐标结果:
[(1.0,4.375),(4.625,8.375),(8.625,12.375),(12.625,16.375),(16.625,20.375),(20.625,23)]
在将每个字符的左右边界坐标乘以下采样倍数8之后,可以得到每个字符在待识别图片中的坐标,该坐标的单位均为(一个)像素:
[(8,35.0),(37.0,67.0),(69.0,99.0),(101.0,131.0),(133.0,163.0),(165.0,184)]
通常,如果按照上述坐标进行展示,则可能存在每行序列的首尾两个字符会有部分像素在坐标对应的像素点之外。例如,对于首字符“成”,如果按照左边界坐标8进行展示,则该字符可能会有部分像素在坐标8更左侧;类似地,对于尾字符“员”,如果按照右边界坐标184进行展示,则该字符可能会有部分像素在坐标184更右侧。
因此,为了保证最终映射得到的坐标所对应的区域能完整地覆盖每个字符的像素点,还可以对首尾字符的坐标进行微调。
In a specific implementation, the objects of this fine-tuning may be the left-boundary coordinate of the first character in the to-be-recognized picture and the right-boundary coordinate of the last character. After the first and last characters are fine-tuned, the final left/right boundary coordinates of each character in the to-be-recognized picture may be:
[(2,35.0),(37.0,67.0),(69.0,99.0),(101.0,131.0),(133.0,163.0),(165.0,197)]
That is, a certain value is subtracted from the left-boundary coordinate of the first character and a certain value is added to the right-boundary coordinate of the last character; the values subtracted and added may be the same or different, which is not limited in this embodiment.
上述得到的坐标仅仅是各个字符在待识别图片中的左右边界坐标,因此,在后续处理过程中,可以结合待识别图片的高度,得到每个字符四个顶点的坐标,即每个字符左上、左下、右上和右下四个顶点的坐标。
由于OCR识别采用32*512像素的固定尺寸作为输入数据,因此,其高度可以固定为32像素。因此,对于上述示例中的首字符“成”,其四个顶点的坐标可以表示为[(2,32),(2,0),(35.0,0),(35.0,32)]。
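Continuing the sketch: multiplying by the downsampling factor, padding the first/last character outward, and attaching the fixed 32-pixel height yields the four vertices. The 2-pixel margin below is a guessed value that merely reproduces the example (8 → 2 and 184 → 197 at width 199); the text leaves the exact padding open:

```python
def to_vertices(bounds, img_w, img_h=32, downsample=8, margin=2):
    """Scale CTC-index boundaries to pixels in the rectified picture and
    return four vertices per character:
    bottom-left, top-left, top-right, bottom-right."""
    px = [[l * downsample, r * downsample] for l, r in bounds]
    px[0][0] = margin            # widen the first character leftwards
    px[-1][1] = img_w - margin   # widen the last character rightwards
    return [[(l, img_h), (l, 0), (r, 0), (r, img_h)] for l, r in px]

# With `out` from the cropping sketch above:
boxes = to_vertices(char_boundaries(out), img_w=199)
print(boxes[0])  # [(2, 32), (2, 0), (35.0, 0), (35.0, 32)] -- the character "成"
```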
如图12所示,即是在待识别图片上各个字符的坐标的展示效果示意图。
1105、确定所述第一坐标所处分段区域。
如图6所示,该弯曲文本通过分段透视变换生成直线文本时,各个字符都存在一个相应的矩形片段。如图13的矩形分界线示意图所示,每个字符的顶点坐标并不一定处于同一个矩形片段内。如“成”的四个顶点坐标均处于第一个矩形片段内。但是“会”的左下和左上处于第三个矩形片段内,但是右下和右上处于第四个矩形片段内。
1106、根据所述原始图片和所述待识别图片,确定将所述原始图片变换为所述待识别图片时的各个分段区域对应的透视变换矩阵。
1107、将所述每个字符的第一坐标与所述透视变换矩阵相乘,得到所述每个字符在所述原始图片中的第二坐标。
对于每个顶点,根据所在小矩形对应的透视变换关系,乘以逆透视变换矩阵,得到该字符的各顶点坐标在原始图片的弯曲检测框中的坐标。
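This per-vertex mapping can be sketched as below. `seg_x` (the x borders of the segments in the rectified picture) and `fwd_mats` (one 3×3 matrix per segment, mapping the corresponding quadrilateral of the original picture onto its rectified rectangle) would come from the detection/rectification stage; here they are placeholders, and all names are illustrative:

```python
import numpy as np

def segment_of(x, seg_x):
    """Index of the rectified-image segment containing x.
    seg_x: ascending border coordinates, e.g. [0, 50, 100, 150, 199]."""
    for k in range(len(seg_x) - 1):
        if seg_x[k] <= x <= seg_x[k + 1]:
            return k
    raise ValueError("x lies outside the rectified picture")

def vertex_to_original(pt, seg_x, fwd_mats):
    """Map one rectified-image vertex into the original picture by applying
    the inverse of its segment's perspective matrix (homogeneous coords)."""
    k = segment_of(pt[0], seg_x)
    v = np.linalg.inv(fwd_mats[k]) @ np.array([pt[0], pt[1], 1.0])
    return (v[0] / v[2], v[1] / v[2])

# Placeholder geometry for illustration only: 4 segments, identity transforms.
seg_x = [0, 50, 100, 150, 199]
fwd_mats = [np.eye(3) for _ in range(4)]
# Each of a character's four vertices is mapped independently, so e.g. the
# left and right edges of "会" may use the matrices of different segments.
second_coords = [[vertex_to_original(p, seg_x, fwd_mats) for p in quad]
                 for quad in boxes]   # `boxes` from the previous sketch
```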
1108、根据所述第二坐标,在所述原始图片中生成字符选择控件。
在本申请实施例中,对于识别得到的原始图片中各个文本区域内每个字符的第二坐标,可以根据该第二坐标,在原始图片中生成字符选择控件。上述字符选择控件可以是将各个文本区域中的字符行绘制为高亮背景,以告知用户当前已识别出的文字范围。
另一方面,还可以对用户在手机屏幕,即通过手机屏幕所呈现的原始图片上的点击事件进行监听。当监听到用户点击字符区域时,通过计算当前点击的行位置,可以将点击到的整行文字设置为选中状态,将这行文字绘制为深色高亮背景。同时,可以在整行文字的首尾出现可拖动的把手,用户可以通过拖动把手修改选中的文字范围。
如图14a和图14b所示,是本申请实施例的一种字符选择控件的示意图。对于已识别出的文字范围,可以以字符行为单位,将每行字符的背景绘制为图14a所示的某种颜色。当监听到用户在某行字符范围内的点击事件时,可以如图14b所示,在被点击的该行字符的首尾绘制出可拖动的把手,并将该行字符绘制为与图14a不同的另一种颜色,以告知用户可以针对当前行的字符进行选择。
本申请实施例,通过利用OCR检测识别模型中现有的卷积神经网络(Convolutional Neural Network,CNN)结构,根据CNN下采样(8倍)的特点,建立了检测识别模型输出的CTC序列索引到原始图片坐标之间的映射关系,整个过程使用的模型简单,需要标注的数据量少;在处理时,相对于现有技术中需要提供原图或者矫正后的图,或者需要遍历图上所有像素的方案,本申请实施例可以使用实时检测或采集到的图片,计算速度快,可快速、方便地适配和部署到现有的OCR模型中。其次,本申请实施例在确定字符边界时,可以根据不同语种设置超参对字符边界进行微调,保证字符边界识别的准确性,不仅适用于大部分语种,同时也支持针对符号的选择。第三,本申请实施例通过记录每个字符在待识别图片中的四个顶点坐标,并将四个顶点坐标按照分段区域分别映射至原始图片得到每个字符在原始图片中的坐标,使得坐标输出的颗粒度更细,在执行手动选择操作时,对于文字与文字、文字与符号之间的识别准确,可以快速地实现对相邻字符的分割,在实际应用中具有更强的普适性。第四,得益于OCR检测识别模型的鲁棒性,本申请实施例所提供的字符选择方法在自然场景下对字符的定位也具有更强的鲁棒性。第五,本申请实施例通过将识别得到的各个字符的坐标与原始图片中各个字符的第二坐标关联,并根据第二坐标绘制出相应的已识别出的文本区域,可以支持用户在上述已识别的文本区域中进行任意的选择和范围调整。第六,上述文字识别和选择场景还可以与翻译、扫题等功能相关联,待用户选定后,可以直接翻译选中的文字,或者解答识别到的题目;或者,通过解析用户选中的字符串,如果包含电话、邮箱等格式的文本内容,还可以提取出单独的卡片,方便用户直接拨打电话或发送邮件,具有较强的易用性和实用性。
参照图15,示出了本申请另一实施例提供的弯曲文本的字符选择方法的示意性步骤流程图,该方法具体可以包括如下步骤:
1501、控制终端设备采集原始图片,所述原始图片包含有待识别的弯曲文本。
需要说明的是,本实施例是从用户与手机之间的交互的角度,来对本申请实施例的字符选择方法进行的介绍,即本实施例描述的是用户在使用手机镜头进行文字选择操作时,用户与手机屏幕之间的交互过程。
本实施例中的终端设备可以是手机、平板电脑、智能镜头等设备,本实施例对终端设备的具体类型不作限定。
以终端设备为手机为例。用户可以通过打开手机相机,控制镜头对准某一图片或物体进行拍摄。通常,由于用户手持手机的过程中可能出现晃动,导致屏幕中呈现的图像也可能晃动,因此用户可以通过点击屏幕中的某一位置,将图像固定住,即定帧。在用户点击屏幕的那一刻,相机实时地采集到的图像,即该时刻呈现在屏幕中的图像,便是需要输入至OCR检测识别模型的原始图片。所述原始图片包含有待识别的弯曲文本内容,上述文本区域可以是一行或多行的,每行文本区域可能包含不同类型的字符或符号,本实施例对此不作限定。
1502、确定所述弯曲文本内容中每个字符在所述原始图片中的第二坐标;
在本申请实施例中,对于通过定帧得到的原始图片,OCR检测识别模型可以自动识别该图片中的各个文本区域,并输出与每个文本区域中每一行字符或符号相对应的CTC序列。 上述CTC序列可以提供给字符选择模型进行处理,得到每个字符在原始图片中的第二坐标。
由于采用OCR识别模型识别文本区域中的字符,输出相应的CTC序列,然后采用本实施例提供的字符选择模型对CTC序列进行处理,通过计算各个字符在CTC序列中的坐标,并将其映射为每个字符在原始图片中的第二坐标的过程,与前述实施例中步骤1101-1108类似,可以相互参阅,本实施例对此不再赘述。
1503、根据所述第二坐标,在所述原始图片中生成针对所述每个字符的字符选择控件;
在本申请实施例中,对于识别得到的原始图片中各个文本区域内每个字符的第二坐标,可以根据该第二坐标,在原始图片中生成字符选择控件。上述字符选择控件可以是将各个文本区域中的字符行绘制为第一颜色的高亮背景,以告知用户当前已识别出的文字范围。
In a specific implementation, for the character line corresponding to each CTC sequence, a highlighted background of the first colour covering all characters of the whole line may be drawn according to the already-obtained vertex coordinates of each character in that line; alternatively, the highlighted background covering the character line may be drawn according to the vertex coordinates of the first and last characters of that line.
根据各个字符在原始图片中的第二坐标绘制字符选择控件,可以参见图14a所示。
1504、当监听到针对目标字符的选择事件时,输出已选中的所述目标字符,所述目标字符为所述文本区域中的任意一个或多个字符。
在本申请实施例中,针对目标字符的选择事件可以是指用户在已绘制有字符选择控件的手机屏幕上,点击某个文本区域所产生的事件。例如,用户在图14a中绘制有高亮背景的任一文本区域内执行的点击操作。
If the user taps a text region, the line position of the current tap may first be calculated, the tapped text is then set to the selected state, and a text background of the second colour is drawn. As shown in FIG. 14b, the terminal device listens for the user's movement region; by calculating the start and end positions of the movement region, the contained region is set to the selected state and drawn as a dark highlighted background, and draggable handles appear at the beginning and end of the region, by which the user can modify the selected text range. Of course, for the tapped text, the background colour to be drawn may be determined according to actual needs, which is not limited in this embodiment.
另一方面,在被点击选中的文字的首尾位置处,还可以生成相应的可拖动的把手符号,用于提示用户可以通过拖动该把手修改选中的范围。
若用户点击其中一个把手并拖动,则可以根据被用户拖动的把手所处的位置,重新确定选中的文字范围。
For example, for the draggable handles initially at the first and last positions: if the user drags the left handle, the span between the dragged left handle's new position and the untouched right handle is taken as the newly determined selection; that is, the selection runs from the character to the right of the dragged left handle up to the last character of the line.
Alternatively, if the user drags the right handle, the span between the dragged right handle's new position and the untouched left handle is taken as the newly determined selection; that is, the selection runs from the first character of the line up to the character to the left of the dragged right handle.
Of course, the user may also drag the left and right handles one after the other; the newly determined selection is then all characters between the first character to the right of the left handle's current position and the first character to the left of the right handle's current position.
用户拖动左侧或右侧把手的动作可以反复操作,直到用户认为当前点选得到的文字已满足预期。待用户点选完成后,可以在手机界面中弹出文字卡片,将用户选中的文字内容展示出来。
在具体实现中,用户是否完成点选操作可以在用户拖动某个把手后,等待一定时长,例如1秒,若未再次监听到用户拖动某个把手的动作,则可以认为用户已完成选择操作。
如图16a-图16c所示,是采用上述字符选择方法选中字符时的实例效果展示示意图。在图16a中显示被选中的字符为“成本会”。用户可以通过拖动右侧把手,将其移动至字符“会”的左侧,呈现出如图16b所示的效果,此时被选中的字符为“成本”。若用户向右拖动右侧把手,将其移动至字符“计”的右侧,可以得到如图16c所示的效果,此时被选中的字符为“成本会计”。
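The drag behaviour of FIG. 16a–16c amounts to snapping the dragged handle to the nearest character and re-spanning the selection. A minimal Python sketch, with all names hypothetical:

```python
def nearest_char(px, py, centers):
    """Index of the character whose box centre is closest to the drag point."""
    d2 = [(cx - px) ** 2 + (cy - py) ** 2 for cx, cy in centers]
    return d2.index(min(d2))

def drag_handle(selection, handle, px, py, centers):
    """selection: (left_idx, right_idx) of the highlighted span.
    handle: 'left' or 'right', whichever the user is dragging."""
    i = nearest_char(px, py, centers)
    left, right = selection
    if handle == "left":
        left = min(i, right)   # the left handle may not pass the right one
    else:
        right = max(i, left)
    return (left, right)

# Dragging the right handle from "会" to just past "计" (FIG. 16c) extends
# the selection: (0, 2) -> (0, 3), i.e. "成本会" -> "成本会计".
```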
在本申请实施例中,文字卡片上还可以提供其他操作按钮,对于展示出来的文字,用户可以通过点击不同的按钮,指示与该按钮对应的应用程序根据所展示的文字,执行相应的操作。
如图17所示,是在用户选中目标字符后,输出已选中的目标字符的示意图。上述已选中的目标字符可以是对图14b或图16a-图16c中的字符选择控件的左右两个把手进行拖动后得到的。例如,在图16c的基础上,用户可以拖动左侧把手,将其拖动至字符“计”的左侧,从而得到如图17的选中效果,当前选中的目标字符即是“成本会”。在图17所示的手机界面中,对于输出的已选中的目标字符,用户可以通过点击相应的按钮,对上述目标字符进行翻译、搜索等处理。
在具体实现中,在监测到用户点击了翻译、搜索等按钮时,表示用户希望手机针对已选中的目标文字进行相应的处理。此时,手机可以调用可执行上述操作的应用程序。
需要说明的是,若手机中已安装的应用程序中仅有某一个应用程序可以执行用户希望处理的动作,则可以直接调用该应用程序进行处理。例如,在用户点击“翻译”按钮后,若手机中仅有某个词典类应用程序可以对上述目标字符进行翻译,则可以直接调用该程序。若手机中已安装的应用程序中有多个应用程序可以执行用户希望处理的动作,例如,在用户点击“搜索”按钮后,存在三个应用程序(应用A、应用B和应用C)均可以执行该操作,此时可以在手机界面中弹出如图18所示的界面,供用户选择某一应用执行该操作。对于选定的某一应用,用户还可以将其设置为默认应用程序。后续在执行相同的处理时,可以直接调用默认应用程序执行该操作。
In this embodiment of the application, the existing convolutional neural network (CNN) structure of the OCR detection-and-recognition model is reused and, based on the CNN's 8× downsampling, a mapping is established from the indices of the CTC sequence output by the detection-and-recognition model to the coordinates in the original picture; the model used throughout is simple and requires little annotated data, and, compared with prior-art solutions that require the original or rectified image or that traverse every pixel, this embodiment can work with pictures detected or captured in real time, computes quickly, and can be adapted and deployed to existing OCR models quickly and conveniently. Second, when determining character boundaries, hyper-parameters can be set per language to fine-tune the boundaries, ensuring accurate boundary recognition; this applies to most languages and also supports the selection of symbols. Third, by recording the four vertex coordinates of each character in the to-be-recognized picture and mapping the four vertices to the original picture segment by segment to obtain each character's coordinates in the original picture, this embodiment outputs coordinates at a finer granularity; during manual selection, recognition between characters, and between characters and symbols, is accurate, adjacent characters can be segmented quickly, and the approach is more universally applicable in practice. Fourth, thanks to the robustness of the OCR detection-and-recognition model, the character selection method provided in this embodiment also positions characters more robustly in natural scenes. Fifth, by associating the coordinates of each recognized character with the second coordinates of each character in the original picture and drawing the corresponding recognized text regions according to the second coordinates, this embodiment allows the user to make arbitrary selections and range adjustments within those recognized text regions. Sixth, the above text recognition and selection scenario can also be linked to functions such as translation and question scanning: once the user has made a selection, the selected text can be translated directly, or a recognized exercise can be solved; alternatively, by parsing the string selected by the user, if it contains text in telephone-number or e-mail format, a separate card can be extracted so that the user can directly place a call or send an e-mail, which offers strong usability and practicality.
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
对应于上文实施例所述的弯曲文本的字符选择方法,图19示出了本申请实施例字符选择装置的结构框图,为了便于说明,仅示出了与本申请实施例相关的部分。
本申请实施例中该字符选择装置1900包括:显示模块1901、处理模块1902、存储模块1903,其中显示模块1901、处理模块1902、存储模块1903通过总线连接。字符选择装置1900可以是上述方法实施例中的终端设备,也可以配置为终端设备内的一个或多个芯片。字符选择装置1900可以用于执行上述方法实施例中的终端设备的部分或全部功能。
该显示模块1901,用于在终端设备的显示界面中显示原始图片,所述原始图片中包含弯曲文本;
处理模块1902,用于检测所述原始图片,生成包含有直线文本的待识别图片,所述直线文本的文本内容与所述弯曲文本的文本内容一一对应;根据所述待识别图片,识别所述直线文本的所述文本内容,获得与所述直线文本的所述文本内容相对应的连接时序分类序列,所述连接时序分类序列包括了多个字符;计算所述多个字符中每个字符在所述待识别图片中的第一坐标;确定所述多个字符中每个字符对应的所述第一坐标所处的分段区域;根据所述原始图片和所述待识别图片,确定将所述原始图片变换为所述待识别图片时各个分段区域对应的透视变换矩阵;将所述多个字符中每个字符的第一坐标与所述透视变换矩阵相乘,得到所述多个字符中每个字符在所述原始图片中的第二坐标;检测到用户在所述原始图片上的第一操作,所述第一操作用于对所述原始图片上的弯曲文本中的字符进行选择;
该显示模块1901,用于根据所述多个字符中每个字符在所述原始图片中的第二坐标突出显示被选中的字符。
可选的,该处理模块1902,还用于根据所述多个字符中每个字符对应的所述第二坐标,在所述原始图片中生成第一提示信息,所述第一提示信息用于指示用户可对所述原始图片 中的字符进行选择。
可选的,该处理模块1902,具体用于根据所述待识别图片,识别所述直线文本的所述文本内容,获得初始连接时序分类序列;确定所述初始连接时序分类序列的长度,以及确定所述待识别图片的图片宽度;若所述初始连接时序分类序列的长度与预设的下采样倍数的乘积大于所述待识别图片的图片宽度,则对所述初始连接时序分类序列进行裁剪,获得与所述直线文本的所述文本内容相对应的连接时序分类序列;其中,裁剪后获得的所述连接时序分类序列的长度与预设的下采样倍数的乘积小于或等于所述待识别图片的图片宽度。
可选的,该处理模块1902,具体用于依次裁剪所述初始连接时序分类序列的头部元素或尾部元素;当裁剪任一头部元素或尾部元素后,计算裁剪后的初始连接时序分类序列的长度与预设的下采样倍数的乘积是否小于或等于所述待识别图片的图片宽度;若裁剪后的初始连接时序分类序列的长度与预设的下采样倍数的乘积小于或等于所述待识别图片的图片宽度,则停止剪裁,输出与所述直线文本的所述文本内容相对应的连接时序分类序列。
可选的,该处理模块1902,具体用于确定所述连接时序分类序列中所述多个字符中每个字符的字符边界坐标,所述字符边界坐标包括左边界坐标和右边界坐标;根据所述多个字符中每个字符的字符边界坐标,计算所述多个字符中每个字符在所述待识别图片中的第一坐标。
可选的,该处理模块1902,具体用于针对所述连接时序分类序列中任一字符,获取所述字符的原始右边界坐标,以及下一字符的原始左边界坐标;计算所述原始右边界坐标与所述原始左边界坐标的平均值;基于所述平均值,确定所述字符的右边界坐标,以及下一字符的左边界坐标。
可选的,该处理模块1902,具体用于分别确定所述字符的第一字符类型和所述下一字符的第二字符类型,所述第一字符类型和所述第二字符类型分别具有相应的偏移量;计算所述平均值减去所述第一字符类型对应的偏移量,得到第一差值,将所述第一差值作为所述字符的右边界坐标;计算所述平均值加上所述第二字符类型对应的偏移量,得到第二和值,将所述第二和值作为所述下一字符的左边界坐标。
可选的,所述每个字符在所述待识别图片中的第一坐标包括第一顶点坐标、第二顶点坐标、第三顶点坐标以及第四顶点坐标;该处理模块1902,具体用于将所述多个字符中每个字符的左边界坐标和右边界坐标分别与预设的下采样倍数相乘,得到所述多个字符中每个字符在所述待识别图片中的第一顶点坐标和第二顶点坐标;根据所述第一顶点坐标、第二顶点坐标以及所述待识别图片的图片高度,确定所述多个字符中每个字符在所述待识别图片中的第三顶点坐标和第四顶点坐标。
可选的,该处理模块1902,具体用于遍历所述每个字符的第一坐标以及所述待识别图片的分段区域,分别确定所述第一顶点坐标所处第一分段区域,所述第二顶点坐标所处第二分段区域,所述第三顶点坐标所处第三分段区域和所述第四顶点坐标所处第四分段区域。
可选的,该处理模块1902,具体用于将所述第一顶点坐标与所述第一分段区域的透视变换矩阵相乘得到第三坐标,将所述第二顶点坐标与所述第二分段区域的透视变换矩阵相 乘得到第四坐标,将所述第三顶点坐标与所述第三分段区域的透视变换矩阵相乘得到第五坐标,将所述第四顶点坐标与所述第四分段区域的透视变换矩阵相乘得到第六坐标;其中,所述第三坐标、所述第四坐标、所述第五坐标和所述第六坐标作为字符在所述原始图片中的第二坐标。
可选的,该处理模块1902,具体用于根据所述多个字符中每个字符对应的第二坐标,在所述原始图片中将所述多个字符中每个字符的字符区域绘制为第一颜色;或,根据所述多个字符中每个字符对应的第二坐标,在所述原始图片中为所述多个字符中每个字符的字符区域绘制文本框。
可选的,该处理模块1902,还用于当监听到所述用户在字符区域的点击事件时,将包含点击到的字符区域的整行字符区域绘制为第二颜色;监听所述用户在所述第二颜色对应的字符区域的拖动事件,根据所述拖动事件调整所述第二颜色所覆盖的字符区域;识别并展示所述第二颜色所覆盖的字符区域内的各个字符。
可选的,该处理模块1902,还用于当监听到所述用户在字符区域的点击事件时,将距离所述点击事件指示的点击位置最近的字符区域绘制为第二颜色;监听所述用户在所述第二颜色对应的字符区域的拖动事件,根据所述拖动事件调整所述第二颜色所覆盖的字符区域;识别并展示所述第二颜色所覆盖的字符区域内的各个字符。
可选的,该处理模块1902,还用于当监听到所述用户在字符区域的滑动事件时,将包含滑动事件指示的字符区域绘制为第二颜色;监听所述用户在所述第二颜色对应的字符区域的拖动事件,根据所述拖动事件调整所述第二颜色所覆盖的字符区域;识别并展示所述第二颜色所覆盖的字符区域内的各个字符。
该存储模块1903,用于保存字符选择装置必要的程序指令和数据。
应理解,上述图19对应实施例中终端设备的各模块之间所执行的流程与前述图3至图18中对应方法实施例中的终端设备执行的流程类似,具体此处不再赘述。
FIG. 20 shows a possible schematic structural diagram of a character selection apparatus 2000 in the foregoing embodiments; the apparatus may be configured as the aforementioned user equipment. The character selection apparatus 2000 may include a processor 2002, a computer-readable storage medium/memory 2003, a transceiver 2004, an input device 2005, an output device 2006, and a bus 2001, where the processor, the transceiver, the computer-readable storage medium, and the like are connected through the bus. This embodiment of the application does not limit the specific connection medium between the above components.
一个示例中,该输出设备2006显示原始图片,所述原始图片中包含弯曲文本;该处理器2002检测所述原始图片,生成包含有直线文本的待识别图片,所述直线文本的文本内容与所述弯曲文本的文本内容一一对应;
根据所述待识别图片,识别所述直线文本的所述文本内容,获得与所述直线文本的所述文本内容相对应的连接时序分类序列,所述连接时序分类序列包括了多个字符;计算所述多个字符中每个字符在所述待识别图片中的第一坐标;确定所述多个字符中每个字符对应的所述第一坐标所处的分段区域;根据所述原始图片和所述待识别图片,确定将所述原始图片变换为所述待识别图片时各个分段区域对应的透视变换矩阵;将所述多个字符中每个字符的第一坐标与所述透视变换矩阵相乘,得到所述多个字符中每个字符在所述原始图 片中的第二坐标;检测到用户在所述原始图片上的第一操作,所述第一操作用于对所述原始图片上的弯曲文本中的字符进行选择;该输出设备2006根据所述多个字符中每个字符在所述原始图片中的第二坐标突出显示被选中的字符。
一个示例中,处理器2002可以包括基带电路。收发器2004可以包括射频电路。
又一个示例中,处理器2002可以运行操作系统,控制各个设备和器件之间的功能。收发器2004可以包括基带电路和射频电路。
该收发器2004与该处理器2002可以实现上述图3至图18中任一实施例中相应的步骤,具体此处不做赘述。
可以理解的是,图20仅仅示出了字符选择装置的简化设计,在实际应用中,字符选择装置可以包含任意数量的收发器,处理器,存储器等,而所有的可以实现本申请的字符选择装置都在本申请的保护范围之内。
上述装置2000中涉及的处理器2002可以是通用处理器,例如CPU、网络处理器(network processor,NP)、微处理器等,也可以是ASIC,或一个或多个用于控制本申请方案程序执行的集成电路。还可以是数字信号处理器(digital signal processor,DSP)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。控制器/处理器也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,DSP和微处理器的组合等等。处理器通常是基于存储器内存储的程序指令来执行逻辑和算术运算。
上述涉及的总线2001可以是外设部件互连标准(peripheral component interconnect,简称PCI)总线或扩展工业标准结构(extended industry standard architecture,简称EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。为便于表示,图20中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
上述涉及的计算机可读存储介质/存储器2003还可以保存有操作系统和其他应用程序。具体地,程序可以包括程序代码,程序代码包括计算机操作指令。更具体的,上述存储器可以是ROM、可存储静态信息和指令的其他类型的静态存储设备、RAM、可存储信息和指令的其他类型的动态存储设备、磁盘存储器等等。存储器2003可以是上述存储类型的组合。并且上述计算机可读存储介质/存储器可以在处理器中,还可以在处理器的外部,或在包括处理器或处理电路的多个实体上分布。上述计算机可读存储介质/存储器可以具体体现在计算机程序产品中。举例而言,计算机程序产品可以包括封装材料中的计算机可读介质。
Alternatively, an embodiment of the present application further provides a general-purpose processing system, commonly referred to as a chip, which includes one or more microprocessors that provide the processor functions and an external memory that provides at least a part of the storage medium, all of which are connected to other supporting circuits through an external bus architecture. When the instructions stored in the memory are executed by the processor, the processor is caused to execute some or all of the steps performed by the character selection apparatus in the method embodiments of FIG. 3 to FIG. 18, and/or other processes of the techniques described in this application.
结合本申请公开内容所描述的方法或者算法的步骤可以硬件的方式来实现,也可以是由处理器执行软件指令的方式来实现。软件指令可以由相应的软件模块组成,软件模块可以被存放于RAM存储器、闪存、ROM存储器、EPROM存储器、EEPROM存储器、寄存器、硬 盘、移动硬盘、CD-ROM或者本领域熟知的任何其它形式的存储介质中。一种示例性的存储介质耦合至处理器,从而使处理器能够从该存储介质读取信息,且可向该存储介质写入信息。当然,存储介质也可以是处理器的组成部分。处理器和存储介质可以位于ASIC中。另外,该ASIC可以位于终端中。当然,处理器和存储介质也可以作为分立组件存在于字符选择装置中。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的范围。

Claims (16)

  1. 一种弯曲文本的字符选择方法,其特征在于,包括:
    在终端设备的显示界面中显示原始图片,所述原始图片中包含弯曲文本;
    检测所述原始图片,生成包含有直线文本的待识别图片,所述直线文本的文本内容与所述弯曲文本的文本内容一一对应;
    根据所述待识别图片,识别所述直线文本的所述文本内容,获得与所述直线文本的所述文本内容相对应的连接时序分类序列,所述连接时序分类序列包括了多个字符;
    计算所述多个字符中每个字符在所述待识别图片中的第一坐标;
    确定所述多个字符中每个字符对应的所述第一坐标所处的分段区域;
    根据所述原始图片和所述待识别图片,确定将所述原始图片变换为所述待识别图片时各个分段区域对应的透视变换矩阵;
    将所述多个字符中每个字符的第一坐标与所述透视变换矩阵相乘,得到所述多个字符中每个字符在所述原始图片中的第二坐标;
    检测到用户在所述原始图片上的第一操作,根据所述多个字符中每个字符在所述原始图片中的第二坐标突出显示被选中的字符,所述第一操作用于对所述原始图片上的弯曲文本中的字符进行选择。
  2. 根据权利要求1所述的方法,其特征在于,在所述将所述多个字符中每个字符的第一坐标与所述透视变换矩阵相乘,得到所述多个字符中每个字符在所述原始图片中的第二坐标之后,检测到用户在所述原始图片上的第一操作之前,所述方法还包括:
    根据所述多个字符中每个字符对应的所述第二坐标,在所述原始图片中生成第一提示信息,所述第一提示信息用于指示用户可对所述原始图片中的字符进行选择。
  3. 根据权利要求1或2所述的方法,其特征在于,所述根据所述待识别图片,识别所述直线文本的所述文本内容,获得与所述直线文本的所述文本内容相对应的连接时序分类序列包括:
    根据所述待识别图片,识别所述直线文本的所述文本内容,获得初始连接时序分类序列;
    确定所述初始连接时序分类序列的长度,以及确定所述待识别图片的图片宽度;
    若所述初始连接时序分类序列的长度与预设的下采样倍数的乘积大于所述待识别图片的图片宽度,则对所述初始连接时序分类序列进行裁剪,获得与所述直线文本的所述文本内容相对应的连接时序分类序列;
    其中,裁剪后获得的所述连接时序分类序列的长度与预设的下采样倍数的乘积小于或等于所述待识别图片的图片宽度。
  4. 根据权利要求3所述的方法,其特征在于,所述对所述初始连接时序分类序列进行裁剪,获得与所述直线文本的所述文本内容相对应的连接时序分类序列,包括:
    依次裁剪所述初始连接时序分类序列的头部元素或尾部元素;
    当裁剪任一头部元素或尾部元素后,计算裁剪后的初始连接时序分类序列的长度与预设的下采样倍数的乘积是否小于或等于所述待识别图片的图片宽度;
    若裁剪后的初始连接时序分类序列的长度与预设的下采样倍数的乘积小于或等于所述待识别图片的图片宽度,则停止剪裁,输出与所述直线文本的所述文本内容相对应的连接时序分类序列。
  5. 根据权利要求1至4中任一项所述的方法,其特征在于,所述计算所述多个字符中每个字符在所述待识别图片中的第一坐标包括:
    确定所述连接时序分类序列中所述多个字符中每个字符的字符边界坐标,所述字符边界坐标包括左边界坐标和右边界坐标;
    根据所述多个字符中每个字符的字符边界坐标,计算所述多个字符中每个字符在所述待识别图片中的第一坐标。
  6. 根据权利要求5所述的方法,其特征在于,所述确定所述连接时序分类序列中所述多个字符中每个字符的字符边界坐标包括:
    针对所述连接时序分类序列中任一字符,获取所述字符的原始右边界坐标,以及下一字符的原始左边界坐标;
    计算所述原始右边界坐标与所述原始左边界坐标的平均值;
    基于所述平均值,确定所述字符的右边界坐标,以及下一字符的左边界坐标。
  7. 根据权利要求6所述的方法,其特征在于,所述基于所述平均值,确定所述字符的右边界坐标,以及下一字符的左边界坐标,包括:
    分别确定所述字符的第一字符类型和所述下一字符的第二字符类型,所述第一字符类型和所述第二字符类型分别具有相应的偏移量;
    计算所述平均值减去所述第一字符类型对应的偏移量,得到第一差值,将所述第一差值作为所述字符的右边界坐标;
    计算所述平均值加上所述第二字符类型对应的偏移量,得到第二和值,将所述第二和值作为所述下一字符的左边界坐标。
  8. 根据权利要求5所述的方法,其特征在于,所述多个字符中每个字符在所述待识别图片中的第一坐标包括第一顶点坐标、第二顶点坐标、第三顶点坐标以及第四顶点坐标;
    所述根据所述多个字符中每个字符的字符边界坐标,计算所述多个字符中每个字符在所述待识别图片中的第一坐标,包括:
    将所述多个字符中每个字符的左边界坐标和右边界坐标分别与预设的下采样倍数相乘,得到所述多个字符中每个字符在所述待识别图片中的第一顶点坐标和第二顶点坐标;
    根据所述第一顶点坐标、第二顶点坐标以及所述待识别图片的图片高度,确定所述多个字符中每个字符在所述待识别图片中的第三顶点坐标和第四顶点坐标。
  9. 根据权利要求8所述的方法,其特征在于,所述确定所述第一坐标所处分段区域包括:
    遍历所述每个字符的第一坐标以及所述待识别图片的分段区域,分别确定所述第一顶点坐标所处第一分段区域,所述第二顶点坐标所处第二分段区域,所述第三顶点坐标所处第三分段区域和所述第四顶点坐标所处第四分段区域。
  10. 根据权利要求9所述的方法,其特征在于,所述将所述多个字符中每个字符的第一 坐标与所述透视变换矩阵相乘,得到所述多个字符中每个字符在所述原始图片中的第二坐标包括:
    将所述第一顶点坐标与所述第一分段区域的透视变换矩阵相乘得到第三坐标,将所述第二顶点坐标与所述第二分段区域的透视变换矩阵相乘得到第四坐标,将所述第三顶点坐标与所述第三分段区域的透视变换矩阵相乘得到第五坐标,将所述第四顶点坐标与所述第四分段区域的透视变换矩阵相乘得到第六坐标;
    其中,所述第三坐标、所述第四坐标、所述第五坐标和所述第六坐标作为字符在所述原始图片中的第二坐标。
  11. 根据权利要求2所述的方法,其特征在于,所述根据所述多个字符中每个字符对应的所述第二坐标,在所述原始图片中生成第一提示信息包括:
    根据所述多个字符中每个字符对应的第二坐标,在所述原始图片中将所述多个字符中每个字符的字符区域绘制为第一颜色;
    或,
    根据所述多个字符中每个字符对应的第二坐标,在所述原始图片中为所述多个字符中每个字符的字符区域绘制文本框。
  12. 根据权利要求11所述的方法,其特征在于,还包括:
    当监听到所述用户在字符区域的点击事件时,将包含点击到的字符区域的整行字符区域绘制为第二颜色;
    监听所述用户在所述第二颜色对应的字符区域的拖动事件,根据所述拖动事件调整所述第二颜色所覆盖的字符区域;
    识别并展示所述第二颜色所覆盖的字符区域内的各个字符。
  13. 根据权利要求11所述的方法,其特征在于,还包括:
    当监听到所述用户在字符区域的点击事件时,将距离所述点击事件指示的点击位置最近的字符区域绘制为第二颜色;
    监听所述用户在所述第二颜色对应的字符区域的拖动事件,根据所述拖动事件调整所述第二颜色所覆盖的字符区域;
    识别并展示所述第二颜色所覆盖的字符区域内的各个字符。
  14. 根据权利要求11所述的方法,其特征在于,还包括:
    当监听到所述用户在字符区域的滑动事件时,将包含滑动事件指示的字符区域绘制为第二颜色;
    监听所述用户在所述第二颜色对应的字符区域的拖动事件,根据所述拖动事件调整所述第二颜色所覆盖的字符区域;
    识别并展示所述第二颜色所覆盖的字符区域内的各个字符。
  15. 一种字符选择装置,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现如权利要求1至14任一项所述的方法。
  16. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其特征 在于,所述计算机程序被处理器执行时实现如权利要求1至14任一项所述的方法。
PCT/CN2021/115904 2020-10-31 2021-09-01 一种弯曲文本的字符选择方法、装置和终端设备 WO2022088946A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011199028.4 2020-10-31
CN202011199028.4A CN114529926A (zh) 2020-10-31 2020-10-31 一种弯曲文本的字符选择方法、装置和终端设备

Publications (1)

Publication Number Publication Date
WO2022088946A1 true WO2022088946A1 (zh) 2022-05-05

Family

ID=81381851

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/115904 WO2022088946A1 (zh) 2020-10-31 2021-09-01 一种弯曲文本的字符选择方法、装置和终端设备

Country Status (2)

Country Link
CN (1) CN114529926A (zh)
WO (1) WO2022088946A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115033169A (zh) * 2022-05-20 2022-09-09 长沙朗源电子科技有限公司 一种电子白板触摸屏书写擦除方法、设备及存储介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116630994B (zh) * 2023-07-19 2023-10-20 腾讯科技(深圳)有限公司 输入法中的笔势识别方法、相关装置和介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137349A1 (en) * 2016-11-14 2018-05-17 Kodak Alaris Inc. System and method of character recognition using fully convolutional neural networks
CN110427938A (zh) * 2019-07-26 2019-11-08 中科视语(北京)科技有限公司 一种基于深度学习的不规则文字识别装置和方法
CN110458167A (zh) * 2019-08-20 2019-11-15 浙江工业大学 一种金属件表面弯曲文本行矫正方法
CN111191649A (zh) * 2019-12-31 2020-05-22 上海眼控科技股份有限公司 一种识别弯曲多行文本图像的方法与设备
CN111223065A (zh) * 2020-01-13 2020-06-02 中国科学院重庆绿色智能技术研究院 图像矫正方法、不规则文本识别方法、装置、存储介质和设备

Also Published As

Publication number Publication date
CN114529926A (zh) 2022-05-24

Similar Documents

Publication Publication Date Title
CN113453040B (zh) 短视频的生成方法、装置、相关设备及介质
EP3547218B1 (en) File processing device and method, and graphical user interface
WO2022088946A1 (zh) 一种弯曲文本的字符选择方法、装置和终端设备
US11914850B2 (en) User profile picture generation method and electronic device
US20230353864A1 (en) Photographing method and apparatus for intelligent framing recommendation
EP4109879A1 (en) Image color retention method and device
US20230224574A1 (en) Photographing method and apparatus
WO2024021742A1 (zh) 一种注视点估计方法及相关设备
US20240187725A1 (en) Photographing method and electronic device
CN115689963A (zh) 一种图像处理方法及电子设备
KR20220093091A (ko) 라벨링 방법 및 장치, 전자 기기 및 기억 매체
CN115567783B (zh) 一种图像处理方法
US20230014272A1 (en) Image processing method and apparatus
US20230169785A1 (en) Method and apparatus for character selection based on character recognition, and terminal device
CN115883958A (zh) 一种人像拍摄方法
CN115707355A (zh) 图像处理方法、装置及存储介质
CN116757963B (zh) 图像处理方法、电子设备、芯片系统及可读存储介质
WO2023231696A1 (zh) 一种拍摄方法及相关设备
CN116091572B (zh) 获取图像深度信息的方法、电子设备及存储介质
WO2022100602A1 (zh) 在电子设备上显示信息的方法及电子设备
EP4210312A1 (en) Photographing method and electronic device
US20240305876A1 (en) Shooting Method and Electronic Device
US20240062392A1 (en) Method for determining tracking target and electronic device
WO2023072113A1 (zh) 显示方法及电子设备
US20240046504A1 (en) Image processing method and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21884677

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21884677

Country of ref document: EP

Kind code of ref document: A1