CN111507350B - Text recognition method and device - Google Patents

Text recognition method and device

Info

Publication number
CN111507350B
CN111507350B (application CN202010298644.9A)
Authority
CN
China
Prior art keywords
text
characters
character
identified
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010298644.9A
Other languages
Chinese (zh)
Other versions
CN111507350A (en)
Inventor
王皓
周宇超
康斌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010298644.9A
Publication of CN111507350A
Application granted
Publication of CN111507350B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/148: Segmentation of character regions
    • G06V30/153: Segmentation of character regions using recognition of characters or words
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)

Abstract

The embodiment of the invention discloses a text recognition method and a text recognition device. After a text to be recognized and the fonts it uses in a service platform are obtained, text characters that do not belong to a preset text character library are screened out of the text to be recognized, yielding special characters. The special characters are converted into images according to the fonts to obtain character images. An image recognition model recognizes the character images and screens candidate text characters similar to the character images from the preset text character library. A target text character corresponding to each special character is then determined from the candidate text characters according to the context information of the special character in the text to be recognized, and the text to be recognized is recognized based on the target text characters. The scheme can improve the accuracy of identifying junk text in the text to be recognized.

Description

Text recognition method and device
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a text recognition method and apparatus.
Background
In recent years, with the rapid development of internet technology, user-generated content (UGC) on the internet, especially text content, has kept increasing. The quality of this text content is uneven, and in order to purify the internet environment, junk text of poor content quality needs to be identified and intercepted. Existing text recognition technology mainly uses regular expressions and neural networks to recognize text content.
In the research and practice of the prior art, the inventors discovered that, for existing text recognition methods, junk text often contains special characters such as special symbols, glyphs, and near-form or near-sound words. These greatly reduce the accuracy with which regular expressions and neural networks recognize the text content, and therefore greatly reduce the accuracy of identifying junk text.
Disclosure of Invention
The embodiment of the invention provides a text recognition method and a text recognition device, which can improve the accuracy of identifying junk text.
A text recognition method, comprising:
acquiring a text to be identified and fonts used by the text to be identified in a service platform, wherein the text to be identified comprises a plurality of text characters;
selecting text characters which do not belong to a preset text character library from the text to be recognized to obtain special characters, and converting the special characters into images according to the fonts to obtain character images;
identifying the character images by adopting an image recognition model to screen candidate text characters similar to the character images from the preset text character library, wherein the image recognition model is trained on a plurality of character image samples, and the character image samples are images formed by converting text characters in the preset text character library according to different fonts;
determining a target text character corresponding to the special character from the candidate text characters according to the context information of the special character in the text to be recognized;
and identifying the text to be identified based on the target text character.
Correspondingly, an embodiment of the present invention provides a text recognition device, including:
an acquisition unit, used for acquiring a text to be recognized and the fonts used by the text to be recognized in a service platform, wherein the text to be recognized comprises a plurality of text characters;
the conversion unit is used for screening text characters which do not belong to a preset text character library from the text to be identified to obtain special characters, and converting the special characters into images according to the fonts to obtain character images;
the screening unit is used for identifying the character images by adopting an image identification model so as to screen candidate text characters similar to the character images from the preset text character library, the image identification model is trained by a plurality of character image samples, and the character image samples are images formed by converting text characters in the preset text character library according to different fonts;
the determining unit is used for determining a target text character corresponding to the special character from the candidate text characters according to the context information of the special character in the text to be recognized;
and the recognition unit is used for recognizing the text to be recognized based on the target text characters.
Optionally, in some embodiments, the filtering unit may be specifically configured to perform multi-scale feature extraction on the character image by using an image recognition model to obtain local feature information corresponding to different scales; fusing the local feature information to obtain global feature information of the character image; and screening one or more candidate text characters similar to the character image from the preset text character library according to the global characteristic information.
Optionally, in some embodiments, the determining unit may be specifically configured to, when only one candidate text character similar to the character image is screened out, use that candidate text character as the target text character corresponding to the special character; and when a plurality of candidate text characters similar to the character image are screened out, determine the target text character corresponding to the special character from the candidate text characters according to the context information of the special character in the text to be recognized.
Optionally, in some embodiments, the determining unit may be specifically configured to screen out a first adjacent text character of the special character from the text to be recognized according to the position information of the special character in the text to be recognized; determine association information between the first adjacent text character and each candidate text character; and determine the target text character corresponding to the special character from the candidate text characters according to the association information.
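A minimal Python sketch of this context-based selection step. The embedding table and the cosine-similarity criterion for "association information" are illustrative assumptions, not the patent's exact procedure; the embedding table stands in for a trained CBOW-style model (the CBOW model is shown in fig. 6).

```python
import numpy as np

def choose_target(candidates, neighbor_vec, embed):
    """Pick the candidate character whose embedding has the highest
    cosine similarity (association) with the first adjacent
    character's embedding. `embed` maps character -> vector and is a
    stand-in for a trained CBOW-style embedding table."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(candidates, key=lambda c: cos(embed[c], neighbor_vec))

# illustrative embeddings, not learned values
embed = {"com": np.array([1.0, 0.2]), "come": np.array([0.1, 1.0])}
neighbor = np.array([0.9, 0.1])  # e.g. embedding of the preceding "point"
target = choose_target(["com", "come"], neighbor, embed)
```

Here the candidate most associated with the adjacent character "point" would be selected as the target text character.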
Optionally, in some embodiments, the identifying unit may be specifically configured to replace the special characters among the text characters of the text to be recognized with the target text characters to obtain recognizable text characters of the text to be recognized, where the recognizable text characters can be recognized against the preset text character library; extract features from the recognizable text characters of the text to be recognized to obtain the text features of the text to be recognized; and recognize the text to be recognized according to its text features.
Optionally, in some embodiments, the identifying unit may be specifically configured to perform feature extraction on identifiable text characters of the text to be identified to obtain text features of the identifiable text characters; and fusing the text characteristics of the identifiable text characters to obtain the text characteristics of the text to be identified.
Optionally, in some embodiments, the identifying unit may be specifically configured to obtain location information of the identifiable text character in the text to be identified; screening out a second adjacent text character capable of identifying the text character from the text to be identified according to the position information; and extracting the characteristics of the second adjacent text characters to obtain the text characteristics of the identifiable text characters.
Optionally, in some embodiments, the identifying unit may be specifically configured to perform feature extraction on the identifiable text character to obtain an initial text feature of the identifiable text character; determining text features of the second adjacent text characters based on the initial text features of the recognizable text characters; and adjusting the initial text characteristics of the identifiable text characters based on the text characteristics of the second adjacent text characters to obtain the text characteristics of the identifiable text characters.
Optionally, in some embodiments, the identifying unit may be specifically configured to fuse text features of the identifiable text characters to obtain a first initial text feature of the text to be identified; screening text characters which are not repeated from the identifiable text characters; performing feature fusion on the text features of the text characters which are not repeated to obtain second initial text features of the text to be recognized; and splicing the first initial text feature and the second initial text feature to obtain the text feature of the text to be identified.
Optionally, in some embodiments, the identifying unit may be specifically configured to calculate a similarity between a text feature of the text to be identified and a text feature in a preset text feature library; and identifying the text to be identified according to the similarity.
Optionally, in some embodiments, the identifying unit may be specifically configured to segment text features of the text to be identified to obtain a plurality of sub-text features; clustering a target sub-text feature library corresponding to the sub-text feature in the preset text feature library; calculating initial similarity between the sub-text features and text features in the target sub-text feature library; and fusing the initial similarity to obtain the similarity between the text features of the text to be identified and the text features in a preset text feature library.
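The segmented similarity computation above might be sketched as follows; the equal split, the cosine measure, and averaging as the fusion rule are illustrative assumptions, with each sub-library standing in for a clustered target sub-text feature library.

```python
import numpy as np

def segmented_similarity(feat, sub_libraries):
    """Split the text feature into segments, compare each segment only
    against its matching sub-library (a stand-in for the clustered
    target sub-text feature libraries), and fuse the per-segment best
    cosine similarities by averaging."""
    segments = np.array_split(feat, len(sub_libraries))
    best = []
    for seg, lib in zip(segments, sub_libraries):
        sims = lib @ seg / (np.linalg.norm(lib, axis=1) * np.linalg.norm(seg))
        best.append(float(sims.max()))
    return sum(best) / len(best)

feat = np.array([1.0, 0.0, 0.0, 1.0])        # toy text feature
libs = [np.array([[1.0, 0.0], [0.0, 1.0]]),  # sub-library for segment 1
        np.array([[0.0, 1.0]])]              # sub-library for segment 2
score = segmented_similarity(feat, libs)
```

Comparing each segment only against its own sub-library keeps the number of similarity computations small relative to matching the full feature against the whole preset library.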
Optionally, in some embodiments, the identifying unit may be specifically configured to screen a preset number of target preset text features from the preset text feature library according to the similarity, where the preset text feature library includes preset normal text features and preset junk text features; when all the target preset text features are preset normal text features, determining that the text to be recognized is a normal text; when all the target preset text features are preset junk text features, determining that the text to be identified is junk text; when the preset normal text features and the preset junk text features exist in the target preset text features, normal text similarity and junk text similarity are selected from the similarity, the text to be identified is identified according to the normal text similarity and the junk text similarity, the normal text similarity is the similarity between the text features of the text to be identified and the preset normal text features, and the junk text similarity is the similarity between the text features of the text to be identified and the preset junk text features.
Optionally, in some embodiments, the identifying unit may be specifically configured to screen out a normal text similarity and a junk text similarity from the similarities; respectively weighting the normal text similarity and the junk text similarity to obtain a first weighted value of the normal text similarity and a second weighted value of the junk text similarity; when the first weighted value exceeds the second weighted value, determining that the text to be recognized is a normal text; and when the first weighted value does not exceed the second weighted value, determining that the text to be recognized is junk text.
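The decision rule described in the two paragraphs above can be sketched as follows. The weight values are illustrative placeholders for the first and second weighted values; the function names and data shapes are assumptions, not the disclosed implementation.

```python
def classify(similarities, labels, w_normal=1.0, w_junk=1.0):
    """similarities: top-K similarity scores against the preset text
    feature library; labels: 'normal' or 'junk' for each matched
    preset feature. Shortcut when all matches agree; otherwise
    compare weighted sums of the two similarity groups."""
    if all(l == "normal" for l in labels):
        return "normal"
    if all(l == "junk" for l in labels):
        return "junk"
    s_normal = w_normal * sum(s for s, l in zip(similarities, labels) if l == "normal")
    s_junk = w_junk * sum(s for s, l in zip(similarities, labels) if l == "junk")
    return "normal" if s_normal > s_junk else "junk"

label = classify([0.9, 0.8, 0.3], ["junk", "junk", "normal"])
```

With two strong junk matches and one weak normal match, the second weighted value dominates and the text would be classified as junk.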
Optionally, in some embodiments, the identifying unit may be specifically configured to intercept the text to be identified when the text to be identified is junk text.
In addition, the embodiment of the invention also provides electronic equipment, which comprises a processor and a memory, wherein the memory stores application programs, and the processor is used for running the application programs in the memory to realize the text recognition method provided by the embodiment of the invention.
In addition, the embodiment of the invention also provides a computer readable storage medium, which stores a plurality of instructions, wherein the instructions are suitable for being loaded by a processor to execute the steps in any text recognition method provided by the embodiment of the invention.
After obtaining a text to be recognized, which comprises a plurality of text characters, and the fonts it uses in a service platform, the embodiment of the invention screens out text characters that do not belong to a preset text character library to obtain special characters, converts the special characters into images according to the fonts to obtain character images, and recognizes the character images with an image recognition model to screen candidate text characters similar to the character images from the preset text character library. The image recognition model is trained on a plurality of character image samples, each sample being an image formed by converting text characters in the preset text character library according to different fonts. A target text character corresponding to each special character is then determined from the candidate text characters according to the context information of the special character, and the text to be recognized is recognized based on the target text characters. Because the scheme screens special characters out of the text to be recognized and converts them into text characters from the preset text character library, it can greatly improve the recognition accuracy of the text characters in the text to be recognized, and thereby greatly improve the accuracy of junk text identification.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of a text recognition scenario provided by an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a text recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of text to be recognized provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an image recognition model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the location of a first adjacent text character provided by an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a CBOW model provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of a target text feature provided by an embodiment of the present invention;
FIG. 8 is another flow chart of a text recognition method according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a text recognition device according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an identification unit of a text identification device according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of another structure of a text recognition device according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The embodiment of the invention provides a text recognition method, a text recognition device and a computer readable storage medium. The text recognition device can be integrated in an electronic device, and the electronic device can be a server or a terminal.
The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN) acceleration, big data, and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
For example, referring to fig. 1, take the case where the text recognition device is integrated in an electronic device. After the electronic device obtains a text to be recognized, which includes a plurality of text characters, and the fonts the text uses in a service platform, it screens out text characters that do not belong to a preset text character library to obtain special characters, converts the special characters into images according to the fonts to obtain character images, and uses an image recognition model to recognize the character images and screen candidate text characters similar to the character images from the preset text character library. The image recognition model is trained on a plurality of character image samples, which are images formed by converting text characters in the preset text character library according to different fonts. A target text character corresponding to each special character is determined from the candidate text characters according to the context information of the special character in the text to be recognized, and the text to be recognized is recognized based on the target text characters.
The text characters may include common Chinese characters, letters, punctuation marks, and special characters that are near in form or sound to common characters. Special characters are difficult to use in common text representations; that is, they cannot be directly recognized in text recognition. A special character may be a single character or a character group formed by several characters, such as "c ohm m" or "color ≡cyan".
The preset text character library comprises commonly used characters, such as single Chinese characters, common phrases formed from Chinese characters, common words formed from normal digits and letters, and common punctuation marks from the punctuation mark library.
The following describes the scheme in detail. The order of description of the following embodiments is not intended to limit which embodiments are preferred.
The present embodiment will be described from the perspective of a text recognition apparatus, which may be integrated in an electronic device; the electronic device may be a server, a terminal, or the like, where the terminal may include a tablet computer, a notebook computer, a personal computer (PC), and the like.
A text recognition method, comprising:
the method comprises the steps of obtaining a text to be recognized and fonts used by the text to be recognized in a service platform, wherein the text to be recognized comprises a plurality of text characters, screening text characters which do not belong to a preset text character library from the text to be recognized, obtaining special characters, converting the special characters into images according to the fonts, obtaining character images, recognizing the character images by adopting an image recognition model to screen candidate text characters similar to the character images from the preset text character library, training the image recognition model by a plurality of character image samples, converting the text characters in the preset text character library into images according to different fonts, determining target text characters corresponding to the special characters from the candidate text characters according to context information of the special characters in the text to be recognized, and recognizing the text to be recognized based on the target text characters.
As shown in fig. 2, the specific flow of the text recognition method is as follows:
101. and acquiring the text to be identified and the fonts used by the text to be identified in the service platform.
The text to be recognized may include a plurality of text characters, for example a plurality of Chinese characters, digits, or letters, or a character set formed from Chinese characters, digits, letters, and punctuation marks.
The service platform can be understood as the platform on which the text to be recognized is produced or displayed, such as a PC or a mobile phone. For example, if a user posts a piece of text to be recognized from a client on a mobile phone, the service platform to which that text belongs is the mobile phone.
There may be various ways to obtain the text to be recognized. For example, the text recognition device may obtain UGC text directly from the server of the social or instant messaging system to which the text was sent or uploaded. The device may obtain UGC text in real time: after a user successfully uploads UGC text, the server sends prompt information to the text recognition device, which extracts the uploaded UGC text from the server in real time for recognition; when the uploaded UGC text is judged to be junk text, it is intercepted, and after interception the social or instant messaging system will not publish it. UGC text sent or uploaded by users can also be obtained from the server at regular or periodic intervals, for example obtaining every 10 minutes the UGC text generated in the last 10 minutes, or obtaining UGC text on fixed dates, and setting it as text to be recognized. UGC text can also be obtained directly from the internet, for example by periodically or regularly crawling UGC text generated in the current time period and using it as text to be recognized. When the UGC text is obtained, the fonts used by the service platform on which it was produced and displayed can be obtained together with it. For example, if the UGC text was sent to the server by a client on a mobile phone, the service platform can be determined to be the mobile phone, and the font information used on it can be determined, such as the font type, font size, and font color, where the font type may be Song typeface, boldface, regular script, or another font.
The UGC text may be text posted, commented on, or forwarded by a user in the social or instant messaging system, or the user's nickname or identity information in that system.
In the process of obtaining texts to be recognized, a label can be attached to each obtained historical text to be recognized according to its recognition result. After a new text to be recognized is obtained, its label can first be compared with the labels of the historical texts; when the new text is identical to a historical text, it can be recognized directly from the historical recognition result. For example, when the historical recognition result is junk text, the newly obtained text can be directly determined to be junk text, and when the historical recognition result is normal text, the newly obtained text can be directly determined to be normal text.
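The label-reuse step above can be sketched in a few lines; the class name and the hash-based key are illustrative assumptions, not part of the disclosure.

```python
import hashlib

class RecognitionCache:
    """Label cache keyed by a hash of the raw text, so that a newly
    obtained text identical to a historical text reuses the
    historical recognition result."""

    def __init__(self):
        self._labels = {}  # digest -> "junk" or "normal"

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def lookup(self, text):
        # None means the text was never seen and must be recognized
        return self._labels.get(self._key(text))

    def store(self, text, label):
        self._labels[self._key(text)] = label

cache = RecognitionCache()
cache.store("free prizes, click here", "junk")
result = cache.lookup("free prizes, click here")
```

A hit skips the whole recognition pipeline; a miss falls through to steps 102 and onward, after which the new result is stored.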
102. And screening text characters which do not belong to a preset text character library from the text to be identified to obtain special characters, and converting the special characters into images according to fonts to obtain character images.
Special characters are characters that cannot be found in the preset text character library and are difficult to use in common text representations; that is, they cannot be directly recognized in text recognition. A special character may be a single character or a character group formed by several characters, such as "c ohm m" or "color ∈blue".
The character image is an image of the special character as it is displayed on the display platform (which may be a client); that is, the special character is converted into an image, and the image displays the special character.
For example, the text characters in the text to be recognized are compared and matched against the characters in the preset text character library, and the characters that do not belong to the preset text character library are screened out as special characters. Taking the text to be recognized shown in fig. 3 as an example, suppose the text characters containing special characters are "Mvp, seven nine, C ohm M" and "MVP292, point com, color ∈cyan". The individual characters are first matched against the preset character library one by one, and each may match successfully. In that case, the characters can be combined, and the combined character groups matched against the preset character library again; it can then be found that combinations such as "point com", "seven nine", "C ohm M", and "color ∈cyan" cannot be matched, so these combinations are taken as special characters. These special characters are near in form or sound to the commonly used phrases, composed of normal characters, that they replace.
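The first, per-character part of this screening can be sketched as follows (the function name and toy library are illustrative; the re-combination pass against a phrase library described above is omitted for brevity):

```python
def screen_special_characters(text, char_lib):
    """Flag single characters (ignoring whitespace) that are absent
    from the preset text character library. The patent additionally
    re-combines characters that matched individually and re-checks
    the groups against the library; that second pass is omitted."""
    return [ch for ch in text if not ch.isspace() and ch not in char_lib]

char_lib = set("colorcyanpointcm0123456789MVPmvp")  # toy library
specials = screen_special_characters("color ∈ cyan", char_lib)
```

Here only the symbol that the library does not contain survives the screen and is passed on to the image-conversion step.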
The special characters are converted according to the fonts. For example, a blank image of a preset size can be created; when the image size expected by the image recognition model is 512×512, the preset size is 512×512. Because character display differs between service platforms (such as a PC and a mobile phone), the special characters must be converted using the same font as the actual service platform. The font parameters required for the conversion are therefore determined from the obtained fonts of the text to be recognized in the service platform; the font parameters may include the font type, font size, and font color. The special characters are then added to the blank image and rendered according to the font parameters to obtain the character image.
103. And recognizing the character image by adopting an image recognition model so as to screen candidate text characters similar to the character image from a preset text character library.
The image recognition model is trained by a plurality of character image samples, and the character image samples are images formed by converting text characters in a preset text character library according to different fonts.
The candidate text characters can be understood as the text characters in the preset character library that are similar to the characters in the character image, or as a preliminary recognition result obtained by recognizing the special characters against the preset text character library. The recognition result can be one-to-one or one-to-many: one special character or character group may correspond to one recognized text character or character group, or to a plurality of recognized text characters or character groups.
For example, an image recognition model may be used to screen candidate text characters similar to the character image from the preset text character library. As shown in fig. 4, feature extraction modules composed of a convolution layer and a pooling layer perform multi-scale feature extraction on the character image to obtain local feature information at different scales; the number of feature extraction modules may be two or more and can be set according to the actual application. The local feature information at the different scales is linearly classified through a hidden layer, and the classified local feature information is fused to obtain the global feature information of the character image. The global feature information is then input to a fully connected layer for processing, which maps it onto the text characters in the preset text character library to obtain the probability that the characters in the character image are each text character in the library, from which the candidate text characters similar to the character image are screened out; there may be one or more candidate text characters.
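The pipeline above can be sketched in NumPy as follows. This is a toy illustration under heavy assumptions: two average-pooling scales stand in for the convolution/pooling modules, concatenation stands in for the fusion into a global feature, and a random linear layer plus softmax maps it onto a four-character stand-in library; none of the weights or the library are from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
LIBRARY = ["o", "0", "c", "m"]   # toy stand-in for the preset character library

def avg_pool(img, k):
    """Average-pool a 2-D image with a k×k window (one 'scale')."""
    h, w = img.shape
    return img[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def recognize(img, weights, top=2):
    # local features at two scales, flattened and fused by concatenation
    local = [avg_pool(img, k).ravel() for k in (2, 4)]
    global_feat = np.concatenate(local)           # "global feature information"
    logits = global_feat @ weights                # fully connected layer
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # per-character probabilities
    order = np.argsort(probs)[::-1][:top]
    return [LIBRARY[i] for i in order]            # candidate text characters

img = rng.random((8, 8))                          # toy "character image"
weights = rng.random(((8 // 2) ** 2 + (8 // 4) ** 2, len(LIBRARY)))
candidates = recognize(img, weights)
print(candidates)   # two candidate characters drawn from LIBRARY
```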
For character image recognition, optical character recognition (Optical Character Recognition, OCR) or other methods may be used to identify candidate text characters similar to the character image in a preset text character library.
The image recognition model can be set according to the requirements of the practical application. In addition, it should be noted that the image recognition model can be preset by maintenance personnel, or trained by the text recognition device itself; that is, before the step of performing multi-scale feature extraction on the character image with the image recognition model to obtain local feature information at different scales, the text recognition method may further include the following steps:
(1) A character image sample is acquired, wherein the character image sample comprises character images marked with text characters.
For example, text character images for the preset text character library may be collected as an original data set; for example, a plurality of handwritten letters, digit pictures, printed character pictures, and pictures of partially known near-shape characters may be used as the original data set, and the character images in the original data set are labeled to obtain the character image samples. For example, handwritten letters, digit pictures, printed character pictures, and pictures of partially known near-shape characters are obtained from a database or the network, and these character images are labeled with the corresponding text characters.
(2) And predicting text characters in a preset text character library corresponding to the character image sample by adopting a preset image recognition model to obtain a prediction result.
For example, the preset image recognition model is adopted to perform multi-scale feature extraction on a character image sample to obtain local feature information at different scales; the local feature information is fused to obtain the global feature information of the character image sample, and according to the global feature information, the text characters similar to the character image sample are predicted in the preset text character library.
(3) And converging the preset image recognition model according to the prediction result and the labeling result of the character image sample to obtain the image recognition model.
For example, in the embodiment of the present application, the preset image recognition model may be converged according to the prediction result and the labeling result by means of a loss function, so as to obtain the image recognition model. For example, the following may be specifically mentioned:
A Dice function (a loss function) is adopted to adjust, according to the prediction result and the labeling result of the character image sample, the parameters in the preset image recognition model used for screening text characters similar to the character image from the preset text character library, so as to obtain the image recognition model.
Alternatively, in order to improve the accuracy of character image recognition, besides the Dice function, other loss functions, such as cross entropy loss functions, may be used for convergence, which may be specifically as follows:
A cross entropy loss function is adopted to adjust, according to the prediction result and the labeling result of the character image sample, the parameters in the preset image recognition model used for screening text characters similar to the character image sample from the preset text character library, so as to obtain the image recognition model.
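Under the cross-entropy option above, a single convergence (parameter-adjustment) step might look like the following toy sketch. The linear classifier, learning rate, and random sample are illustrative assumptions standing in for the patent's network; only the softmax cross-entropy gradient update itself is standard.

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, feat_dim = 4, 20
W = np.zeros((feat_dim, n_classes))   # parameters to be adjusted

def cross_entropy_step(W, x, label, lr=0.1):
    """One gradient-descent update of a softmax classifier under
    cross-entropy loss; returns the adjusted parameters and the loss."""
    logits = x @ W
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss = -np.log(p[label])
    grad = np.outer(x, p)             # dL/dW = x ⊗ (softmax - one-hot)
    grad[:, label] -= x
    return W - lr * grad, loss

x = rng.random(feat_dim)              # toy global feature of one sample
W1, loss1 = cross_entropy_step(W, x, label=2)
_,  loss2 = cross_entropy_step(W1, x, label=2)
print(loss1, loss2)                   # the loss decreases: the model converges
```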
104. And determining a target text character corresponding to the special character from the candidate text characters according to the context information of the special character in the text to be recognized.
For example, when only one candidate text character similar to the character image is screened out, that candidate text character is used as the target text character corresponding to the special character; for example, when the character image corresponding to the characters "C ohm M" yields only the single candidate text character "com", "com" is used as the target text character corresponding to the special character "C ohm M".
When a plurality of candidate text characters similar to the character image are screened out, the target text character corresponding to the special character is determined from the candidate text characters according to the context information of the special character in the text to be identified, which may specifically be as follows:
(1) And acquiring the preceding information and the following information of the special characters in the text to be recognized.
The context information can be understood as the preceding information and the following information of the special character in the text to be recognized: the preceding information may be the information of the text characters before the special character, and the following information may be the information of the text characters after the special character. The context information can generally be obtained from the position information of the special character in the text to be recognized.
For example, the position information of the special character in the text to be recognized can be determined by sorting the text characters of the text to be recognized and locating the special character in the sorting result. For example, if the text to be recognized contains 10 text characters in total and, after sorting, the special character is in the third position, the position information of the special character is determined from the arrangement of the 10 characters; according to this position information, the information of the first 2 text characters and the last 7 text characters among the 10 text characters can be used as the context information of the special character in the text to be recognized.
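Step (1) can be sketched as follows: locate the special character in the ordered character sequence and split the remaining characters into preceding and following context, as in the 2-before/7-after example above. The character data is illustrative.

```python
def split_context(chars, special):
    """Locate the special character and return (preceding, following)
    context characters, per its position in the ordered sequence."""
    idx = chars.index(special)
    return chars[:idx], chars[idx + 1:]

chars = list("mv") + ["点"] + list("com")
before, after = split_context(chars, "点")
print(before, after)  # → ['m', 'v'] ['c', 'o', 'm']
```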
(2) And screening out the first adjacent text characters of the special characters from the text to be recognized according to the context information of the special characters in the text to be recognized.
The adjacent text characters are the characters whose distance from the special character is within a preset distance; colloquially, they can be understood as the context text characters of the special character.
For example, the first adjacent text characters of the special character are screened out from the text to be recognized according to the number of characters in a preset text box and the context information of the special character. For example, taking the context information of the special character as 8 text characters and the preset text box size as 5 text characters, the first adjacent text characters can be the first and second characters to the left and right of the special character among the 8 text characters, as shown in fig. 5.
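Step (2) can be sketched as a fixed window around the special character, matching the two-characters-left-and-right example above; the window size and sample characters are illustrative.

```python
def adjacent_chars(chars, special, window=2):
    """Screen out the first adjacent text characters: up to `window`
    characters on each side of the special character."""
    idx = chars.index(special)
    left = chars[max(0, idx - window):idx]
    right = chars[idx + 1:idx + 1 + window]
    return left + right

chars = list("mvp979") + ["点"] + list("com")
print(adjacent_chars(chars, "点"))  # → ['7', '9', 'c', 'o']
```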
(3) Association information of the first adjacent text character with the candidate text character is determined.
Wherein the association information may be understood as a degree of association of the first neighboring text character with the candidate text character.
For example, a language processing model may be used to determine the association information of the first adjacent text characters and the candidate text characters according to common combinations of letters, Chinese phrases, and the like. For example, if the candidate text characters for a special character are "0" and "o", the language processing model is used to calculate the association degree of the adjacent letters with "0" and with "o" to determine the association information; obviously, the association degree of the letters "c", "m", and "e" with the candidate text character "o" exceeds their association degree with the candidate text character "0".
(4) And determining the target text character corresponding to the special character from the candidate text characters according to the association information.
For example, the association degree of the first adjacent text characters with each candidate text character is determined according to the association information, and the candidate text character with the largest association degree with the first adjacent text characters is selected as the target text character. For example, when the candidate text characters of a special character are "0" and "o", and the first adjacent text characters of the special character are "c", "m", and "e", the association degree of the first adjacent text characters with the candidate text character "o" is larger, so the target text character corresponding to the special character is "o".
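Steps (3) and (4) can be sketched with a toy co-occurrence table standing in for the language processing model: score each candidate by how strongly it associates with the neighbouring characters, then pick the highest. The counts below are invented for illustration, not real corpus statistics.

```python
# Illustrative bigram counts (an assumption standing in for the
# language processing model's learned association degrees).
BIGRAM_COUNTS = {
    ("c", "o"): 90, ("o", "m"): 95,
    ("c", "0"): 2,  ("0", "m"): 1,
}

def pick_target(candidates, left, right):
    """Pick the candidate with the largest association degree to the
    adjacent characters on its left and right."""
    def score(c):
        return BIGRAM_COUNTS.get((left, c), 0) + BIGRAM_COUNTS.get((c, right), 0)
    return max(candidates, key=score)

print(pick_target(["o", "0"], left="c", right="m"))  # → 'o'
```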
105. And identifying the text to be identified based on the target text characters.
The text type may be the type corresponding to the text characters of the text to be recognized; for example, the text types may include normal text and junk text, where normal text is text with normal content and junk text is text with low content quality.
For example, special characters of text characters of the text to be recognized can be replaced by target text characters to obtain recognizable text characters of the text to be recognized, feature extraction is performed on the recognizable text characters of the text to be recognized to obtain text features of the text to be recognized, and the text to be recognized is recognized according to the text features of the text to be recognized. The method can be concretely as follows:
and C1, replacing special characters in text characters of the text to be recognized with target text characters to obtain recognizable text characters of the text to be recognized.
The recognizable text characters are the characters that can be recognized through the preset text character library; colloquially, all the recognizable text characters of the text to be recognized belong to the preset text character library.
For example, the special characters in the text characters of the text to be recognized are replaced with the target text characters to obtain the recognizable text characters of the text to be recognized. For example, if the text characters of the text to be recognized are "mvp979.C ohm M", the special characters are "C ohm M", and the target text characters are "com", then replacing the special characters "C ohm M" with "com" yields the recognizable text characters "mvp979.com".
And C2, extracting features of the identifiable text characters of the text to be identified to obtain text features of the text to be identified.
For example, feature extraction may be performed on identifiable text characters in the text to be identified to obtain text features of the identifiable text characters, and the text features of the identifiable text characters may be fused to obtain text features of the text to be identified, which may be specifically as follows:
(1) And extracting the characteristics of the identifiable text characters in the text to be identified to obtain the text characteristics of the identifiable text characters.
For example, second position information of the recognizable text characters in the text to be identified is obtained; for example, the recognizable text characters may be sorted, and the position of each text character determined from the sorting result to obtain the second position information, which may include the ranking information of each recognizable text character in the text to be identified. Second adjacent text characters of the recognizable text characters are then screened out from the text to be identified according to the second position information; for example, they are screened out according to the size of a preset text feature box together with the second position information. For example, if the preset text box size is 5 text characters, 4 text characters adjacent to each character are determined among the recognizable text characters: the second adjacent text characters of the first text character A can be the 4 text characters to its right (A+1, A+2, A+3, and A+4), and the second adjacent text characters of the second text character B can be the first text character to its left (B-1) and the 3 text characters to its right (B+1, B+2, and B+3).
Feature extraction is performed on the second adjacent text characters to obtain the text features of the recognizable text characters. For example, feature extraction may first be performed on the recognizable text characters to obtain their initial text features; for example, the Continuous Bag-of-Words (CBOW) model in word2vec, or one-hot encoding, can be used to obtain the initial text feature of each recognizable text character. The text features of the second adjacent text characters are then determined based on the initial text features of the recognizable text characters. For example, the structure of the CBOW model is shown in fig. 6; taking a preset text feature box size of 5 text characters and the initial text feature of one recognizable text character denoted w_i as an example, the second adjacent text characters w_{i-2}, w_{i-1}, w_{i+1}, and w_{i+2} are screened out of the initial features of the recognizable text characters. The initial text features of the recognizable text characters are then adjusted based on the text features of the second adjacent text characters to obtain the text features of the recognizable text characters. For example, the features of w_{i-2}, w_{i-1}, w_{i+1}, and w_{i+2} are summed and averaged, and the initial text feature of the recognizable text character is adjusted according to the average value, for example, replaced by the feature corresponding to the average value, to obtain its text feature.
Using CBOW in word2vec for feature extraction yields a better calculation speed. Other ways of extracting the initial features of the recognizable text characters can also be adopted; for example, a pre-trained language model such as ELMo (a language model) can be used to extract the features of the text characters.
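The CBOW-style adjustment above can be sketched as follows: a character's feature is taken as the average of its neighbours' vectors inside a 5-character window, mirroring the w_{i-2}..w_{i+2} averaging. The 2-d vectors are toy data, not learned embeddings.

```python
import numpy as np

def cbow_feature(vectors, i, half_window=2):
    """Average the vectors of the characters within the window around
    position i (excluding i itself), as in the CBOW adjustment."""
    idxs = [j for j in range(i - half_window, i + half_window + 1)
            if j != i and 0 <= j < len(vectors)]
    return np.mean([vectors[j] for j in idxs], axis=0)

# toy initial features for 5 consecutive characters
vecs = [np.array([float(k), 0.0]) for k in range(5)]
print(cbow_feature(vecs, 2))  # average of vectors 0, 1, 3, 4 → [2. 0.]
```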
(2) And carrying out feature fusion on the text features of the identifiable text characters to obtain the text features of the text to be identified.
For example, the text features of the recognizable text characters are fused to obtain a first initial text feature of the text to be identified; for example, the word vectors corresponding to the recognizable text characters are accumulated, and the resulting sentence vector of the text to be identified is used as its first initial text feature.
(3) And screening out mutually non-repeated text characters from the recognizable text characters, and performing feature fusion on the text features of the mutually non-repeated text characters to obtain a second initial text feature of the text to be identified.
For example, mutually non-repeated text characters are screened out of the recognizable text characters; for example, if the recognizable text characters of the text to be recognized are A, A, B, C, C, and D, the mutually non-repeated text characters A, B, C, and D are screened out. The text features of these non-repeated text characters are then fused; for example, taking A, B, C, and D as an example, the text features (i.e., word vectors) corresponding to A, B, C, and D are accumulated to obtain the second initial text feature of the text to be recognized.
(4) And splicing the first initial text feature and the second initial text feature to obtain the text feature of the text to be identified.
For example, the first initial text feature and the second initial text feature may be directly spliced to obtain a text feature of the text to be recognized, for example, the first initial text feature is a 100-dimensional sentence vector, the second initial text feature is an 80-dimensional sentence vector, and then the 100-dimensional sentence vector and the 80-dimensional sentence vector are directly spliced to obtain a 180-dimensional spliced sentence vector, and the 180-dimensional sentence vector is used as the text feature of the text to be recognized.
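Steps (2)–(4) can be sketched together: sum the per-character vectors into the first sentence feature, sum over the de-duplicated characters for the second, and concatenate both. The 2-d vectors are toy stand-ins for real word vectors.

```python
import numpy as np

# toy word vectors for four characters (illustrative, not learned)
VEC = {"A": np.array([1., 0.]), "B": np.array([0., 1.]),
       "C": np.array([1., 1.]), "D": np.array([2., 0.])}

def text_feature(chars):
    first = sum(VEC[c] for c in chars)                  # all characters
    second = sum(VEC[c] for c in dict.fromkeys(chars))  # de-duplicated, order kept
    return np.concatenate([first, second])              # spliced text feature

feat = text_feature(list("AABCCD"))
print(feat)  # → [6. 3. 4. 2.]
```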
And C3, identifying the text to be identified according to the text characteristics of the text to be identified.
For example, the similarity between the text features of the text to be recognized and the text features in the preset text feature library can be calculated, and the text to be recognized is recognized according to the similarity. The method can be concretely as follows:
(1) And calculating the similarity between the text features of the text to be identified and the text features in a preset text feature library.
The similarity may be cosine similarity between the text feature of the text to be recognized and the text feature in the preset text feature library, or may be understood as a distance between the text feature of the text to be recognized and the text feature in the preset text feature library.
The preset text feature library comprises a plurality of sub-text feature libraries of different types or categories, and each sub-text feature library comprises text features of known text type; the text types may include normal text and junk text.
For example, the text feature of the text to be recognized may be segmented to obtain multiple sub-text features; for example, the sentence vector of the text to be recognized may be segmented using the product quantization (Product Quantizer) in faiss (a search algorithm library). Taking the text feature of the text to be recognized as a 100-dimensional sentence vector as an example, it may be segmented into multiple sub-sentence vectors whose total dimension is 100. The target sub-text feature library corresponding to each sub-text feature is then found by clustering in the preset text feature library; through this clustering, the target sub-text feature library corresponding to each sub-text feature can be found quickly. The initial similarity between each sub-text feature and the text features in its target sub-text feature library is calculated; for example, the distance between the sub-text feature and each text feature in the target sub-text feature library can be determined according to the residual after clustering the text features in that library, yielding the initial similarity between the sub-text features of the text to be identified and the text features in the target sub-text feature library. The initial similarities are then fused to obtain the similarity between the text feature of the text to be identified and the text features in the preset text feature library.
For example, the text features in each target sub-text feature library may be classified according to the preset text types, for example, into a normal text feature set corresponding to normal text and a junk text feature set corresponding to junk text. The text features in the normal text feature set and the junk text feature set are sorted by the size of their initial similarity, and the sorted results from the target sub-text feature libraries are fused to obtain the similarity between the text feature of the text to be identified and the text features in the preset text feature library. For example, if the text feature of the text to be identified includes n sub-text features, the initial similarities of a preset normal text feature across the normal text feature sets corresponding to the n sub-text features are multiplied to obtain its similarity to the text to be identified, from which the maximum normal text similarity K1 in the preset text feature library can be obtained; the similarities of the preset junk text features are calculated in the same way. The similarities may thus include a normal text similarity to the normal text features and a junk text similarity to the junk text features.
Alternatively, the fusion can be done in another way: the preset normal text features and junk text features are segmented, and the segmented sub-normal text features and sub-junk text features are clustered to obtain a plurality of target sub-text feature libraries. For example, a preset normal text feature A is divided into 3 sub-normal text features A1, A2, and A3, and a preset junk text feature B is divided into 3 sub-junk text features B1, B2, and B3; the initial similarities of A1, A2, and A3 are multiplied to obtain the similarity between the preset normal text feature and the feature of the text to be identified, which may be called the normal text similarity, and the initial similarities of B1, B2, and B3 are multiplied to obtain the similarity between the preset junk text feature and the feature of the text to be identified, which may be called the junk text similarity. It should be noted that if a sub-text feature of a preset normal text feature has no similar sub-text feature in the text to be recognized, that sub-text feature may be discarded; for example, if the preset normal text feature A includes the 3 sub-normal text features A1, A2, and A3, but the similarity between the sub-text feature A2 and the sub-text features of the text to be recognized is 0, then only the initial similarities of A1 and A3 are multiplied when calculating the similarity between A and the text feature of the text to be recognized. The same calculation can be applied to the preset junk text features.
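The segmented similarity above can be sketched in plain NumPy: split both feature vectors into sub-vectors, compute a per-segment cosine similarity (standing in for the faiss product-quantization lookup against the clustered sub-libraries), and fuse the initial similarities by multiplication. The vectors and segment count are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def segmented_similarity(x, y, segments=2):
    """Split x and y into sub-vectors, score each segment pair, and
    fuse the per-segment initial similarities by multiplication."""
    xs, ys = np.array_split(x, segments), np.array_split(y, segments)
    result = 1.0
    for a, b in zip(xs, ys):
        result *= cosine(a, b)
    return result

x = np.array([1., 0., 0., 1.])
y = np.array([1., 0., 0., 1.])
print(segmented_similarity(x, y))  # → 1.0 for identical vectors
```

In faiss itself this corresponds to a product-quantized index, where each segment's similarity comes from a codebook lookup rather than an exact cosine.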
(2) And identifying the text to be identified according to the similarity.
For example, a preset number of target preset text features are screened out of the preset text feature library according to the similarity; for example, target text features whose similarity falls within a certain range can be screened out, such as target text features whose similarity exceeds a preset similarity threshold, or the target text features whose similarity ranks in the top 5. When the target text features are all normal text features, the text to be recognized can be determined to be normal text; for example, taking 5 target text features as an example, if all 5 are preset normal text features, the text to be recognized is determined to be normal text. When the target text features are all preset junk text features, the text to be identified can be determined to be junk text; for example, if all 5 target text features are preset junk text features, the text to be identified is determined to be junk text.
Optionally, when both preset normal text features and preset junk text features exist among the target preset text features, the normal text similarities and the junk text similarities are screened out of the similarities. For example, the similarities between the text feature of the text to be identified and the preset normal text features among the target text features are screened out and called the normal text similarities; similarly, the similarities with the preset junk text features among the target text features are screened out and called the junk text similarities. The text to be identified is then recognized according to the normal text similarities and the junk text similarities. For example, the normal text similarities and the junk text similarities are weighted respectively to obtain a first weighted value of the normal text similarities and a second weighted value of the junk text similarities; for example, as shown in fig. 7, if the target text features include 2 preset normal text features and 3 preset junk text features, the similarities of the 2 preset normal text features are weighted according to their corresponding weights to obtain the first weighted value, and the similarities of the 3 preset junk text features are weighted according to their corresponding weights to obtain the second weighted value.
By comparing the two weighted values, the recognition result of the text to be recognized can be determined, for example, when the first weighted value exceeds the second weighted value, the text to be recognized can be determined to be normal text, and when the first weighted value does not exceed the second weighted value, the text to be recognized can be determined to be junk text.
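The comparison of the two weighted values can be sketched as follows; the neighbour list, labels, and unit weights are toy assumptions, with the class whose weighted total is larger winning, and ties resolved toward junk as in the "does not exceed" rule above.

```python
def classify(neighbours, weights=None):
    """neighbours: list of (similarity, label) pairs, label 'normal' or
    'spam'. Returns the label whose weighted similarity total is larger;
    ties go to 'spam' ('does not exceed' → junk text)."""
    weights = weights or {"normal": 1.0, "spam": 1.0}
    score = {"normal": 0.0, "spam": 0.0}
    for sim, label in neighbours:
        score[label] += weights[label] * sim   # first / second weighted value
    return "normal" if score["normal"] > score["spam"] else "spam"

top5 = [(0.9, "normal"), (0.8, "normal"),
        (0.7, "spam"), (0.6, "spam"), (0.5, "spam")]
print(classify(top5))  # → 'spam' (1.7 normal vs 1.8 spam)
```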
Optionally, when the text to be identified is junk text, it needs to be intercepted, and the specific interception may take various forms. For example, if the text to be identified has already been published on the Internet, it needs to be recalled and deleted; if it has only been uploaded to a server or database and has not yet been published on the Internet, it can be intercepted and deleted directly on the server or in the database. During the interception, the identity of the terminal that sent or published the text to be identified can be obtained, the other text information published by the terminal corresponding to that identity can be re-checked, and the proportion of published junk text among all its text information can be determined from the checking result; if this proportion exceeds a preset proportion threshold, interception measures such as prohibiting the terminal corresponding to that identity from publishing text information can be taken.
Optionally, when the text to be identified is normal text, it does not need to be intercepted and is allowed to pass, so it can be published to the Internet or other social platforms, or continue to be displayed on the Internet or other social platforms.
As can be seen from the foregoing, in the embodiment of the present application, after the text to be recognized and the font used by the text to be recognized in the service platform are obtained, where the text to be recognized includes a plurality of text characters, the text characters that do not belong to the preset text character library are screened out of the text to be recognized to obtain the special characters; the special characters are converted into an image according to the font to obtain a character image; an image recognition model is used to recognize the character image so as to screen out candidate text characters similar to the character image from the preset text character library, the image recognition model having been trained on a plurality of character image samples, which are images converted from the text characters in the preset text character library according to different fonts; the target text character corresponding to the special characters is determined among the candidate text characters according to the context information of the special characters; and the text to be recognized is recognized based on the target text character. With this scheme, the special characters can be screened out of the text to be identified and converted into characters from the preset text character library, so the recognition accuracy of the text characters in the text to be identified can be greatly improved, and the accuracy of junk text recognition is thereby greatly improved.
According to the method described in the above embodiments, examples are described in further detail below.
In this embodiment, the text recognition device is specifically integrated in an electronic device, and the electronic device is specifically a server.
As shown in fig. 8, a text recognition method specifically includes the following steps:
201. The server acquires the text to be identified and the fonts used by the text to be identified in the service platform.
For example, when a user uploads or sends text to be identified to a database of a social or instant messaging system through that system, the server can obtain the UGC text uploaded by the user from the database in real time. The server can also periodically acquire UGC texts sent or uploaded by users from the database, or periodically crawl UGC texts generated in the current time period directly from the internet, and use these UGC texts as the texts to be identified. When the UGC text is acquired, the fonts used in the service platform on which the UGC text was produced and displayed can be acquired together. For example, when the UGC text is sent to the server by a client on a mobile phone, the service platform to which the UGC text belongs can be determined to be the mobile phone, and the font information used on the mobile phone is determined.
202. The server screens text characters which do not belong to a preset text character library from the text to be identified, so as to obtain special characters, and converts the special characters into images according to fonts, so as to obtain character images.
For example, taking the text to be recognized shown in fig. 3, whose text characters are "Mvp Jiujiu.c ohm M" and "MVP292 @ point com ∈cyan" respectively, as an example, the server matches the individual characters obtained by splitting the text against the preset character library and finds that the individual characters can all be matched successfully. In this case, the characters can be combined, and the combined characters can be matched against the preset character library again; it can then be found that combined characters such as "point com", "Jiujiu", "C ohm M" and "color ∈cyan" cannot be matched, so these combined characters can be used as special characters.
The server converts the special characters into images according to fonts, for example, the server takes the image size which is required to be input by an image recognition model as a preset size, constructs a blank image corresponding to the preset size, determines font parameters required by the special character conversion according to the acquired fonts of the text to be recognized in the service platform, adds the special characters into the blank image, and renders the special characters added into the blank image according to the font parameters to obtain the character image.
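The image construction in step 202 can be sketched as follows. This is a schematic stand-in only: a real implementation would rasterize the special characters with the actual platform font (for example via a rendering library), whereas the 3x3 glyph bitmaps below are invented for illustration.

```python
# Sketch of step 202: build a blank image of the model's input size and
# "render" the special characters into it. The glyph table stands in for
# the font parameters determined from the service platform (hypothetical).
GLYPHS = {
    "o": ["###", "# #", "###"],   # toy bitmap for 'o' (hollow centre)
    "0": ["###", "###", "###"],   # toy bitmap for '0' (filled)
}

def render_char_image(text, width=16, height=3):
    image = [[0] * width for _ in range(height)]    # blank image
    x = 0
    for ch in text:
        glyph = GLYPHS.get(ch)
        if glyph is None:
            x += 4                                   # unknown char: leave blank
            continue
        for row, line in enumerate(glyph):           # stamp the glyph bitmap
            for col, pixel in enumerate(line):
                if pixel == "#" and x + col < width:
                    image[row][x + col] = 1
        x += 4                                       # advance the pen position
    return image

img = render_char_image("o0")
```

The fixed `width`/`height` play the role of the preset size required by the image recognition model's input.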
203. The server adopts an image recognition model to recognize the character image so as to screen candidate text characters similar to the character image from a preset text character library.
For example, the server performs multi-scale feature extraction on the character image by using the image recognition model to obtain local feature information corresponding to different scales. For instance, feature extraction modules each consisting of a convolution layer and a pooling layer are used to perform the multi-scale feature extraction; the number of feature extraction modules can be two or more, and the specific number can be set according to the practical application. The local feature information corresponding to the different scales is linearly classified through a hidden layer, and the classified local feature information is fused to obtain the global feature information corresponding to the character image. The global feature information is then input to a fully connected layer for processing, which maps it to the text characters in the preset text character library, yielding the probability that the character in the character image is each text character in the library; candidate text characters similar to the character image are selected accordingly, and there may be one or more candidate text characters.
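The pipeline above can be illustrated with a minimal sketch. This is not the patented model itself: the two pooling scales stand in for the convolution-and-pooling feature extraction modules, the final layer uses untrained random weights, and the four-character library is invented for the example.

```python
import math
import random

random.seed(0)
LIBRARY = ["o", "0", "c", "m"]            # hypothetical preset character library

def pool(image, k):
    """Average-pool with a k x k window: one scale of local feature information."""
    h, w = len(image), len(image[0])
    return [sum(image[r + i][c + j] for i in range(k) for j in range(k)) / (k * k)
            for r in range(0, h - k + 1, k) for c in range(0, w - k + 1, k)]

def recognize(image):
    local_fine = pool(image, 1)           # fine-scale module
    local_coarse = pool(image, 2)         # coarse-scale module
    global_feat = local_fine + local_coarse   # fuse into global feature info
    # fully connected layer mapping the global feature to library characters
    W = [[random.uniform(-1, 1) for _ in global_feat] for _ in LIBRARY]
    logits = [sum(w * x for w, x in zip(row, global_feat)) for row in W]
    exps = [math.exp(v) for v in logits]  # softmax -> per-character probability
    probs = [e / sum(exps) for e in exps]
    return dict(zip(LIBRARY, probs))

probs = recognize([[1, 0, 1, 1], [0, 1, 0, 0], [1, 0, 1, 1], [0, 0, 1, 0]])
```

Candidate text characters would then be the characters whose probability passes a threshold, so there may be one or several candidates.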
For character image recognition, optical character recognition (Optical Character Recognition, OCR) or other methods may be used to identify candidate text characters similar to the character image in a preset text character library.
The image recognition model can be set according to the requirements of practical applications. In addition, it should be noted that the image recognition model can be preset by maintenance personnel, or can be trained by the text recognition device. That is, before the step of performing multi-scale feature extraction on character images by using the image recognition model to obtain local feature information corresponding to different scales, the text recognition method may further include the following steps:
(1) The server collects a character image sample, wherein the character image sample comprises character images marked with text characters.
For example, the server may collect character images of the text characters in the preset text character library as an original data set; for instance, a number of handwritten letters, digit pictures, printed character pictures and pictures of partially known near-shape characters may be used as the original data set, and the character images in the original data set are labeled to obtain the character image samples. For example, handwritten letters, digit pictures, printed character pictures and pictures of known near-shape characters are obtained from a database or the network, and the corresponding text characters are marked on these character images.
(2) And predicting text characters in a preset text character library corresponding to the character image by using the preset image recognition model by the server to obtain a prediction result.
For example, a preset image recognition model is adopted to conduct multi-scale feature extraction on a character image sample, local feature information corresponding to different scales is obtained, the local feature information is fused, global feature information of the character image sample is obtained, and predicted text characters similar to the character image sample are predicted in a preset text character library according to the global feature information.
(3) And the server converges the preset image recognition model according to the prediction result and the labeling result of the character image sample to obtain the image recognition model.
For example, in the embodiment of the present application, the preset image recognition model may be converged according to the prediction result and the labeling result through a loss function, so as to obtain the image recognition model. Specifically, this may be as follows:
A Dice function is used to adjust, according to the prediction result and the labeling result of the character image sample, the parameters in the preset image recognition model used for screening text characters similar to the character image from the preset text character library, so as to obtain the image recognition model.
Alternatively, in order to improve the accuracy of character image recognition, loss functions other than the Dice function, such as a cross entropy loss function, may also be used for convergence, which may specifically be as follows:
A cross entropy loss function is used to adjust, according to the prediction result and the labeling result of the character image sample, the parameters in the preset image recognition model used for screening text characters similar to the character image sample from the preset text character library, so as to obtain the image recognition model.
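The two loss functions named above can be written out in a toy form; this is a sketch for illustration, not the exact patented training code, computed between a predicted probability distribution and the one-hot labeling of a character image sample.

```python
import math

def dice_loss(pred, target):
    """Dice loss between a predicted distribution and a one-hot label."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1 - 2 * inter / (sum(pred) + sum(target))

def cross_entropy_loss(pred, target, eps=1e-12):
    """Cross entropy loss; eps guards against log(0)."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))

target = [0.0, 1.0, 0.0]          # sample labeled as library character index 1
good = [0.05, 0.9, 0.05]          # prediction close to the labeling result
bad = [0.8, 0.1, 0.1]             # prediction far from the labeling result
```

During convergence, the model parameters are adjusted so that predictions move from `bad`-like toward `good`-like distributions, i.e. toward lower loss.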
204. And the server determines a target text character corresponding to the special character from the candidate text characters according to the context information of the special character in the text to be identified.
For example, when the number of candidate text characters screened out to be similar to the character image is one, the server regards the candidate text characters as target text characters corresponding to the special characters.
When the number of the candidate text characters which are screened and similar to the character image is a plurality of, the server determines a target text character corresponding to the special character from the candidate text characters according to the context information of the special character in the text to be identified, wherein the target text character can be specifically as follows:
(1) And the server acquires the text information of the special characters in the text to be recognized.
For example, taking a text to be recognized containing 10 text characters as an example, the server sorts the 10 text characters of the text to be recognized; suppose the special character ranks third in the sorting result. The server determines the position information of the special character in the text to be recognized according to the sorting result, i.e., the special character is at the third position of the text to be recognized. According to the position information of the special character, the information of the first 2 text characters and the last 7 text characters among the 10 text characters can be used as the context information of the special character in the text to be recognized.
(2) The server screens out the first adjacent text characters of the special characters from the text to be recognized according to the context information of the special characters in the text to be recognized.
For example, taking the case where the context information of the special character is 8 text characters and the preset adjacent-text window size is 5 text characters as an example, the server determines that the first adjacent text characters can be the first and second characters to the left and to the right of the special character among the 8 text characters.
(3) The server determines association information of the first adjacent text character with the candidate text character.
For example, the server may determine the association information of the first adjacent text characters and the candidate text characters by using a language processing model, based on common combinations of letters, Chinese phrases and the like. Suppose the candidate text characters of the special character are "0" and "o", and the first adjacent text characters of the special character are "c", "m" and "e"; the language processing model calculates the degree of association between these letters and "0" or "o" to determine the association information. It is then obvious that the degree of association between the letters "c", "m" and "e" and the candidate text character "o" exceeds the degree of association between those letters and the candidate text character "0".
(4) And the server determines a target text character corresponding to the special character from the candidate text characters according to the association information.
For example, the server determines, according to the association information, the degree of association between the first adjacent text characters and each of the candidate text characters, screens out the candidate text character with the highest degree of association with the first adjacent text characters, and uses it as the target text character. For instance, when the candidate text characters are "0" and "o" and the first adjacent text characters of the special character are "c", "m" and "e", it may be determined that the degree of association between "c", "m" and "e" and the candidate text character "o" is greater; therefore, the target text character corresponding to the special character is "o".
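The selection in step 204 can be sketched as a simple scoring function. The co-occurrence counts below are invented for the "o" versus "0" example; a real system would obtain the association degrees from a trained language processing model rather than a hand-written table.

```python
# Hypothetical association table between an adjacent character and a
# candidate character (stands in for the language processing model).
COOCCURRENCE = {
    ("c", "o"): 90, ("m", "o"): 80, ("e", "o"): 70,   # "com", "ome", ...
    ("c", "0"): 2, ("m", "0"): 1, ("e", "0"): 1,
}

def pick_target(candidates, adjacent):
    """Return the candidate with the highest total association
    with the first adjacent text characters."""
    def score(cand):
        return sum(COOCCURRENCE.get((a, cand), 0) for a in adjacent)
    return max(candidates, key=score)

target = pick_target(["0", "o"], ["c", "m", "e"])
```

With these counts the letters "c", "m" and "e" associate far more strongly with "o" than with "0", so "o" is chosen as the target text character.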
205. And the server replaces special characters in text characters of the text to be recognized with target text characters to obtain recognizable text characters of the text to be recognized.
For example, taking the text characters of the text to be recognized as "mvp979.C ohm", the special character as "C ohm" and the target text character as "com" as an example, the server replaces the special character "C ohm" in the text characters of the text to be recognized with "com", and the recognizable text characters of the text to be recognized, "mvp979.com", can be obtained.
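Step 205 amounts to a span replacement, which can be sketched minimally as follows (the strings are the example values from above):

```python
def make_recognizable(text, special, target):
    """Replace the special-character span with the target text character."""
    return text.replace(special, target)

recognizable = make_recognizable("mvp979.C ohm", "C ohm", "com")
```

After this replacement the text characters of the text to be recognized are all drawn from the preset text character library, so downstream feature extraction can proceed normally.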
206. And the server performs feature extraction on the recognizable text characters of the text to be recognized so as to obtain text features of the text to be recognized.
For example, the server may perform feature extraction on identifiable text characters in the text to be identified to obtain text features of the identifiable text characters, and fuse the text features of the identifiable text characters to obtain text features of the text to be identified, which may be specifically as follows:
(1) And the server performs feature extraction on the identifiable text characters in the text to be identified to obtain text features of the identifiable text characters.
For example, the server may sort the recognizable text characters in the text to be recognized, and determine the position of each text character among the recognizable text characters according to the sorting result, so as to obtain the second position information of the recognizable text characters in the text to be recognized. According to the size of the preset text feature box and the second position information, the second adjacent text characters of the recognizable text characters are screened out of the text to be recognized. For example, when the size of the preset text feature box is 5 text characters, 4 adjacent text characters are determined for each of the recognizable text characters: the second adjacent text characters of the first text character A in the text to be identified can be the 4 text characters to its right (A+1, A+2, A+3 and A+4), and the second adjacent text characters of the second text character B can be the first text character to its left (B-1) and the 3 text characters to its right (B+1, B+2 and B+3). In this way, the second adjacent text characters corresponding to all the recognizable text characters of the text to be identified can be determined.
The server can adopt the CBOW model in word2vec to perform feature extraction on the recognizable text characters; the feature extraction may use one-hot encoding to obtain the initial text feature of each of the recognizable text characters. The text features of the second adjacent text characters are then determined based on the initial text features of the recognizable text characters. For example, with a preset text feature box of 5 text characters, taking w_i as the initial text feature of one of the recognizable text characters, the second adjacent text characters w_{i-2}, w_{i-1}, w_{i+1} and w_{i+2} are screened out of the initial features of the recognizable text characters. Based on the text features of the second adjacent text characters, w_{i-2}, w_{i-1}, w_{i+1} and w_{i+2} are summed and averaged, and the initial text feature of the recognizable text character is adjusted according to the average value, for example, adjusted to the text feature corresponding to the average value, to obtain the text feature of the recognizable text character.
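The CBOW-style adjustment above can be sketched directly; this is a simplified illustration (one-hot initial features, a window of 5, plain averaging) rather than a full word2vec training loop.

```python
def one_hot(index, dim):
    """One-hot initial text feature for a character with the given index."""
    v = [0.0] * dim
    v[index] = 1.0
    return v

def cbow_features(char_indices, dim):
    """Adjust each character's feature to the average of w_{i-2}, w_{i-1},
    w_{i+1}, w_{i+2} (window size 5), clipped at the text boundaries."""
    initial = [one_hot(i, dim) for i in char_indices]
    adjusted = []
    for i in range(len(initial)):
        neighbors = [initial[j] for j in (i - 2, i - 1, i + 1, i + 2)
                     if 0 <= j < len(initial)]
        avg = [sum(col) / len(neighbors) for col in zip(*neighbors)]
        adjusted.append(avg)
    return adjusted

feats = cbow_features([0, 1, 2, 1], dim=3)
```

In a full CBOW model the averaged context vector would feed a trained projection rather than replacing the feature outright; the averaging step shown here is the part described in the text.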
(2) And the server performs feature fusion on the text features of the identifiable text characters to obtain the text features of the text to be identified.
For example, the server accumulates the word vectors corresponding to each of the recognizable text characters to obtain a sentence vector of the text to be recognized, which can be used as the first initial text feature of the text to be recognized.
(3) And the server screens out text characters which are not repeated from the identifiable text characters, and performs feature fusion on the text features of the text characters which are not repeated from each other to obtain second initial text features of the text to be identified.
For example, taking the example that recognizable text characters of the text to be recognized include A, A, B, C, C and D, the server screens out text characters A, B, C and D that do not overlap each other among these recognizable text characters. And then accumulating text features corresponding to the text characters A, B, C and D which are not repeated, namely word vectors, to obtain a second initial text feature of the text to be recognized.
(4) And the server splices the first initial text feature and the second initial text feature to obtain the text feature of the text to be recognized.
For example, taking the first initial text feature as a 100-dimensional sentence vector, taking the second initial text feature as an 80-dimensional sentence vector as an example, the server directly splices the 100-dimensional sentence vector and the 80-dimensional sentence vector to obtain a 180-dimensional spliced sentence vector, and taking the 180-dimensional sentence vector as the text feature of the text to be recognized.
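Steps (2) to (4) above can be condensed into one small sketch: the first initial feature sums the word vectors of all characters, the second sums only the de-duplicated characters, and the two sentence vectors are spliced by concatenation. The 2-dimensional word vectors are invented for the example.

```python
def text_feature(chars, vectors):
    """Build the text feature by splicing the first and second
    initial text features of the text to be recognized."""
    first = [sum(col) for col in zip(*(vectors[c] for c in chars))]
    unique = list(dict.fromkeys(chars))        # de-duplicate, keep order
    second = [sum(col) for col in zip(*(vectors[c] for c in unique))]
    return first + second                      # splice by concatenation

vecs = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0], "D": [0.5, 0.5]}
feat = text_feature(["A", "A", "B", "C", "C", "D"], vecs)
```

As in the 100-dimension plus 80-dimension example in the text, the spliced feature's dimension is the sum of the two parts' dimensions (here 2 + 2 = 4).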
207. And the server identifies the text to be identified according to the text characteristics of the text to be identified.
For example, the server may calculate a similarity between the text feature of the text to be recognized and the text feature in the preset text feature library, and recognize the text to be recognized according to the similarity. The method can be concretely as follows:
(1) And the server calculates the similarity between the text characteristics of the text to be identified and the text characteristics in a preset text characteristic library.
For example, the server may segment the sentence vector of the text to be recognized by using the Product Quantizer in faiss. Taking the text feature of the text to be recognized as a 100-dimensional sentence vector as an example, the server may segment the 100-dimensional sentence vector into a plurality of sub-text features whose total dimension is 100. For each sub-text feature, the corresponding target sub-text feature library can be quickly found by clustering the text features in the preset text feature library, and the server can determine the distance between the sub-text feature and each text feature in the target sub-text feature library according to the residuals after clustering, so as to obtain the initial similarity between the sub-text features of the text to be identified and the text features in the target sub-text feature library. The initial similarities are then fused to obtain the similarity between the text features of the text to be identified and the text features in the preset text feature library.
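The product-quantization idea attributed to faiss above can be sketched in pure Python. This is only a schematic of the segment-and-fuse computation (no clustering or residual codebooks, and the 4-dimensional features are invented); a real deployment would use the faiss library itself.

```python
def segment(vec, parts):
    """Split a sentence vector into equal-length sub-text features."""
    n = len(vec) // parts
    return [vec[i * n:(i + 1) * n] for i in range(parts)]

def sub_similarity(a, b):
    """Turn a sub-vector distance into an initial similarity in (0, 1]."""
    dist = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return 1.0 / (1.0 + dist)

def similarity(query, feature, parts=2):
    """Fuse per-segment initial similarities into one overall similarity."""
    subs = zip(segment(query, parts), segment(feature, parts))
    sims = [sub_similarity(a, b) for a, b in subs]
    return sum(sims) / len(sims)

library = {"normal": [1.0, 0.0, 1.0, 0.0], "junk": [0.0, 1.0, 0.0, 1.0]}
scores = {k: similarity([1.0, 0.0, 1.0, 0.1], v) for k, v in library.items()}
```

Segmenting first lets each short sub-vector be compared against a small clustered sub-library, which is what makes the faiss-style search fast at scale.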
The fusion mode may be various. For example, the server may sort the text features in each target sub-text feature library according to preset text types: the text features in each target sub-text feature library may be divided into a normal text feature set corresponding to normal text and a junk text feature set corresponding to junk text, and the text features in the normal text feature set and the junk text feature set are sorted according to the magnitude of the similarity. The sorted results of the normal text feature sets and junk text feature sets across the target sub-text feature libraries are then fused to obtain the similarity between the text features of the text to be identified and the text features in the preset text feature library. For example, suppose the text features of the text to be identified include n sub-text features; the similarities between the n sub-text features and the normal text features in the corresponding normal text feature sets can be multiplied to obtain the maximum similarity K1 between the text features of the text to be identified and the normal text features in the preset text feature library, and the similarities with the junk text features can be fused in the same way. The similarity may thus include a normal text similarity, measuring similarity to normal text features, and a junk text similarity, measuring similarity to junk text features.
Optionally, there is another way to fuse the initial similarities: the server segments the preset normal text features and junk text features, and clusters the segmented sub-normal text features and sub-junk text features to obtain a plurality of target sub-text feature libraries. For example, the preset normal text feature A is divided into 3 sub-normal text features A1, A2 and A3, and the preset junk text feature B is divided into 3 sub-junk text features B1, B2 and B3. The initial similarities of A1, A2 and A3 are multiplied to obtain the similarity between the preset normal text feature and the text feature to be identified, which may be called the normal text similarity; similarly, the initial similarities of B1, B2 and B3 are multiplied to obtain the similarity between the preset junk text feature and the text feature to be identified, which may be called the junk text similarity. It should be noted that if a sub-text feature of a preset normal text feature has no text feature similar to the sub-text features of the text to be recognized, that sub-text feature may be discarded. For example, the preset normal text feature A includes 3 sub-normal text features A1, A2 and A3, but the similarity between the sub-text feature A2 and the sub-text features of the text to be recognized is 0; when calculating the similarity between the preset normal text feature A and the text features of the text to be recognized, only the initial similarities of A1 and A3 may be multiplied. The same calculation method can be adopted for the preset junk text features.
(2) And the server identifies the text to be identified according to the similarity.
For example, the server may screen out target text features whose similarity falls within a preset range from the preset text feature library, for example, the 5 target text features whose similarity exceeds a preset similarity threshold, or the target text features whose similarity ranks in the top 5. When the target text features are all preset normal text features, the text to be recognized can be determined to be normal text; when the target text features are all preset junk text features, the text to be recognized can be determined to be junk text. When both preset normal text features and preset junk text features exist among the target text features, the server screens out, from the similarities corresponding to the target text features, the similarities between the text features of the text to be identified and the preset normal text features, called the normal text similarities, as well as the similarities between the text features of the text to be identified and the preset junk text features, called the junk text similarities, and identifies the text to be identified according to the normal text similarities and the junk text similarities.
For example, the normal text similarities and the junk text similarities are weighted respectively to obtain a first weighted value for the normal text similarity and a second weighted value for the junk text similarity. Taking the case where the target text features contain 2 preset normal text features and 3 preset junk text features as an example, the similarities of the 2 preset normal text features are weighted according to their corresponding weights to obtain the first weighted value corresponding to the normal text similarity, and the similarities of the 3 preset junk text features are weighted according to their corresponding weights to obtain the second weighted value corresponding to the junk text similarity. By comparing the two weighted values, the text to be recognized can be recognized and the recognition result determined: when the first weighted value exceeds the second weighted value, the text to be recognized can be determined to be normal text, and when the first weighted value does not exceed the second weighted value, the text to be recognized can be determined to be junk text.
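The weighted comparison just described reduces to a few lines; the similarity values and weights below are illustrative, not taken from any real model.

```python
def classify(normal_sims, junk_sims, normal_weights, junk_weights):
    """Compare the first weighted value (normal) against the second (junk)."""
    first = sum(s * w for s, w in zip(normal_sims, normal_weights))
    second = sum(s * w for s, w in zip(junk_sims, junk_weights))
    return "normal" if first > second else "junk"

# 2 target normal text features and 3 target junk text features, as in the text
label = classify(normal_sims=[0.9, 0.8], junk_sims=[0.3, 0.2, 0.4],
                 normal_weights=[0.5, 0.5], junk_weights=[0.4, 0.3, 0.3])
```

Note the tie-breaking matches the text: only when the first weighted value strictly exceeds the second is the text classified as normal.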
Optionally, when the text to be identified is a junk text, the server needs to intercept the text to be identified, and different interception measures can be adopted according to the different states of the text to be identified. For example, when the text to be identified has been published on the internet, the server can recall and delete the text to be identified; when the text to be identified has not been published on the internet, the server intercepts and deletes the text to be identified in its database. In the process of intercepting the text to be identified, the server can also acquire the identity of the terminal that sent or published the text to be identified, review other text information published by the terminal corresponding to that identity, and determine the proportion of published junk text to all text information according to the review result; if the proportion of published junk text to all text information exceeds a preset proportion threshold, interception measures such as prohibiting the terminal corresponding to the identity from publishing text information can also be applied.
Optionally, when the text to be recognized is a normal text, in this case, the server does not need to intercept the text to be recognized, but rather passes the text to be recognized, for example, when the text to be recognized is already published to the internet, the server does not need to process the text to be recognized, and when the text to be recognized is not yet published to the internet, the server can pass the text to be recognized, so that the text to be recognized can be published to the internet.
As can be seen from the foregoing, in the embodiment of the present application, after a text to be recognized and the font used by the text to be recognized in the service platform are obtained, where the text to be recognized includes a plurality of text characters, text characters that do not belong to a preset text character library are screened out from the text to be recognized to obtain special characters; the special characters are converted into images according to the font to obtain character images; an image recognition model is used to recognize the character images so as to screen out candidate text characters similar to the character images from the preset text character library, where the image recognition model is trained on a plurality of character image samples obtained by converting text characters in the preset text character library into images according to different fonts; a target text character corresponding to the special character is determined among the candidate text characters according to context information of the special character; and the text to be recognized is recognized based on the target text character. According to this scheme, special characters can be screened out of the text to be identified and converted into text characters in the preset text character library, so that the recognition accuracy of the text characters in the text to be identified can be greatly improved, and the accuracy of junk text identification is greatly improved accordingly.
In order to better implement the above method, the embodiment of the present invention further provides a text recognition device, where the text recognition device may be integrated into an electronic device, such as a server or a terminal, where the terminal may include a tablet computer, a notebook computer, and/or a personal computer.
For example, as shown in fig. 9, the text recognition apparatus may include an acquisition unit 301, a conversion unit 302, a screening unit 303, a determination unit 304, and a recognition unit 305, as follows:
(1) An acquisition unit 301;
the obtaining unit 301 is configured to obtain a text to be identified and a font used by the text to be identified in the service platform, where the text to be identified includes a plurality of text characters.
For example, the obtaining unit 301 may be specifically configured to obtain, in a server or a database of a social or instant messaging system, UGC texts sent or uploaded by a user, and periodically crawl UGC texts generated in a current time period from the internet, where the UGC texts are used as texts to be identified. When the UGC text is acquired, fonts used in the service platform in which the UGC text is manufactured and displayed can also be acquired together.
(2) A conversion unit 302;
and the conversion unit 302, configured to screen text characters not belonging to the preset text character library from the text to be recognized to obtain special characters, and convert the special characters into images according to the fonts, so as to obtain character images.
For example, the converting unit 302 may be specifically configured to compare and match the characters in the text characters of the text to be recognized with the characters in the preset text character library, screen out the characters that do not belong to the preset text character library to obtain the special characters, and convert the special characters into images according to the fonts, so as to obtain character images.
(3) A screening unit 303;
and a screening unit 303, configured to identify the character image by using the image identification model, so as to screen candidate text characters similar to the character image from a preset text character library.
For example, the filtering unit 303 may be specifically configured to perform multi-scale feature extraction on a character image by using an image recognition model to obtain local feature information corresponding to different scales, perform linear classification on the local feature information corresponding to different scales through a hidden layer, fuse the classified local feature information to obtain global feature information corresponding to the character image, and screen candidate text characters similar to the character image from a preset text character library according to the global feature information.
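One way to picture the extraction, fusion, and candidate screening the screening unit 303 performs is the toy sketch below: average pooling at two window sizes stands in for multi-scale feature extraction, concatenation stands in for the hidden-layer fusion, and cosine similarity against precomputed library features selects the candidates. All names and the 0.9 threshold are illustrative assumptions, not details from the patent.

```python
# Toy stand-in for multi-scale feature extraction + fusion + screening.
import math

def pool(pixels, window):
    """Average-pool a flattened pixel list at one scale (window size)."""
    return [sum(pixels[i:i + window]) / window
            for i in range(0, len(pixels) - window + 1, window)]

def multi_scale_features(pixels, scales=(2, 4)):
    """Extract local features at several scales and fuse them by
    concatenation into one global feature vector."""
    feats = []
    for s in scales:
        feats.extend(pool(pixels, s))
    return feats

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (math.sqrt(sum(x * x for x in a))
           * math.sqrt(sum(y * y for y in b)))
    return num / den if den else 0.0

def screen_candidates(global_feat, library_feats, threshold=0.9):
    """Return library characters whose stored features are similar enough
    to the character image's global feature."""
    return [ch for ch, f in library_feats.items()
            if cosine(global_feat, f) >= threshold]
```

A real implementation would replace the pooling with convolutional feature maps and store one feature vector per preset-library character rendered in each font.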
(4) A determination unit 304;
and the determining unit 304 is configured to determine, according to the context information of the special character in the text to be recognized, a target text character corresponding to the special character from the candidate text characters.
For example, the determining unit 304 may be specifically configured to, when only one candidate text character similar to the character image is screened out, take that candidate text character as the target text character corresponding to the special character; and when multiple candidate text characters similar to the character image are screened out, determine the target text character corresponding to the special character from the candidate text characters according to the context information of the special character in the text to be identified.
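The determining step can be imagined as the small sketch below: when several library characters look alike in image space, the one whose pairing with the adjacent character is most plausible wins. The bigram table and all names are invented for illustration; a real implementation would use a language model or co-occurrence statistics over the context.

```python
# Illustrative only: disambiguate visually similar candidates using the
# character adjacent to the special character. BIGRAM_SCORES is a toy
# stand-in for real association/co-occurrence statistics.
BIGRAM_SCORES = {("g", "o"): 0.9, ("g", "0"): 0.1}

def pick_target(candidates, left_neighbor):
    """Pick the candidate best associated with the neighboring character."""
    if len(candidates) == 1:
        return candidates[0]               # unique match needs no context
    return max(candidates,
               key=lambda c: BIGRAM_SCORES.get((left_neighbor, c), 0.0))
```

So for the candidates `0` and `o` following a `g`, the pairing `("g", "o")` scores higher and `o` is chosen as the target text character.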
(5) An identification unit 305;
a recognition unit 305, configured to recognize a text to be recognized based on the target text character;
the recognition unit 305 may further include a replacement subunit 3051, an extraction subunit 3052, and a recognition subunit 3053, as shown in fig. 10:
a replacing subunit 3051, configured to replace a special character in a text character of the text to be identified with a target text character, so as to obtain an identifiable text character of the text to be identified;
the extracting subunit 3052 is configured to perform feature extraction on identifiable text characters of the text to be identified, so as to obtain text features of the text to be identified;
and the recognition subunit 3053 is configured to identify the text to be identified according to the text features of the text to be identified.
For example, the replacing subunit 3051 replaces the special characters in the text characters of the text to be identified with the target text characters to obtain identifiable text characters of the text to be identified; the extracting subunit 3052 performs feature extraction on the identifiable text characters to obtain text features of the text to be identified; and the recognition subunit 3053 identifies the text to be identified according to those text features.
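The three subunits chain together roughly as follows: the replacement step is plain string substitution, and a normalized character-frequency vector stands in for whatever text features the real extraction subunit would produce. Both function names are hypothetical.

```python
# Toy sketch of the replacement and extraction subunits.
from collections import Counter

def replace_special(text, targets):
    """Replacement subunit: swap each special character for its target."""
    return "".join(targets.get(ch, ch) for ch in text)

def text_features(text):
    """Extraction subunit (toy): normalized character frequencies."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

recognizable = replace_special("fr€e", {"€": "e"})  # -> identifiable text
features = text_features(recognizable)              # -> features for recognition
```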
Optionally, the text recognition device may further include a collection unit 306 and a training unit 307, as shown in fig. 11, specifically as follows:
the collection unit 306 is configured to collect character image samples, where a character image sample is a character image labeled with its text character;
the training unit 307 is configured to predict, by using a preset image recognition model, the text character in the preset text character library corresponding to a character image sample to obtain a prediction result, and to converge the preset image recognition model according to the prediction result and the labeling result of the character image sample, so as to obtain the image recognition model.
For example, the collection unit 306 collects character image samples, each being a character image labeled with its text character; the training unit 307 then predicts, with the preset image recognition model, the text character in the preset text character library corresponding to each sample to obtain a prediction result, and converges the preset image recognition model according to the prediction result and the labeling result of the sample, so as to obtain the trained image recognition model.
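As a rough picture of this predict-then-converge loop, the sketch below uses a one-layer perceptron as a toy stand-in for the image recognition model: it predicts a library-character class for each labeled sample and updates its weights (the patent's "converging") whenever prediction and label disagree. This is an illustration of the training pattern only, not the patent's actual model.

```python
# Toy stand-in for the training step of the image recognition model.
def predict(W, x):
    """Score each class (library character) and return the best index."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in W]
    return scores.index(max(scores))

def train(samples, labels, num_classes, epochs=20, lr=0.1):
    """samples: feature vectors; labels: class index per sample."""
    dim = len(samples[0])
    W = [[0.0] * dim for _ in range(num_classes)]  # one weight row per class
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = predict(W, x)                   # prediction result
            if pred != y:                          # mismatch -> converge
                W[y] = [w + lr * v for w, v in zip(W[y], x)]
                W[pred] = [w - lr * v for w, v in zip(W[pred], x)]
    return W
```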
In implementation, the above units may each be implemented as an independent entity, or be combined arbitrarily and implemented as one or several entities; for the implementation of each unit, reference may be made to the foregoing method embodiments, which are not repeated here.
As can be seen from the foregoing, in this embodiment, the obtaining unit 301 obtains the text to be recognized, which includes a plurality of text characters, and the font used by the service platform to which the text belongs. The converting unit 302 screens out text characters not belonging to a preset text character library from the text to be recognized to obtain special characters, and converts the special characters into images according to the font to obtain character images. The screening unit 303 identifies the character images by using an image recognition model, so as to screen candidate text characters similar to the character images from the preset text character library, where the image recognition model is trained on a plurality of character image samples, and the character image samples are images converted from the text characters in the preset text character library according to different fonts. The determining unit 304 determines the target text characters corresponding to the special characters from the candidate text characters according to the context information of the special characters, and the recognizing unit 305 recognizes the text to be recognized based on the target text characters. Because this scheme screens the special characters out of the text to be identified and converts them back into text characters of the preset text character library, the accuracy with which the text characters of the text to be identified are recognized can be greatly improved, and the accuracy of junk text identification is improved accordingly.
The embodiment of the invention also provides an electronic device, as shown in fig. 12, which shows a schematic structural diagram of the electronic device according to the embodiment of the invention, specifically:
the electronic device may include one or more processing cores 'processors 401, one or more computer-readable storage media's memory 402, power supply 403, and input unit 404, among other components. Those skilled in the art will appreciate that the electronic device structure shown in fig. 12 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or may be arranged in different components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the electronic device, etc. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of managing charging, discharging, and power consumption are performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may further comprise an input unit 404, which input unit 404 may be used for receiving input digital or character information and generating keyboard, mouse, joystick, optical or trackball signal inputs in connection with user settings and function control.
Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:
the method comprises the steps of obtaining a text to be recognized and fonts used by the text to be recognized in a service platform, wherein the text to be recognized comprises a plurality of text characters, screening text characters which do not belong to a preset text character library from the text to be recognized, obtaining special characters, converting the special characters into images according to the fonts, obtaining character images, recognizing the character images by adopting an image recognition model to screen candidate text characters similar to the character images from the preset text character library, training the image recognition model by a plurality of character image samples, converting the character image samples into images according to different fonts by the text characters in the preset text character library, determining target text characters corresponding to the special characters from the candidate text characters according to context information of the special characters in the text to be recognized, and recognizing the text to be recognized based on the target text characters.
For example, the text to be recognized and the font used by the text to be recognized in the service platform are obtained; the characters of the text are compared and matched against the characters in the preset text character library; the characters that do not belong to the preset text character library are screened out to obtain the special characters; and the special characters are converted into images according to the font to obtain character images. Multi-scale feature extraction is then performed on each character image by using the image recognition model to obtain local feature information corresponding to different scales; the local feature information corresponding to the different scales is linearly classified through a hidden layer; the classified local feature information is fused to obtain global feature information corresponding to the character image; and candidate text characters similar to the character image are screened from the preset text character library according to the global feature information. When only one candidate text character similar to the character image is screened out, that candidate text character is taken as the target text character corresponding to the special character; when multiple candidate text characters are screened out, the target text character corresponding to the special character is determined from the candidate text characters according to the context information of the special character in the text to be identified.
And replacing special characters in text characters of the text to be identified with target text characters to obtain identifiable text characters of the text to be identified, extracting features of the identifiable text characters of the text to be identified to obtain text features of the text to be identified, and identifying the text to be identified according to the text features of the text to be identified. When the text to be identified is the junk text, intercepting the text to be identified.
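The final recognition-and-interception decision can be sketched as below: the text's feature vector is compared against preset normal and junk feature libraries, the two similarity groups are weighted, and the text is treated as junk (and intercepted) when the junk side wins. All names, weights, and the cosine measure are illustrative assumptions rather than details fixed by the patent.

```python
# Toy sketch of the normal-vs-junk decision over preset feature libraries.
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (math.sqrt(sum(x * x for x in a))
           * math.sqrt(sum(y * y for y in b)))
    return num / den if den else 0.0

def classify(feat, normal_feats, junk_feats, w_normal=1.0, w_junk=1.0):
    """Weight the best similarity on each side; junk wins ties."""
    s_normal = max(cosine(feat, f) for f in normal_feats)
    s_junk = max(cosine(feat, f) for f in junk_feats)
    return "normal" if w_normal * s_normal > w_junk * s_junk else "junk"
```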
The specific implementation of each operation may be referred to the previous embodiments, and will not be described herein.
As can be seen from the foregoing, after the text to be recognized, which includes a plurality of text characters, and the font used by its service platform are obtained, text characters that do not belong to a preset text character library are screened out from the text to be recognized to obtain special characters, and the special characters are converted into images according to the font to obtain character images. An image recognition model is used to recognize the character images, so that candidate text characters similar to the character images are screened from the preset text character library; the image recognition model is trained on a plurality of character image samples, and the character image samples are converted from the text characters in the preset text character library according to different fonts. A target text character corresponding to each special character is determined from the candidate text characters according to the context information of the special character, and the text to be recognized is recognized based on the target text characters. Because this scheme screens the special characters out of the text to be identified and converts them into text characters of the preset text character library, the accuracy with which the text characters of the text to be identified are recognized can be greatly improved, and the accuracy of junk text identification is improved accordingly.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform any of the steps of the text recognition method provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
the method comprises the steps of obtaining a text to be recognized and fonts used by the text to be recognized in a service platform, wherein the text to be recognized comprises a plurality of text characters, screening text characters which do not belong to a preset text character library from the text to be recognized, obtaining special characters, converting the special characters into images according to the fonts, obtaining character images, recognizing the character images by adopting an image recognition model to screen candidate text characters similar to the character images from the preset text character library, training the image recognition model by a plurality of character image samples, converting the character image samples into images according to different fonts by the text characters in the preset text character library, determining target text characters corresponding to the special characters from the candidate text characters according to context information of the special characters in the text to be recognized, and recognizing the text to be recognized based on the target text characters.
For example, the text to be recognized and the font used by the text to be recognized in the service platform are obtained; the characters of the text are compared and matched against the characters in the preset text character library; the characters that do not belong to the preset text character library are screened out to obtain the special characters; and the special characters are converted into images according to the font to obtain character images. Multi-scale feature extraction is then performed on each character image by using the image recognition model to obtain local feature information corresponding to different scales; the local feature information corresponding to the different scales is linearly classified through a hidden layer; the classified local feature information is fused to obtain global feature information corresponding to the character image; and candidate text characters similar to the character image are screened from the preset text character library according to the global feature information. When only one candidate text character similar to the character image is screened out, that candidate text character is taken as the target text character corresponding to the special character; when multiple candidate text characters are screened out, the target text character corresponding to the special character is determined from the candidate text characters according to the context information of the special character in the text to be identified.
And replacing special characters in text characters of the text to be identified with target text characters to obtain identifiable text characters of the text to be identified, extracting features of the identifiable text characters of the text to be identified to obtain text features of the text to be identified, and identifying the text to be identified according to the text features of the text to be identified. When the text to be identified is the junk text, intercepting the text to be identified.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Wherein the computer-readable storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Because the instructions stored in the computer readable storage medium can execute the steps in any text recognition method provided by the embodiments of the present invention, the beneficial effects that any text recognition method provided by the embodiments of the present invention can achieve can be achieved, which are detailed in the previous embodiments and are not described herein.
The foregoing has described in detail a text recognition method, an apparatus, and a computer-readable storage medium provided by embodiments of the present invention. Specific examples have been applied herein to illustrate the principles and implementations of the present invention, and the above description of the embodiments is only intended to aid understanding of the method and its core idea. Meanwhile, those skilled in the art may make variations to the specific embodiments and the application scope in light of the ideas of the present invention; in view of this, the contents of this description should not be construed as limiting the present invention.

Claims (16)

1. A method of text recognition, comprising:
acquiring a text to be identified and font information used by the text to be identified in a service platform, wherein the text to be identified comprises a plurality of text characters, and the font information comprises font types, font sizes and font colors;
selecting text characters which do not belong to a preset text character library from the text to be identified, obtaining special characters, and converting the special characters into images according to the font information to obtain character images;
identifying the character image by adopting an image identification model to screen candidate text characters similar to the character image from the preset text character library, wherein the image identification model is trained by a plurality of character image samples, and the character image samples are images formed by converting text characters in the preset text character library according to different font information;
when a plurality of candidate text characters similar to the character image are screened out, screening out a first adjacent text character of the special character from the text to be identified according to the context information of the special character in the text to be identified;
Determining association information of the first adjacent text character and the candidate text character;
determining a target text character corresponding to the special character from the candidate text characters according to the association information;
and identifying the text to be identified based on the target text characters to obtain the text types of the text to be identified, wherein the text types comprise normal texts and junk texts.
2. The text recognition method of claim 1, wherein the employing an image recognition model to recognize the character image to screen candidate text characters similar to the character image from the preset text character library comprises:
carrying out multi-scale feature extraction on the character image by adopting an image recognition model to obtain local feature information corresponding to different scales;
fusing the local feature information to obtain global feature information of the character image;
and screening one or more candidate text characters similar to the character image from the preset text character library according to the global characteristic information.
3. The text recognition method of claim 2, wherein the method further comprises:
And when the number of the candidate text characters which are similar to the character image is one, taking the candidate text characters as target text characters corresponding to the special characters.
4. The text recognition method according to claim 1, wherein the recognizing the text to be recognized based on the target text character includes:
replacing special characters in text characters of the text to be identified with the target text characters to obtain identifiable text characters of the text to be identified, wherein the identifiable text characters can be identified through the preset text character library;
extracting features of the identifiable text characters of the text to be identified to obtain text features of the text to be identified;
and identifying the text to be identified according to the text characteristics of the text to be identified.
5. The method for recognizing text according to claim 4, wherein the feature extraction of the identifiable text characters of the text to be identified to obtain the text features of the text to be identified comprises:
extracting features of the identifiable text characters of the text to be identified to obtain text features of the identifiable text characters;
And fusing the text characteristics of the identifiable text characters to obtain the text characteristics of the text to be identified.
6. The method for recognizing text according to claim 5, wherein the feature extraction of the recognizable text characters of the text to be recognized to obtain the text features of the recognizable text characters comprises:
acquiring position information of the identifiable text character in the text to be identified;
screening out a second adjacent text character of the identifiable text character from the text to be identified according to the position information;
and extracting the characteristics of the second adjacent text characters to obtain the text characteristics of the identifiable text characters.
7. The text recognition method of claim 6, wherein the feature extracting the second adjacent text character to obtain the text feature of the recognizable text character comprises:
extracting features of the identifiable text characters to obtain initial text features of the identifiable text characters;
determining text features of the second adjacent text characters based on the initial text features of the recognizable text characters;
And adjusting the initial text characteristics of the identifiable text characters based on the text characteristics of the second adjacent text characters to obtain the text characteristics of the identifiable text characters.
8. The method for recognizing text according to claim 5, wherein the fusing text features of the recognizable text characters to obtain text features of the text to be recognized comprises:
fusing the text features of the identifiable text characters to obtain first initial text features of the text to be identified;
screening text characters which are not repeated from the identifiable text characters;
performing feature fusion on the text features of the text characters which are not repeated to obtain second initial text features of the text to be recognized;
and splicing the first initial text feature and the second initial text feature to obtain the text feature of the text to be identified.
9. The text recognition method according to claim 4, wherein the recognizing the text to be recognized according to the text characteristics of the text to be recognized includes:
calculating the similarity between the text features of the text to be identified and the text features in a preset text feature library;
And identifying the text to be identified according to the similarity.
10. The text recognition method of claim 9, wherein the pre-set text feature library comprises a plurality of sub-text feature libraries, and the calculating the similarity between the text features of the text to be recognized and the text features in the pre-set text feature library comprises:
segmenting the text features of the text to be identified to obtain a plurality of sub-text features;
determining, by clustering, a target sub-text feature library corresponding to the sub-text features from the preset text feature library;
calculating initial similarity between the sub-text features and text features in the target sub-text feature library;
and fusing the initial similarity to obtain the similarity between the text features of the text to be identified and the text features in a preset text feature library.
11. The text recognition method according to claim 9, wherein the recognizing the text to be recognized according to the similarity includes:
screening a preset number of target preset text features from the preset text feature library according to the similarity, wherein the preset text feature library comprises preset normal text features and preset junk text features;
When all the target preset text features are preset normal text features, determining that the text to be recognized is a normal text;
when all the target preset text features are preset junk text features, determining that the text to be identified is junk text;
when the preset normal text features and the preset junk text features exist in the target preset text features, normal text similarity and junk text similarity are selected from the similarity, the text to be identified is identified according to the normal text similarity and the junk text similarity, the normal text similarity is the similarity between the text features of the text to be identified and the preset normal text features, and the junk text similarity is the similarity between the text features of the text to be identified and the preset junk text features.
12. The text recognition method according to claim 11, wherein the selecting the normal text similarity and the junk text similarity from the similarities, and recognizing the text to be recognized according to the normal text similarity and the junk text similarity, includes:
screening out normal text similarity and junk text similarity from the similarity;
Respectively weighting the normal text similarity and the junk text similarity to obtain a first weighted value of the normal text similarity and a second weighted value of the junk text similarity;
when the first weighted value exceeds the second weighted value, determining that the text to be recognized is a normal text;
and when the first weighted value does not exceed the second weighted value, determining that the text to be recognized is junk text.
13. The text recognition method of claim 12, further comprising:
and when the text to be identified is the junk text, intercepting the text to be identified.
14. A text recognition device, comprising:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a text to be identified and font information used by the text to be identified in a service platform, and the text to be identified comprises a plurality of text characters;
the conversion unit is used for screening text characters which do not belong to a preset text character library from the text to be identified to obtain special characters, and converting the special characters into images according to the font information to obtain character images;
the screening unit is used for identifying the character images by adopting an image identification model so as to screen candidate text characters similar to the character images from the preset text character library, the image identification model is trained by a plurality of character image samples, and the character image samples are images formed by converting text characters in the preset text character library according to different font information;
A determining unit, configured to screen, when the number of candidate text characters that are similar to the character image that are screened out is a plurality of, a first adjacent text character of the special character in the text to be identified according to the context information of the special character in the text to be identified; determining association information of the first adjacent text character and the candidate text character; determining a target text character corresponding to the special character from the candidate text characters according to the association information;
the recognition unit is used for recognizing the text to be recognized based on the target text characters to obtain the text type of the text to be recognized, wherein the text type comprises a normal text and a junk text.
15. An electronic device comprising a memory and a processor; the memory stores an application program, and the processor is configured to execute the application program in the memory to perform the steps in the text recognition method of any one of claims 1 to 13.
16. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps in the text recognition method of any of claims 1 to 13.
CN202010298644.9A 2020-04-16 2020-04-16 Text recognition method and device Active CN111507350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010298644.9A CN111507350B (en) 2020-04-16 2020-04-16 Text recognition method and device


Publications (2)

Publication Number Publication Date
CN111507350A CN111507350A (en) 2020-08-07
CN111507350B true CN111507350B (en) 2024-01-05

Family

ID=71864182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010298644.9A Active CN111507350B (en) 2020-04-16 2020-04-16 Text recognition method and device

Country Status (1)

Country Link
CN (1) CN111507350B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329769A (en) * 2020-10-27 2021-02-05 广汽本田汽车有限公司 Vehicle nameplate identification method and device, computer equipment and storage medium
CN112464180A (en) * 2020-11-25 2021-03-09 平安信托有限责任公司 Page screenshot outgoing control method and system, electronic device and storage medium
CN113221752A (en) * 2021-05-13 2021-08-06 北京惠朗时代科技有限公司 Multi-template matching-based multi-scale character accurate identification method
CN113962199B (en) * 2021-12-20 2022-04-08 腾讯科技(深圳)有限公司 Text recognition method, text recognition device, text recognition equipment, storage medium and program product
CN114529930B (en) * 2022-01-13 2024-03-01 上海森亿医疗科技有限公司 PDF restoration method, storage medium and device based on nonstandard mapping fonts
CN115545009B (en) * 2022-12-01 2023-07-07 中科雨辰科技有限公司 Data processing system for acquiring target text

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477544A (en) * 2009-01-12 2009-07-08 腾讯科技(深圳)有限公司 Rubbish text recognition method and system
CN101976253A (en) * 2010-10-27 2011-02-16 重庆邮电大学 Chinese variation text matching recognition method
CN103514174A (en) * 2012-06-18 2014-01-15 北京百度网讯科技有限公司 Text categorization method and device
CN103902993A (en) * 2012-12-28 2014-07-02 佳能株式会社 Document image identification method and device
CN106682666A (en) * 2016-12-29 2017-05-17 成都数联铭品科技有限公司 Characteristic template manufacturing method for unusual font OCR identification
CN109614610A (en) * 2018-11-27 2019-04-12 新华三大数据技术有限公司 Similar Text recognition methods and device
CN110263781A (en) * 2018-03-12 2019-09-20 精工爱普生株式会社 Image processing apparatus, image processing method and storage medium
CN110472234A (en) * 2019-07-19 2019-11-19 平安科技(深圳)有限公司 Sensitive text recognition method, device, medium and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9058644B2 (en) * 2013-03-13 2015-06-16 Amazon Technologies, Inc. Local image enhancement for text recognition


Also Published As

Publication number Publication date
CN111507350A (en) 2020-08-07

Similar Documents

Publication Publication Date Title
CN111507350B (en) Text recognition method and device
CN110580292B (en) Text label generation method, device and computer readable storage medium
AU2020327704B2 (en) Classification of data using aggregated information from multiple classification modules
CN111666502A (en) Abnormal user identification method and device based on deep learning and storage medium
US20200004815A1 (en) Text entity detection and recognition from images
CN107145516B (en) Text clustering method and system
CN110598019B (en) Repeated image identification method and device
CN112559747B (en) Event classification processing method, device, electronic equipment and storage medium
CN114896305A (en) Smart internet security platform based on big data technology
Lin et al. Rumor detection with hierarchical recurrent convolutional neural network
CN112836509A (en) Expert system knowledge base construction method and system
CN111401063B (en) Text processing method and device based on multi-pool network and related equipment
CN113961666B (en) Keyword recognition method, apparatus, device, medium, and computer program product
CN116186268A (en) Multi-document abstract extraction method and system based on Capsule-BiGRU network and event automatic classification
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN112989058B (en) Information classification method, test question classification method, device, server and storage medium
CN114817633A (en) Video classification method, device, equipment and storage medium
JP2011128924A (en) Comic image analysis apparatus, program, and search apparatus and method for extracting text from comic image
Putra et al. Hate speech detection using convolutional neural network algorithm based on image
CN114443904A (en) Video query method, video query device, computer equipment and computer readable storage medium
CN113869068A (en) Scene service recommendation method, device, equipment and storage medium
CN115130453A (en) Interactive information generation method and device
CN112632229A (en) Text clustering method and device
CN114880572B (en) Intelligent news client recommendation system
CN116612501B (en) Object recognition method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40029151

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant