CN109190615B - Shape-near word recognition determination method, device, computer device and storage medium

Info

Publication number
CN109190615B
Authority
CN
China
Prior art keywords
stroke order
image feature
image
stroke
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810834750.7A
Other languages
Chinese (zh)
Other versions
CN109190615A (en
Inventor
徐庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Guofang Software Technology Co ltd
Xu Qing
Foshan Guofang Identification Technology Co Ltd
Original Assignee
Foshan Guofang Trademark Identification Technology Co ltd
Foshan Guofang Trademark Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Guofang Trademark Identification Technology Co ltd, Foshan Guofang Trademark Service Co ltd
Priority to CN201810834750.7A
Publication of CN109190615A
Application granted
Publication of CN109190615B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Abstract

The invention relates to a shape-near character recognition and determination method, a device, computer equipment and a storage medium. The method comprises the following steps: identifying an input element, extracting the associated information of the input element, and acquiring the characters corresponding to the input element; performing stroke order coding on those characters to obtain a whole stroke order combination unit; performing stroke order word segmentation processing on the whole stroke order combination unit to obtain local stroke order combination units; searching a sample image database with the whole stroke order combination unit and the local stroke order combination units as keywords to obtain stroke-order shape-near characters and their image feature approximation rates; and confirming the stroke-order shape-near characters whose image feature approximation rate meets the application requirement as the shape-near characters of the input element. The invention can improve the recall and precision of shape-near character retrieval and improve recognition accuracy.

Description

Shape-near word recognition determination method, device, computer device and storage medium
Technical Field
The present application relates to the field of image recognition technologies, and in particular to a shape-near character recognition and determination method and apparatus, a computer device, and a storage medium.
Background
Shape-near characters are Chinese characters with similar glyph structures. They appear in groups: a single Chinese character is not by itself a shape-near character, and two characters can be called shape-near only when a comparison shows that their glyph structures are similar. Judging whether characters are shape-near is subjective to a certain extent and is influenced by various factors, so different people may reach inconsistent judgments. Especially when the specific comparison object is unknown, accurately finding and judging shape-near characters has always been a difficult problem in shape-near character information retrieval.
At present, shape-near characters are obtained mainly by manually building a shape-near character library. In the course of implementation, the inventor found that the traditional technology has at least the following problem: shape-near characters are easily missed during retrieval; for example, current shape-near character libraries cannot meet the need to judge shape-near characters in dynamic writing.
Disclosure of Invention
In view of the above, it is desirable to provide a shape-near character recognition and determination method, apparatus, computer device, and storage medium that can effectively overcome the problem that shape-near characters are easily missed in shape-near character retrieval.
To achieve the above object, in one aspect, an embodiment of the present invention provides a shape-near character recognition and determination method, including:
identifying an input element, extracting the associated information of the input element, and acquiring the characters corresponding to the input element; the input element is an input image, or an image obtained by converting input characters into an image according to a preset writing font; the associated information of the input element comprises an image feature descriptor, image feature descriptor minimum units, and combination unit data;
performing stroke order coding on the characters corresponding to the input element to obtain a whole stroke order combination unit;
performing stroke order word segmentation processing on the whole stroke order combination unit to obtain local stroke order combination units; the stroke order word segmentation processing comprises segmentation according to a preset segmentation rule and combination according to a preset combination rule;
searching a sample image database with the whole stroke order combination unit and the local stroke order combination units as keywords to obtain matched sample characters, and acquiring the associated information of the sample images corresponding to the sample characters; the associated information of a sample image comprises the image feature descriptor, the image feature descriptor minimum units, and the combination unit data of the sample image;
confirming the sample characters as stroke-order shape-near characters of the input element, and confirming the associated information of the sample images as the associated information of the stroke-order shape-near characters; comparing the image features of the input element and the sample images according to the associated information of the input element and the associated information of the stroke-order shape-near characters to obtain the image feature approximation rate of each stroke-order shape-near character;
and confirming the stroke-order shape-near characters whose image feature approximation rate meets the application requirement as the shape-near characters of the input element.
In another aspect, an embodiment of the present invention further provides a shape-near character recognition and determination device, including:
a recognition and extraction module, used for identifying the input element, extracting the associated information of the input element, and acquiring the characters corresponding to the input element; the input element is an input image, or an image obtained by converting input characters into an image according to a preset writing font; the associated information of the input element comprises an image feature descriptor, image feature descriptor minimum units, and combination unit data;
a coding module, used for performing stroke order coding on the characters corresponding to the input element to obtain a whole stroke order combination unit;
a word segmentation module, used for performing stroke order word segmentation processing on the whole stroke order combination unit to obtain local stroke order combination units; the stroke order word segmentation processing comprises segmentation according to a preset segmentation rule and combination according to a preset combination rule;
a retrieval module, used for searching the sample image database with the whole stroke order combination unit and the local stroke order combination units as keywords, obtaining matched sample characters, and acquiring the associated information of the sample images corresponding to the sample characters; the associated information of a sample image comprises the image feature descriptor, the image feature descriptor minimum units, and the combination unit data of the sample image;
an image feature approximation rate acquisition module, used for confirming the sample characters as stroke-order shape-near characters of the input element and confirming the associated information of the sample images as the associated information of the stroke-order shape-near characters, and for comparing the image features of the input element and the sample images according to the associated information of the input element and the associated information of the stroke-order shape-near characters to obtain the image feature approximation rate of each stroke-order shape-near character;
and a selecting module, used for confirming the stroke-order shape-near characters whose image feature approximation rate meets the application requirement as the shape-near characters of the input element.
In another aspect, an embodiment of the present invention provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the above shape-near character recognition and determination method when executing the computer program.
In another aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above shape-near character recognition and determination method.
One of the above technical solutions has the following advantages and beneficial effects:
The sample image database, which holds massive knowledge data, is searched with the whole stroke order combination unit and the local stroke order combination units as search keywords to obtain matched stroke-order shape-near characters, and the image feature approximation rate between the input element and each retrieved stroke-order shape-near character is calculated. Stroke order codes and character image features are thus analyzed and judged in combination, so that characters that are close to the input element in both stroke order code and image features are presumed to be the shape-near characters of the input element. Shape-near characters can therefore be recognized and determined for dynamically input characters or character images, which effectively solves the difficult problem of recognizing and judging shape-near characters in dynamic writing. This overcomes two defects of the traditional technology: shape-near characters are easily missed in retrieval, and a static shape-near character library cannot meet the need to judge dynamically written characters. Shape-near character information can be inferred and recognized from the associated information of big data (such as image feature descriptors), which improves the recall and precision of shape-near character retrieval and the accuracy of character recognition.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a diagram of an application environment of the shape-near character recognition determination method in one embodiment;
FIG. 2 is a first schematic flow chart of the shape-near character recognition determination method in one embodiment;
FIG. 3 is a second schematic flow chart of the shape-near character recognition determination method in one embodiment;
FIG. 4 is a first exemplary character picture for obtaining the image feature approximation rate in one embodiment;
FIG. 5 is a second exemplary character picture for obtaining the image feature approximation rate in one embodiment;
FIG. 6 is a block diagram showing the structure of the shape-near character recognition determination apparatus according to an embodiment;
FIG. 7 is a diagram illustrating the internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In the traditional technology, shape-near characters are mostly collected manually, which involves a heavy workload and is time-consuming and laborious. When the characters to be compared are all printed, shape-near characters can be recognized from an established shape-near character library. When the characters to be compared are cursive or handwritten, however, the established library cannot cover every cursive or handwritten form, and recognizing and judging shape-near characters becomes difficult.
Two characters that are not shape-near when written in standard print may become shape-near when written in cursive or handwritten form, and, conversely, characters that are shape-near in standard writing may no longer be shape-near. The conventional techniques therefore have at least the following drawbacks: shape-near characters are easily missed in retrieval, and because the writing of characters is dynamic, a static shape-near character library cannot meet the requirement of shape-near character judgment.
The shape-near character recognition and determination method provided by the present application can be applied to the application environment shown in fig. 1. The terminal 102 may communicate with the server 104 through a network to obtain the input characters to be recognized, the input images converted from the input characters, the sample images, and the data of the sample image database. It should be noted that the terminal 102 need not communicate with the server 104: the terminal 102 may store the relevant data in advance or obtain it in real time, and then perform the recognition and determination itself. Alternatively, the terminal 102 may transmit data such as input characters acquired in real time to the server 104, which then performs the recognition and determination. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a shape-near character recognition and determination method is provided. Taking its application to the terminal in fig. 1 as an example, the method includes the following steps:
step 210, identifying the input element, extracting the associated information of the input element, and acquiring the characters corresponding to the input element;
the input element is an input image, or an image obtained by converting input characters into an image according to a preset writing font; the associated information of the input element includes an image feature descriptor, image feature descriptor minimum units, and combination unit data.
The associated information of the input element, such as the image feature descriptor, the image feature descriptor minimum units, and the combination unit data, can be extracted by using a prior art method.
Specifically, when the input element is in text form, the input text needs to be converted into an image. This is in effect the reverse of OCR: the machine-editable digital text is converted into a text image, so that the input text is fixed in a specific shape, yielding an image of the input text rendered in a preset writing font. The image feature descriptor, the image feature descriptor minimum units, and the combination unit data of the image can then be extracted from that specific shape. In one particular example, the preset writing fonts may include the Song typeface (Songti), the Hei typeface (Heiti), and other currently known fonts.
The image feature descriptor is a representation of image features that records the same perceptual content or features in images with the same or highly similar character strings, and records different perceptual content or features with different character strings. More specifically, the representation may be one or more sets of character strings describing the image features of the image to be processed. The image feature descriptor of the image to be processed can be extracted by using a prior art method.
The character strings of the image feature descriptor are generally used to represent feature points of the image, and the one or more character strings corresponding to each feature point may be referred to as an image feature descriptor minimum unit.
Specifically, the image feature descriptor generally describes a plurality of image feature points, so there may be a plurality of image feature descriptor minimum units. The image feature descriptor of the image to be processed may be segmented as follows: the image feature points represented by the image feature descriptor are separated, and the character string or character strings corresponding to each image feature point are regarded as one image feature descriptor minimum unit.
In a specific example, the image feature descriptor is a feature descriptor representing the correspondence between the position data of any pixel point on an image contour line or image skeleton line and a coordinate region of a standard coordinate system of any specification; the image feature descriptor minimum unit is the position data of the one or more pixel points of the image contour line or image skeleton line that correspond to any one coordinate region of that standard coordinate system.
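The specific example above can be sketched in Python as follows. This is only an illustration under stated assumptions: the "standard coordinate system" is taken to be a square grid laid over the image, each occupied grid region yields one minimum unit, and the function names, the grid size, and the `r{row}c{col}` string format are hypothetical choices that do not come from the patent.

```python
from collections import defaultdict


def descriptor_min_units(pixels, width, height, grid=4):
    """Group contour/skeleton pixels by the grid region they fall in.

    pixels: iterable of (x, y) integer positions on a contour or skeleton line.
    Returns {(row, col): [(x, y), ...]}; each entry is one minimum unit
    (assumed layout, not specified by the source).
    """
    units = defaultdict(list)
    for x, y in pixels:
        col = min(x * grid // width, grid - 1)
        row = min(y * grid // height, grid - 1)
        units[(row, col)].append((x, y))
    return dict(units)


def unit_strings(units):
    """Render each occupied region as a short string such as 'r0c2'."""
    return sorted(f"r{row}c{col}" for (row, col) in units)
```

For a vertical skeleton line at x = 8 in a 16 x 16 image, every pixel falls in grid column 2, so the descriptor strings are r0c2, r1c2, r2c2 and r3c2.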
Furthermore, an isolated image feature descriptor minimum unit may have no practical significance on its own. The minimum units are therefore combined according to a preset minimum unit combination rule to obtain combination units, so that each combination of minimum units carries a specific meaning. The preset image feature descriptor minimum unit combination rule can be established according to the application requirement.
It should be noted that the image feature descriptor minimum unit combination data in the present application may represent connected domain combination unit data or line segment combination unit data, or may be character string data used for storage.
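One possible combination rule, a connected domain grouping, can be sketched as below. The sketch assumes that minimum units are identified by (row, column) grid regions and that two regions belong to the same combination unit when they are 4-neighbours; both the representation and the neighbourhood rule are assumptions made for illustration, since the text leaves the rule to the application.

```python
def combine_connected(regions):
    """Merge grid regions into connected groups (4-neighbourhood).

    regions: iterable of (row, col) tuples, one per minimum unit.
    Returns a sorted list of groups, each a sorted list of regions.
    """
    remaining = set(regions)
    groups = []
    while remaining:
        start = remaining.pop()
        group = {start}
        stack = [start]
        while stack:
            row, col = stack.pop()
            for neighbour in ((row + 1, col), (row - 1, col),
                              (row, col + 1), (row, col - 1)):
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    group.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(group))
    return sorted(groups)
```

With this rule, the regions (0, 0), (0, 1) and (2, 2) form two combination units: one containing the two adjacent regions and one containing the isolated region.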
In a specific embodiment, before the step of identifying the input element, extracting the associated information of the input element, and acquiring the characters corresponding to the input element, the method may further include the step of:
establishing a sample image database.
In a specific embodiment, the step of establishing the sample image database includes:
performing image feature descriptor segmentation processing on a sample image to obtain each image feature descriptor minimum unit of the sample image; the image feature descriptor minimum unit is the one or more character strings corresponding to any image feature point represented by the image feature descriptor;
combining the image feature descriptor minimum units according to a preset minimum unit combination rule to obtain the combination unit data of the sample image;
and acquiring the whole stroke order combination unit and the local stroke order combination units of the sample characters corresponding to the sample image.
It should be noted that the sample image database records and stores each sample character and the associated information of its sample images. The sample images include: patterns formed by each Chinese character in various fonts, patterns formed by each non-Chinese character in various fonts, trademark patterns with character meaning, design patterns with character meaning, copyright-registered artwork patterns with character meaning, and user-defined custom images. The sample characters comprise Chinese characters and non-Chinese characters.
The sample image database comprises the sample characters, the whole stroke order combination unit and local stroke order combination units of each sample character, and the sample images corresponding to the sample characters; it also comprises, for each sample image, the corresponding sample character and the image feature descriptor, image feature descriptor minimum units, and combination unit data of the sample image.
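The contents of the sample image database listed above can be pictured with a small in-memory sketch. The class names, field names, and the choice of one record per (character, font) image are assumptions for illustration only; the patent does not prescribe a storage layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SampleRecord:
    """One sample image and its associated information (hypothetical layout)."""
    character: str                 # sample character shown in the image
    font: str                      # writing font of the sample image
    whole_stroke_unit: str         # whole stroke order combination unit
    local_stroke_units: List[str]  # local stroke order combination units
    descriptor: List[str] = field(default_factory=list)         # image feature descriptor
    min_units: List[str] = field(default_factory=list)          # descriptor minimum units
    combination_units: List[List[str]] = field(default_factory=list)


class SampleImageDatabase:
    """Stores records grouped by sample character."""

    def __init__(self) -> None:
        self._by_character: Dict[str, List[SampleRecord]] = {}

    def add(self, record: SampleRecord) -> None:
        self._by_character.setdefault(record.character, []).append(record)

    def records_for(self, character: str) -> List[SampleRecord]:
        return list(self._by_character.get(character, []))
```

A character rendered in several fonts would simply contribute several records, which matches the statement that sample images cover each character under various fonts.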
Specifically, where the input element is an image obtained by converting input text according to a preset writing font, the image feature descriptor, the image feature descriptor minimum units, and the combination unit data (as defined above) may be obtained based on the prior art.
Alternatively, the following approach may be employed:
firstly, searching the sample image database with the input characters as keywords to obtain matched sample characters; secondly, finding the associated information corresponding to the matched sample characters, including the sample images formed in various writing fonts and their image feature descriptors, image feature descriptor minimum units, and combination unit data; thirdly, using the obtained associated information of the sample characters as the associated information of the image.
In practical applications, such associated information is generally known and massive, and constitutes large-scale sample image data; all of it can serve as sample image data for the present application.
Generally, when the input element is in image form (i.e., the input element is an input image), the input image may be converted to text. The conversion from image to text can be realized with common OCR technology, which directly yields the characters corresponding to the input element.
In addition, when the input element is in image form, the characters corresponding to the input element can be acquired as follows:
firstly, searching the sample image database with the image feature descriptor, the image feature descriptor minimum units, and the combination unit data of the input image as keywords to obtain matched sample characters; secondly, taking the sample character with the highest matching degree as the input character corresponding to the input image.
Step 220, performing stroke order coding on the characters corresponding to the input element to obtain a whole stroke order combination unit.
Specifically, stroke order coding is performed on the characters to obtain the whole stroke order combination unit. In stroke order coding, the strokes of Chinese characters are divided into 5 types, namely horizontal, vertical, left-falling, right-falling, and turning strokes, which are represented by 1, 2, 3, 4, and 5 respectively, or by letters or symbols. The stroke order character string formed by coding a character's strokes in their writing order is called the whole stroke order combination unit. The stroke order is the order in which the strokes of a character are written, and a stroke generally refers to one of the uninterrupted dots or lines of a particular shape that make up a character.
In a specific embodiment, the whole stroke order combination unit is a stroke order character string formed by coding according to the standard stroke writing order; the stroke order character string may be a stroke order numeric string, a stroke order alphabetic string, or a stroke order symbol string.
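A minimal Python sketch of the five-class stroke order coding follows. The small `STROKE_ORDER` table is an illustrative stand-in for a full stroke order dictionary, which the text does not supply, and the function name is hypothetical. The numeric codes 1 to 5 follow the classes named above.

```python
# Numeric code for each of the five stroke classes described in the text.
STROKE_CLASS = {
    "horizontal": "1",
    "vertical": "2",
    "left-falling": "3",
    "right-falling": "4",
    "turning": "5",
}

# Illustrative stroke orders for a few characters (a real system would
# load a complete stroke order dictionary; this table is an assumption).
STROKE_ORDER = {
    "十": ["horizontal", "vertical"],
    "土": ["horizontal", "vertical", "horizontal"],
    "士": ["horizontal", "vertical", "horizontal"],
    "大": ["horizontal", "left-falling", "right-falling"],
    "木": ["horizontal", "vertical", "left-falling", "right-falling"],
    "口": ["vertical", "turning", "horizontal"],
}


def encode_stroke_order(character):
    """Return the whole stroke order combination unit as a numeric string."""
    return "".join(STROKE_CLASS[stroke] for stroke in STROKE_ORDER[character])
```

Under this coding 木 becomes 1234, while 土 and 士 both become 121: the stroke code alone cannot separate them, which is why the later image feature comparison is needed.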
Step 230, performing stroke order word segmentation processing on the whole stroke order combination unit to obtain local stroke order combination units; the stroke order word segmentation processing comprises segmentation according to a preset segmentation rule and combination according to a preset combination rule.
Specifically, stroke order word segmentation processing is performed on the stroke order character string to obtain local stroke order combination units that are segmented and combined according to the preset rules. The processing may include: segmenting the stroke order character string into the strokes of the minimum continuous stroke unit, and combining the stroke codes of the minimum continuous stroke units.
In a specific embodiment, segmentation according to the preset segmentation rule proceeds as follows. The preset segmentation rule is established: the stroke order code corresponding to a stroke of the minimum continuous stroke unit in the whole stroke order combination unit is taken as the segmentation unit. Each stroke order code in the stroke order character string is then identified and separated, and the order of the stroke order codes is recorded.
The step of combining according to the preset combination rule may include: combining the stroke codes of the minimum continuous stroke units according to the preset combination rule to obtain the local stroke order combination units; the local stroke order combination units comprise an integral combination unit and local part combination units.
In a specific embodiment, the preset combination rule comprises: confirming the whole stroke order code of any character in the whole stroke order combination unit as the integral combination unit of that character, and confirming a combination of a preset number of strokes of any character in the whole stroke order combination unit as a local part combination unit of that character. Further, the preset number of strokes is greater than or equal to 2.
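The segmentation and combination rules can be sketched as follows. The sketch assumes that each stroke code is one character of the stroke order string and that a local part combination unit is a run of consecutive strokes of at least the preset length; the text does not say whether non-consecutive combinations are also allowed, so contiguity is an assumption, as are the function names.

```python
def segment(code):
    """Split a stroke order string into single stroke codes, keeping order."""
    return list(code)


def local_stroke_units(code, min_strokes=2):
    """Build the local stroke order combination units for one character.

    Returns a dict with the integral combination unit (the whole code)
    and the local part combination units (assumed: contiguous runs of
    min_strokes or more strokes, shorter than the whole code).
    """
    strokes = segment(code)
    parts = []
    for length in range(min_strokes, len(strokes)):
        for start in range(len(strokes) - length + 1):
            unit = "".join(strokes[start:start + length])
            if unit not in parts:
                parts.append(unit)
    return {"integral": code, "parts": parts}
```

For the code 1234 this yields the integral unit 1234 and the local part units 12, 23, 34, 123 and 234.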
Step 240, searching the sample image database with the whole stroke order combination unit and the local stroke order combination units as keywords to obtain matched sample characters, and acquiring the associated information of the sample images corresponding to the sample characters; the associated information of a sample image comprises the image feature descriptor, the image feature descriptor minimum units, and the combination unit data of the sample image.
Specifically, the whole stroke order combination unit and the local stroke order combination units are used as search keywords to search the sample image database, yielding the matched sample characters and the associated information of the sample images corresponding to them. The sample characters are regarded as stroke-order shape-near characters of the input characters, and the associated information of the sample images is regarded as the associated information of the stroke-order shape-near characters.
Matching means that the search keyword is identical to a whole stroke order combination unit or a local stroke order combination unit recorded in the sample image database. After the sample image database is searched with the whole stroke order combination unit and the local stroke order combination units as keywords, the records of the matched whole and local stroke order combination units are obtained. From these records, the associated information of the sample images corresponding to the sample characters can be obtained, including the image feature descriptor, the image feature descriptor minimum units, and the combination unit data of each sample image.
Through the processing of the above steps, the sample image database records, for each whole or local stroke order combination unit, the corresponding sample image, the corresponding sample character, and the image feature descriptor, image feature descriptor minimum units, and combination unit data of the sample image. After the sample image database is searched, the associated information corresponding to a stroke order unit can thus be obtained indirectly.
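The keyword search in step 240 can be pictured as an exact-match lookup in an inverted index from stroke order units to sample characters. The sketch below is an assumption-laden illustration: the index structure, the helper `stroke_keys`, and the use of contiguous runs of two or more strokes as local units are hypothetical, and the sample codes are the illustrative five-class codes for four characters.

```python
from collections import defaultdict


def stroke_keys(code, min_strokes=2):
    """Whole code plus contiguous local units of at least min_strokes strokes."""
    keys = {code}
    for length in range(min_strokes, len(code)):
        for start in range(len(code) - length + 1):
            keys.add(code[start:start + length])
    return keys


def build_index(samples):
    """samples: {character: whole stroke order code}. Returns key -> characters."""
    index = defaultdict(set)
    for character, code in samples.items():
        for key in stroke_keys(code):
            index[key].add(character)
    return index


def search(index, query_code):
    """Return every sample character sharing at least one stroke order unit."""
    hits = set()
    for key in stroke_keys(query_code):
        hits |= index.get(key, set())
    return hits
```

With samples 土 = 121, 士 = 121, 木 = 1234 and 大 = 134, a query for 121 returns 土, 士 and 木 (木 shares the local unit 12) but not 大. The candidate set is deliberately broad; narrowing it is the job of the image feature approximation rate in the next step.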
Step 250, confirming the sample characters as stroke order-shaped near characters of the input elements, and confirming the associated information of the sample images as the associated information of the stroke order-shaped near characters; and comparing the image characteristics of the input elements and the sample image according to the associated information of the input elements and the associated information of the stroke order-shaped similar words to obtain the image characteristic approximation rate of the stroke order-shaped similar words.
In a specific embodiment, the step of comparing the image features of the input element and the sample image according to the association information of the input element and the association information of the stroke-order-shaped near word to obtain the image feature approximation rate of the stroke-order-shaped near word includes:
acquiring the image feature descriptor minimum unit matching rate and the image feature descriptor minimum unit mismatching rate between the input element and the stroke order-shaped near character;
and obtaining the image feature approximation rate from the image feature descriptor minimum unit matching rate and the image feature descriptor minimum unit mismatching rate.
Specifically, obtaining the image feature approximation rate may include: for each retrieved sample character that forms a stroke order-shaped near character, calculating the image feature descriptor minimum unit matching rate and the image feature descriptor minimum unit mismatching rate between each sample image of that sample character and the input element; and obtaining the image feature approximation rate from these two rates.
The image feature approximation rate is the image feature descriptor minimum unit matching rate minus the weighted image feature descriptor minimum unit mismatching rate. The image feature descriptor minimum unit matching rate is the rate at which the image feature descriptor minimum units of the input element match those of the stroke order-shaped near character; the image feature descriptor minimum unit mismatching rate is the rate at which the image feature descriptor minimum units of the input element do not match those of the stroke order-shaped near character.
In practical applications, on the one hand, characters with the same stroke order are not necessarily shape-near characters. On the other hand, conventional shape-near character determination is limited to characters of the same type. In this method, the image feature descriptor minimum unit matching rate and mismatching rate between the input element and the sample image corresponding to each retrieved sample character are calculated, which provides effective information for judging whether two characters are shape-near characters and thereby prevents missed detections in shape-near character retrieval.
In a specific embodiment, the step of obtaining the minimum unit matching rate of the image feature descriptor and the minimum unit non-matching rate of the image feature descriptor of the input element and the stroke order shape near word comprises:
acquiring the total number of image feature descriptor minimum units of the input element, the total number of image feature descriptor minimum units of the stroke order-shaped near character that match the input element, and the total number of image feature descriptor minimum units of the stroke order-shaped near character that do not match the input element;
obtaining the minimum unit matching rate of the image feature descriptors based on the following formula:
$M_a = (U_a \div U_0) \times 100\%$
where $M_a$ represents the image feature descriptor minimum unit matching rate, $U_0$ represents the total number of image feature descriptor minimum units of the input element, and $U_a$ represents the total number of image feature descriptor minimum units of the stroke order-shaped near character that match the input element;
obtaining the image feature descriptor minimum unit mismatching rate based on the following formula:
$M_i = (U_c \div U_0) \times 100\% + (n - 1) \times \omega$
where $M_i$ represents the image feature descriptor minimum unit mismatching rate, $U_0$ represents the total number of image feature descriptor minimum units of the input element, $U_c$ represents the total number of image feature descriptor minimum units of the stroke order-shaped near character that do not match the input element, $n$ represents the number of positions on the image feature descriptor minimum unit combination connecting line at which the stroke order-shaped near character does not match the input element, and $\omega$ represents the weight of a position, with $\omega \le 90\%$.
It should be noted that the image feature descriptor minimum unit combination connecting line refers to an image feature line.
In a specific embodiment, the step of obtaining the image feature approximation rate according to the image feature descriptor minimum unit matching rate and the image feature descriptor minimum unit mismatching rate includes:
based on the following formula, the image feature approximation rate is obtained:
$M = M_a - M_i \times \beta$
where $M$ represents the image feature approximation rate, $M_a$ represents the image feature descriptor minimum unit matching rate, $M_i$ represents the image feature descriptor minimum unit mismatching rate, and $\beta$ represents the weight of $M_i$, with $\beta \le 90\%$.
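The three formulas above can be sketched in code as follows. This is only an illustrative sketch: the function names are hypothetical, rates are held as fractions (0.8 for 80%), and the range checks on the weights simply mirror the stated upper bound of 90%. The formulas are applied exactly as written, so no extra constraint on n is assumed.

```python
def match_rate(u_a: int, u_0: int) -> float:
    """M_a = U_a / U_0, returned as a fraction (0.8 means 80%)."""
    if u_0 <= 0:
        raise ValueError("u_0 must be positive")
    return u_a / u_0


def mismatch_rate(u_c: int, u_0: int, n: int, omega: float) -> float:
    """M_i = U_c / U_0 + (n - 1) * omega, with omega <= 0.9."""
    if u_0 <= 0:
        raise ValueError("u_0 must be positive")
    if omega > 0.9:
        raise ValueError("omega must not exceed 90%")
    return u_c / u_0 + (n - 1) * omega


def approximation_rate(m_a: float, m_i: float, beta: float) -> float:
    """M = M_a - M_i * beta, with beta <= 0.9."""
    if beta > 0.9:
        raise ValueError("beta must not exceed 90%")
    return m_a - m_i * beta


# Hypothetical counts: 20 minimum units in the input, 16 matched, 4 unmatched,
# 2 unmatched positions on the feature lines.
m_a = match_rate(16, 20)
m_i = mismatch_rate(4, 20, n=2, omega=0.10)
m = approximation_rate(m_a, m_i, beta=0.50)
```

The counts and weights in the usage lines are made up for illustration; the source does not give concrete values for them.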
Step 260, confirming the stroke order-shaped near characters whose image feature approximation rate meets the application requirements as the shape-near characters of the input element.
Specifically, after the foregoing steps, the search results may be sorted by image feature approximation rate in descending order; stroke order-shaped near characters with a higher image feature approximation rate are more likely to be shape-near characters. In practical applications, the stroke order-shaped near characters whose image feature approximation rate meets the application requirements can be selected and presumed to be shape-near characters of the input element.
In a specific embodiment, the step of confirming the stroke order-shaped near word with the image feature approximation rate meeting the application requirement as the shape-shaped near word of the input element further comprises the following steps:
selecting a stroke order-shaped near character with the image feature descriptor minimum unit matching rate larger than the matching rate threshold value and the image feature descriptor minimum unit mismatching rate smaller than the mismatching rate threshold value;
the step of confirming the stroke order shape near character with the image characteristic approximation rate meeting the application requirement as the shape near character of the input element comprises the following steps:
sorting the stroke order-shaped near characters by image feature approximation rate, and confirming the characters corresponding to the stroke order-shaped near characters that meet the preset ranking as the shape-near characters of the input element.
In a specific embodiment, the matching rate threshold is 30%; the mismatching rate threshold is 70%; the preset ranking is less than 300.
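The selection just described (threshold filtering followed by ranking) might look like the sketch below. The `Candidate` record and the function name are hypothetical, and reading "ranking less than 300" as "1-based rank strictly below 300" is an assumption, not something the text states.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One stroke order-shaped near character with its precomputed rates (fractions)."""
    char: str
    match_rate: float
    mismatch_rate: float
    approximation_rate: float


def select_shape_near(candidates, match_threshold=0.30,
                      mismatch_threshold=0.70, rank_limit=300):
    # Keep candidates whose matching rate exceeds the threshold and whose
    # mismatching rate stays below the threshold.
    kept = [c for c in candidates
            if c.match_rate > match_threshold
            and c.mismatch_rate < mismatch_threshold]
    # Sort by image feature approximation rate, highest first.
    kept.sort(key=lambda c: c.approximation_rate, reverse=True)
    # Assumption: 1-based ranks strictly below rank_limit are retained.
    return kept[:rank_limit - 1]


candidates = [
    Candidate("A", 0.80, 0.30, 0.65),
    Candidate("B", 0.20, 0.10, 0.15),  # dropped: matching rate too low
    Candidate("C", 0.90, 0.80, 0.50),  # dropped: mismatching rate too high
    Candidate("D", 0.60, 0.20, 0.50),
]
selected = [c.char for c in select_shape_near(candidates)]
```

The candidate labels and rates are placeholders chosen only to exercise both filters.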
In the above shape-near word recognition determination method, based on a large amount of knowledge data (for example, the sample image database), the whole stroke order combination unit and the local stroke order combination units are used as search keywords to search the sample image database and obtain the sample characters matching those units together with their associated information. The sample characters are regarded as stroke order-shaped near characters of the input characters. Image feature approximation rate evaluation is then performed to obtain the image feature approximation rate between each retrieved stroke order-shaped near character and the input characters; the stroke order-shaped near characters whose image feature approximation rate meets the application requirements are selected, and the characters corresponding to them are confirmed as shape-near characters of the input characters. The stroke order codes and the image features of the characters are thus analyzed and judged in combination, and the characters that satisfy both are presumed to be shape-near characters of the input element.
The method thereby performs shape-near recognition and determination on dynamically input characters or character graphics and effectively addresses the difficulty of shape-near recognition and determination for dynamically written characters. It overcomes two defects of conventional techniques: shape-near characters are easily missed during retrieval, and a static shape-near character library cannot meet the needs of shape-near determination for dynamically written characters. Shape-near characters can be presumed and recognized through the associated information of big data (such as image feature descriptors), which improves the recall and precision of shape-near character retrieval and the accuracy of character recognition.
In one embodiment, as shown in fig. 3, a shape-near word recognition determination method is provided; this embodiment elaborates on the above embodiments. Taking the application of the method to the terminal in fig. 1 as an example, the method includes the following steps:
step S310, establishing a sample image database;
step S320, identifying the input element, extracting the associated information of the input element, and acquiring the characters corresponding to the input element; the input element is an input image or an image obtained by performing image conversion on input characters according to a preset writing font; the associated information of the input element comprises an image feature descriptor, an image feature descriptor minimum unit and combination unit data;
step S330, performing stroke order coding on the characters corresponding to the input element to obtain a whole stroke order combination unit;
step S340, performing stroke order word segmentation processing on the whole stroke order combination unit to obtain a local stroke order combination unit; the stroke order word segmentation processing comprises segmentation processing according to a preset segmentation rule and combination processing according to a preset combination rule;
step S350, the whole stroke order combination unit and the local stroke order combination unit are used as keywords, a sample image database is searched, matched sample characters are obtained, and associated information of a sample image corresponding to the sample characters is obtained; the associated information of the sample image comprises an image feature descriptor, an image feature descriptor minimum unit and combination unit data of the sample image;
step S360, confirming the sample characters as stroke order-shaped near characters of the input elements, and confirming the associated information of the sample images as the associated information of the stroke order-shaped near characters; comparing the image characteristics of the input elements and the sample image according to the associated information of the input elements and the associated information of the stroke order-shaped similar words to obtain an image characteristic approximation rate of the stroke order-shaped similar words;
and step S370, confirming the stroke order shape near characters with the image characteristic approximation rate meeting the application requirements as the shape near characters of the input elements.
Specifically, when the sample image database is established in step S310, the characters corresponding to each sample image, together with the image feature descriptors of the sample image, the image feature descriptor minimum units and their combination unit data, may also be recorded; wherein the sample images include: the pattern of each Chinese character in various fonts, the pattern of each non-Chinese character in various fonts, trademark patterns carrying a character meaning, design patterns carrying a character meaning, copyright-registered artwork patterns carrying a character meaning, and images preset and defined by the user;
the sample characters may include: Chinese characters and non-Chinese characters;
the specific implementation of steps S320 to S370 can refer to the specific description of steps S210 to S260.
As shown in fig. 3, the present embodiment will be described with reference to a specific example.
Step 1, establishing a sample image database, and extracting and recording sample characters corresponding to a sample image, and data information of an image feature descriptor, an image feature descriptor minimum unit and a combination unit of the image feature descriptor minimum unit of the sample image; extracting and recording a sample image corresponding to the sample characters, an integral stroke order combination unit and a local stroke order combination unit of the sample characters;
in practical applications, this data is generally known and massive and constitutes large-scale sample image data; any of it may serve as sample image data depending on the application.
Step 2, performing image conversion on the input characters according to a preset specific writing font to obtain an image corresponding to the input characters, and extracting the image feature descriptors of the image, the image feature descriptor minimum units and their combination unit data;
alternatively, directly extracting the image feature descriptors, the image feature descriptor minimum units and their combination unit data from the input image.
Step 3, performing stroke order coding on the input characters (or the characters corresponding to the input image) to obtain a whole stroke order combination unit. Stroke order coding divides the strokes used in writing Chinese characters into 5 types, namely horizontal, vertical, left-falling, right-falling and turning, represented respectively by 1, 2, 3, 4 and 5 (or by letters or symbols), and encodes them in the stroke writing order of the character to form a stroke order character string, which may be a stroke order numeric string, a stroke order letter string or a stroke order symbol string;
the stroke order refers to the sequence in which the strokes of a character are written, where a stroke generally refers to an uninterrupted dot or line of a specific shape that forms part of the character. Taking Chinese characters as an example, thirty kinds of strokes exist according to statistics, but the most basic strokes can be divided into five types: horizontal, vertical, left-falling, right-falling (dot) and turning, each being a minimum continuous stroke unit, that is, a minimum structural unit of the Chinese character glyph. Other strokes are classified into one of these types: for example, "lifting" is classified as horizontal; "dot" is classified as right-falling; "vertical hook", "vertical lifting" and "hook" are classified as vertical; and compound strokes such as "horizontal bending", "horizontal bending hook", "horizontal left-falling", "horizontal hook", "vertical bending", "vertical bending hook", "oblique hook", "horizontal left-falling hook", "horizontal oblique hook" and "horizontal bending left-falling" are classified as turning.
The stroke order code is a group of codes preset for recording specific strokes and writing sequences in characters, and in one embodiment, 1 represents horizontal, 2 represents vertical, 3 represents left falling, 4 represents right falling and 5 represents folding. Specifically, the stroke order coding is to divide strokes written by Chinese characters into 5 types of horizontal, vertical, left falling, right falling and turning strokes and respectively represent the strokes by 1, 2, 3, 4 and 5 or letters or symbols, and the strokes are coded according to the stroke writing order of Chinese characters to form a stroke order character string, wherein the stroke order character string comprises: stroke order numeric string or stroke order letter string or stroke order symbol string.
Taking the character "city" as an example, the character has 9 strokes in total, and the corresponding stroke order code is 121135534.
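As a minimal sketch of the numeric coding, the snippet below maps the five stroke types to the digits 1 to 5 and joins them in writing order. The identifier names are hypothetical, and the stroke sequence for the "city" example is not quoted from the text: it is inferred by reading the code 121135534 back through the mapping.

```python
# Five basic stroke types and their numeric codes, as given in the description.
STROKE_CODES = {
    "horizontal": "1",
    "vertical": "2",
    "left_falling": "3",
    "right_falling": "4",
    "turning": "5",
}


def encode_stroke_order(strokes):
    """Join the codes of the strokes in writing order into a stroke order string."""
    return "".join(STROKE_CODES[s] for s in strokes)


# Assumption: stroke sequence reconstructed from the code 121135534.
CITY_STROKES = [
    "horizontal", "vertical", "horizontal", "horizontal", "left_falling",
    "turning", "turning", "left_falling", "right_falling",
]

city_code = encode_stroke_order(CITY_STROKES)
```

A letter or symbol alphabet could be substituted by changing only the values of the mapping.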
Step 4, performing stroke order word segmentation processing on the whole stroke order combination unit (namely a stroke order character string) to obtain a local stroke order combination unit which is segmented and combined according to a preset segmentation rule;
the stroke order word segmentation processing of the stroke order character string includes: segmenting the stroke order character string into minimum continuous stroke units, and combining the stroke codes of the minimum continuous stroke units.
The processing is explained as follows:
First, the stroke order character string is segmented into minimum continuous stroke units.
The segmentation identifies each minimum continuous stroke unit in the stroke order character string and splits the string at those units.
Taking the above "city" character as an example, the stroke order code "121135534" corresponding to its 9 strokes is segmented to obtain the stroke codes of the minimum continuous stroke units: 1, 2, 1, 1, 3, 5, 5, 3, 4.
Second, the stroke codes of the minimum continuous stroke units are combined.
The stroke codes of the minimum continuous stroke units are combined according to a preset combination rule to obtain the local stroke order combination units, where a local stroke order combination unit represents, by its stroke order code, a character component formed by some local run of strokes of the character together with the stroke order of that component.
The preset combination rule comprises the following steps:
1) the whole stroke order code of each character is regarded as the whole combination unit of the character;
2) each combination of consecutive strokes of a character that contains at least the preset stroke number of strokes is regarded as a local component combination unit of the character, wherein the preset stroke number is equal to or greater than 2;
taking the above "city" character as an example, the whole combination unit of the character is: 121135534;
assuming that the preset stroke number is 3, the local component combination units of the character are:
121135534;
12113553;
1211355;
121135;
12113;
1211;
121;
21135534;
2113553;
211355;
21135;
2113;
211;
1135534;
113553;
11355;
1135;
113;
135534;
13553;
1355;
135;
35534;
3553;
355;
5534;
553;
534.
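The list above is every run of consecutive stroke codes containing at least 3 strokes. A short sketch that reproduces it, in the same order, is given below; the function name and the `min_strokes` parameter are hypothetical, and no de-duplication is performed because this example contains no repeated runs.

```python
def local_stroke_order_units(code: str, min_strokes: int = 3) -> list[str]:
    """Enumerate all runs of consecutive strokes with at least min_strokes strokes.

    Runs are listed by start position, longest first, matching the order
    of the enumerated example.
    """
    units = []
    for start in range(len(code) - min_strokes + 1):
        for end in range(len(code), start + min_strokes - 1, -1):
            units.append(code[start:end])
    return units


units = local_stroke_order_units("121135534", min_strokes=3)
```

For a 9-stroke code and a preset stroke number of 3 this yields 7 + 6 + 5 + 4 + 3 + 2 + 1 = 28 units, the first being the whole code and the last being 534.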
Step 5, taking the whole stroke order combination unit and the local stroke order combination units as search keywords, searching the sample image database, and obtaining the matched stroke order-shaped near characters and their associated information;
the matched stroke order-shaped near characters are the matched sample characters; the associated information of a stroke order-shaped near character refers to the image feature descriptors, the image feature descriptor minimum units and the combination unit data of the sample image corresponding to the sample character;
matching means that the search keyword is identical to a whole stroke order combination unit or a local stroke order combination unit recorded in the sample image database.
The sample image database is searched with the whole stroke order combination unit and the local stroke order combination units as search keywords, the matched sample characters and their associated information are obtained, and the sample characters are regarded as stroke order-shaped near characters of the input characters.
In the sample image database, through the processing of the above steps, each whole or local stroke order combination unit is recorded together with the associated information of its sample characters, including: the sample image, the corresponding characters, the image feature descriptors of the sample image, the image feature descriptor minimum units and their combination unit data. After the sample image database is searched, the associated information corresponding to a stroke order unit can therefore be obtained indirectly.
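One way to picture this retrieval is an inverted index from stroke order units to sample characters, sketched below. The in-memory dictionary stands in for the sample image database, and the sample labels, their codes and all function names are hypothetical; the associated image feature data is omitted for brevity.

```python
from collections import defaultdict


def stroke_units(code: str, min_strokes: int = 3) -> set[str]:
    """Whole code plus every run of at least min_strokes consecutive strokes."""
    return {code[i:j]
            for i in range(len(code))
            for j in range(i + min_strokes, len(code) + 1)}


def build_index(samples: dict[str, str]) -> dict[str, set[str]]:
    """Record, for each stroke order unit, the sample characters that contain it."""
    index = defaultdict(set)
    for char, code in samples.items():
        for unit in stroke_units(code):
            index[unit].add(char)
    return index


def retrieve(index, input_code: str) -> set[str]:
    """Return sample characters sharing at least one identical unit with the input."""
    hits = set()
    for unit in stroke_units(input_code):
        hits |= index.get(unit, set())
    return hits


# Hypothetical sample database: label -> stroke order code.
samples = {"sample_A": "12113", "sample_B": "2534", "sample_C": "4444"}
index = build_index(samples)
matched = retrieve(index, "121135534")
```

Here `sample_A` shares units such as 121 with the input and `sample_B` shares 534, while `sample_C` shares none; the matched characters would then go on to the image feature comparison.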
Step 6, evaluating the stroke order-shaped near characters to obtain the image feature approximation rate between each retrieved sample character that forms a stroke order-shaped near character and the input characters;
in practice, characters with the same stroke order are not necessarily shape-near characters. Taking fig. 4 and fig. 5 as examples: although both characters have 4 strokes and both have the stroke order code 2534, their glyphs are obviously different and they should not be treated as shape-near characters; judging shape-near characters by the stroke order code alone would therefore produce errors.
On the other hand, conventional shape-near character determination is limited to characters of the same type, whereas in fact shape-near characters can also occur between characters of different types. For example: the Chinese character "kou" (口) and the English letter "O" or the symbol "□" can form shape-near characters; a T-shaped Chinese character and the capital English letter T can likewise form shape-near characters, and so on.
In practical applications, to prevent missed detections in shape-near character retrieval, the image feature approximation rate between the input image corresponding to the input characters and the sample image corresponding to each retrieved sample character (that is, the image feature approximation rate between the input element and the stroke order-shaped near character) is calculated, which provides effective information for judging whether the two characters are shape-near characters.
The method for acquiring the image feature approximation rate between the input element and the stroke order form near word can comprise the following steps:
the image feature approximation rate is the image feature descriptor minimum unit matching rate minus the weighted image feature descriptor minimum unit mismatching rate. The image feature descriptor minimum unit matching rate is the rate at which the image feature descriptor minimum units of the input element match those of the stroke order-shaped near character; the image feature descriptor minimum unit mismatching rate is the rate at which the image feature descriptor minimum units of the input element do not match those of the stroke order-shaped near character.
Based on the following formulas, an image feature descriptor minimum unit matching rate, an image feature descriptor minimum unit mismatching rate, and an image feature approximation rate can be obtained.
1. Image feature descriptor minimum unit matching rate:
$M_a = (U_a \div U_0) \times 100\%$
where $M_a$ represents the image feature descriptor minimum unit matching rate, $U_0$ represents the total number of image feature descriptor minimum units of the input element, and $U_a$ represents the total number of image feature descriptor minimum units in the sample image that match the input image;
2. image feature descriptor minimum unit mismatch rate:
$M_i = (U_c \div U_0) \times 100\% + (n - 1) \times \omega$
where $M_i$ represents the image feature descriptor minimum unit mismatching rate, $U_0$ represents the total number of image feature descriptor minimum units of the input element, $U_c$ represents the total number of image feature descriptor minimum units in the sample image that do not match the input image, $n$ represents the number of positions on the image feature descriptor minimum unit combination connecting line at which the sample image does not match the input image, and $\omega$ represents the weight of a position, with $\omega \le 90\%$;
3. image feature approximation rate:
$M = M_a - M_i \times \beta$
where $M$ represents the image feature approximation rate and $\beta$ represents the weight of $M_i$, with $\beta \le 90\%$.
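As a worked illustration of the three formulas, assume purely hypothetical values: the input element has $U_0 = 20$ minimum units, of which $U_a = 16$ match and $U_c = 4$ do not match the sample image, with $n = 2$ unmatched positions, $\omega = 10\%$ and $\beta = 50\%$. None of these numbers come from the description.

```latex
\begin{align*}
M_a &= (16 \div 20) \times 100\% = 80\% \\
M_i &= (4 \div 20) \times 100\% + (2 - 1) \times 10\% = 30\% \\
M   &= M_a - M_i \times \beta = 80\% - 30\% \times 50\% = 65\%
\end{align*}
```

With the typical presets mentioned for this embodiment, such a candidate would pass all three rate conditions, since its matching rate exceeds 30%, its mismatching rate is below 70%, and its approximation rate exceeds 30%.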
Step 7, selecting the stroke order-shaped near characters whose image feature approximation rate meets the application requirements, and presuming the selected stroke order-shaped near characters to be shape-near characters of the input element.
After the calculation, the search results can be sorted by image feature approximation rate in descending order; characters with a higher image feature approximation rate are more likely to be shape-near characters.
In practical applications, the stroke order-shaped near characters whose image feature approximation rate meets a preset threshold can be selected.
In practical applications, a preset image feature descriptor minimum unit matching rate, a preset image feature descriptor minimum unit mismatching rate, a preset image feature approximation rate and a preset ranking can be set according to application requirements. Generally, the preset matching rate is greater than 30%, the preset mismatching rate is less than 70%, the preset image feature approximation rate is greater than 30%, and the preset ranking is less than 300.
The preset ranking is the rank in the ordering of the matched sample images by image feature approximation rate. The characters corresponding to the stroke order-shaped near characters that meet the preset ranking are presumed to be shape-near characters of the input element.
It should be understood that, although the steps in the flow charts of fig. 2-3 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2-3 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided a near word recognition determining apparatus including:
the identification extraction module 610 is configured to identify an input element, extract associated information of the input element, and obtain a character corresponding to the input element; the input element is an input image or an image obtained by performing image conversion on input characters according to a preset writing font; the associated information of the input element comprises an image feature descriptor, an image feature descriptor minimum unit and combination unit data;
the encoding module 620 is configured to perform stroke order encoding on the characters corresponding to the input elements to obtain an overall stroke order combination unit;
a word segmentation module 630, configured to perform word segmentation processing on the entire stroke order combination unit to obtain a local stroke order combination unit; the stroke order word segmentation processing comprises segmentation processing according to a preset segmentation rule and combination processing according to a preset combination rule;
the retrieval module 640 is configured to retrieve the sample image database by using the whole stroke order combination unit and the local stroke order combination unit as keywords, obtain matched sample characters, and obtain associated information of a sample image corresponding to the sample characters; the associated information of the sample image comprises an image feature descriptor, an image feature descriptor minimum unit and combination unit data of the sample image;
an image characteristic approximation rate obtaining module 650, configured to determine the sample characters as stroke order-shaped near characters of the input element, and determine the associated information of the sample image as associated information of the stroke order-shaped near characters; comparing the image characteristics of the input elements and the sample image according to the associated information of the input elements and the associated information of the stroke order-shaped similar words to obtain an image characteristic approximation rate of the stroke order-shaped similar words;
the selecting module 660 is configured to determine the stroke order-shaped near character with the image feature approximation rate meeting the application requirement as the shape-shaped near character of the input element.
For specific limitations of the shape-near word recognition determination apparatus, reference may be made to the above limitations of the shape-near word recognition determination method, which are not repeated here. All or part of the modules in the apparatus can be implemented by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server or a terminal, and its internal structure may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores data such as input characters, input images converted from the input characters, sample images, and a sample image database. The network interface of the computer device communicates with an external terminal through a network connection. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, a trackball, or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, or mouse. The computer program is executed by the processor to implement the shape near word recognition determination method.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
identifying the input elements, extracting the associated information of the input elements, and acquiring characters corresponding to the input elements; the input element is an input image or an image obtained by performing image conversion on input characters according to a preset writing font; the associated information of the input element comprises an image feature descriptor, an image feature descriptor minimum unit and combination unit data;
performing stroke order coding on characters corresponding to the input elements to obtain an integral stroke order combination unit;
performing stroke order word segmentation processing on the whole stroke order combination unit to obtain a local stroke order combination unit; the stroke order word segmentation processing comprises segmentation processing according to a preset segmentation rule and combination processing according to a preset combination rule;
searching a sample image database by taking the whole stroke order combination unit and the local stroke order combination unit as keywords to obtain matched sample characters, and acquiring associated information of a sample image corresponding to the sample characters; the associated information of the sample image comprises an image feature descriptor, an image feature descriptor minimum unit and combination unit data of the sample image;
confirming the sample characters as stroke order shape near words of the input elements, and confirming the associated information of the sample images as the associated information of the stroke order shape near words; comparing the image features of the input elements and the sample images according to the associated information of the input elements and the associated information of the stroke order shape near words to obtain an image feature approximation rate of each stroke order shape near word;
and confirming the stroke order shape near words whose image feature approximation rate meets the application requirement as the shape near words of the input elements.
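The stroke order coding and stroke order word segmentation steps above can be illustrated with a minimal sketch. It rests on several assumptions that the source does not fix: the five-category numeric stroke code (1 horizontal, 2 vertical, 3 left-falling, 4 dot or right-falling, 5 turning), the small `STROKE_ORDER` table, the function names, and the choice to treat every contiguous run of at least two strokes as a local stroke order combination unit.

```python
# Hypothetical stroke codes (an assumption, not specified by the source):
# 1 = horizontal, 2 = vertical, 3 = left-falling, 4 = dot / right-falling, 5 = turning.
STROKE_ORDER = {
    "土": "121",
    "王": "1121",
    "玉": "11214",
}


def whole_unit(char: str) -> str:
    """Return the whole stroke order combination unit (a stroke order string)."""
    return STROKE_ORDER[char]


def local_units(code: str, min_strokes: int = 2) -> list[str]:
    """Return contiguous stroke runs of at least `min_strokes` strokes that are
    shorter than the whole code, without duplicates, in first-seen order."""
    if min_strokes < 2:
        raise ValueError("the preset stroke number must be >= 2")
    seen: dict[str, None] = {}
    for size in range(min_strokes, len(code)):
        for start in range(len(code) - size + 1):
            seen.setdefault(code[start:start + size], None)
    return list(seen)


if __name__ == "__main__":
    code = whole_unit("王")
    print(code, local_units(code))
```

The whole unit and the local units produced this way would then serve as the retrieval keywords; a real system would presumably use a complete stroke order dictionary and the preset segmentation and combination rules rather than plain sliding windows.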
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned method for determining near-word recognition.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above embodiments express only several implementations of the present application, and although their description is specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (12)

1. A method for identifying and judging a shape near word is characterized by comprising the following steps:
acquiring an input element; the input element is an input image or an input character; under the condition that the input element is an input character, performing image conversion on the input character according to a preset writing font to obtain an input image;
identifying the input elements, extracting the associated information of the input elements, and acquiring characters corresponding to the input elements; the associated information of the input elements comprises image feature descriptors, minimum units of the image feature descriptors and combination unit data; the image feature descriptor is a set of one or more groups of character strings obtained by describing image features; the image feature descriptor minimum unit is each or a plurality of character strings corresponding to each image feature point of the image feature descriptor; the combination unit data is a combination unit obtained by combining all minimum units according to a preset minimum unit combination rule;
performing stroke order coding on characters corresponding to the input elements to obtain an integral stroke order combination unit;
performing stroke order word segmentation processing on the whole stroke order combination unit to obtain a local stroke order combination unit; the whole stroke order combination unit is a stroke order character string; the stroke order word segmentation processing comprises the steps of carrying out segmentation processing on strokes of the minimum continuous stroke unit on the stroke order character string and carrying out combination processing on stroke codes of the minimum continuous stroke unit;
searching a sample image database by taking the whole stroke order combination unit and the local stroke order combination unit as keywords to obtain matched sample characters, and acquiring associated information of a sample image corresponding to the sample characters; the associated information of the sample image comprises an image feature descriptor, an image feature descriptor minimum unit and combination unit data of the sample image;
confirming the sample characters as stroke order-shaped near characters of the input elements, and confirming the associated information of the sample images as the associated information of the stroke order-shaped near characters; comparing the image characteristics of the input element and the sample image according to the associated information of the input element and the associated information of the stroke order shape near word to obtain an image characteristic approximation rate of the stroke order shape near word;
and confirming the stroke order shape near character with the image feature approximation rate meeting the application requirement as the shape near character of the input element.
2. The method according to claim 1, wherein the preset writing fonts include the Song typeface, the Hei typeface, and other known fonts;
the sample image comprises a pattern formed by any Chinese character in each font form, a pattern formed by any non-Chinese character in each font form, a trademark pattern with any character meaning, an appearance design pattern with any character meaning, an artwork pattern registered by any copyright with the character meaning and a custom image; the sample characters comprise Chinese characters and non-Chinese characters;
the sample image database comprises the sample characters, an integral stroke order combination unit and a local stroke order combination unit of the sample characters and sample images corresponding to the sample characters; the sample image database also comprises sample characters corresponding to the sample images, and image feature descriptors, minimum units of the image feature descriptors and combined unit data of the sample images;
the step of performing segmentation processing on the stroke of the minimum stroke connecting unit of the stroke order character string comprises the following steps: identifying the stroke of the minimum continuous stroke unit of the stroke order character string, and dividing the stroke of the minimum continuous stroke unit in the stroke order character string;
the step of combining the stroke codes of the minimum continuous stroke unit comprises the following steps: combining the stroke codes of the minimum continuous stroke unit according to a preset combination rule to obtain a local stroke order combination unit; the local stroke order combination unit refers to a plurality of character parts and stroke orders thereof, wherein the character parts are formed by any local stroke of a character represented by stroke order codes; the preset combination rule comprises the following steps: the whole stroke order code of each character is regarded as the whole combination unit of the character, and the combination of the preset stroke number of each character is regarded as the local component combination unit of the character; the value range of the preset stroke number is greater than or equal to 2;
before the steps of identifying an input element, extracting the associated information of the input element and acquiring the characters corresponding to the input element, the method further comprises the following steps:
and establishing the sample image database.
3. The method for determining shape-approximating word according to claim 2,
the integral stroke order combination unit is a stroke order character string formed by coding according to the standard stroke writing sequence; the stroke order character string is a stroke order numeric string, a stroke order letter string or a stroke order symbol string.
4. The method for determining shape-near-word recognition according to claim 2, wherein the step of creating the sample image database includes:
carrying out image feature descriptor segmentation processing on the sample image to obtain each image feature descriptor minimum unit of the sample image; the image feature descriptor minimum unit is one or more character strings corresponding to any image feature point represented by the image feature descriptor;
combining the minimum units of the image feature descriptors according to a preset minimum unit combination rule to obtain data of each combination unit of the sample image;
and acquiring an integral stroke order combination unit and a local stroke order combination unit of the sample characters corresponding to the sample image.
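The stroke order side of the sample image database, and its use as a keyword index during retrieval, might be sketched as follows. The record layout, field names, stroke codes, and the ordering of hits by the number of shared units are all assumptions made for illustration; the image feature descriptor data that the database also holds is omitted.

```python
from collections import defaultdict

# Hypothetical sample records; field names and stroke codes are assumptions.
SAMPLES = [
    {"char": "土", "whole": "121", "local": ["12", "21"]},
    {"char": "王", "whole": "1121", "local": ["11", "12", "21", "112", "121"]},
    {"char": "干", "whole": "112", "local": ["11", "12"]},
]


def build_index(samples):
    """Map every whole or local stroke order unit to the sample characters that contain it."""
    index = defaultdict(set)
    for record in samples:
        for unit in [record["whole"], *record["local"]]:
            index[unit].add(record["char"])
    return index


def retrieve(index, whole, local):
    """Return sample characters sharing at least one unit with the query,
    ordered by number of shared units (most first), then by character."""
    hits = defaultdict(int)
    for unit in {whole, *local}:
        for char in index.get(unit, ()):
            hits[char] += 1
    return sorted(hits, key=lambda c: (-hits[c], c))


if __name__ == "__main__":
    index = build_index(SAMPLES)
    # Query for a character whose stroke order string is "121" (for example 士).
    print(retrieve(index, "121", ["12", "21"]))
```

An inverted index of this kind keeps retrieval cost proportional to the number of query units rather than to the size of the database, which is one plausible reason for using stroke order units as keywords before the more expensive image feature comparison.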
5. The method for identifying and determining shape-similar words according to any one of claims 1 to 4, wherein the step of comparing the image features of the input elements and the sample image according to the association information of the input elements and the association information of the stroke-similar words to obtain the image feature approximation rate of the stroke-similar words comprises:
acquiring the image feature descriptor minimum unit matching rate and the image feature descriptor minimum unit mismatching rate of the input element and the stroke order shape near word; the image feature descriptor minimum unit matching rate is the rate at which the image feature descriptor minimum units of the input element match the image feature descriptor minimum units of the stroke order shape near word; the image feature descriptor minimum unit mismatching rate is the rate at which the image feature descriptor minimum units of the input element do not match the image feature descriptor minimum units of the stroke order shape near word;
and determining the value obtained by subtracting the image feature descriptor minimum unit mismatching rate from the image feature descriptor minimum unit matching rate as the image feature approximation rate.
6. The method according to claim 5, wherein the step of obtaining the minimum unit matching rate of the image feature descriptor and the minimum unit non-matching rate of the image feature descriptor of the input element and the stroke order shape near word comprises:
acquiring the total number of image feature descriptor minimum units of the input element, the total number of image feature descriptor minimum units of the stroke order shape near word that match the input element, and the total number of image feature descriptor minimum units of the stroke order shape near word that do not match the input element;
obtaining the minimum unit matching rate of the image feature descriptors based on the following formula:
$$M_a = \frac{U_a}{U_0} \times 100\%$$
wherein $M_a$ represents the image feature descriptor minimum unit matching rate, $U_0$ represents the total number of image feature descriptor minimum units of the input element, and $U_a$ represents the total number of image feature descriptor minimum units of the stroke order shape near word that match the input element;
obtaining the image feature descriptor minimum unit mismatching rate based on the following formula:
$$M_i = \frac{U_c}{U_0} \times 100\% + (n - 1) \times \omega$$
wherein $M_i$ represents the image feature descriptor minimum unit mismatching rate, $U_0$ represents the total number of image feature descriptor minimum units of the input element, $U_c$ represents the total number of image feature descriptor minimum units of the stroke order shape near word that do not match the input element, $n$ represents the number of unmatched positions of the stroke order shape near word and the input element on the connecting line of the image feature line minimum unit combination, and $\omega$ represents the weight of the number of the matched positions; wherein the value of $\omega$ is less than or equal to 90%.
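As a minimal sketch, the two rates of claim 6 can be computed directly from the counts. The function and parameter names are assumptions, rates are returned as fractions (0.75 means 75%), and the formulas are applied literally; in particular, the source does not say how the (n - 1) term should behave when n is 0, so no clamping is done here.

```python
def match_rate(u0: int, ua: int) -> float:
    """M_a = U_a / U_0, returned as a fraction."""
    if u0 <= 0:
        raise ValueError("U_0 must be positive")
    return ua / u0


def mismatch_rate(u0: int, uc: int, n: int, omega: float) -> float:
    """M_i = U_c / U_0 + (n - 1) * omega, returned as a fraction."""
    if u0 <= 0:
        raise ValueError("U_0 must be positive")
    if omega > 0.9:
        raise ValueError("omega must not exceed 90%")
    return uc / u0 + (n - 1) * omega


if __name__ == "__main__":
    # Illustrative numbers only: 20 input units, 15 matched, 5 unmatched,
    # 2 unmatched positions, omega = 5%.
    print(match_rate(20, 15), mismatch_rate(20, 5, 2, 0.05))
```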
7. The method of claim 5, wherein the step of obtaining the image feature approximation rate according to the image feature descriptor minimum unit matching rate and the image feature descriptor minimum unit mismatching rate comprises:
obtaining the image feature approximation rate based on the following formula:
$$M = M_a - M_i \times \beta$$
wherein $M$ represents the image feature approximation rate, $M_a$ represents the image feature descriptor minimum unit matching rate, $M_i$ represents the image feature descriptor minimum unit mismatching rate, and $\beta$ represents the weight of $M_i$; wherein the value of $\beta$ is less than or equal to 90%.
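The weighted combination in claim 7 reduces to a one-line computation. The sketch below uses an assumed function name and expresses all rates as fractions; the numeric inputs are illustrative values, not data from the source.

```python
def approximation_rate(ma: float, mi: float, beta: float) -> float:
    """M = M_a - M_i * beta, with all values expressed as fractions."""
    if beta > 0.9:
        raise ValueError("beta must not exceed 90%")
    return ma - mi * beta


if __name__ == "__main__":
    # Illustrative values: M_a = 75%, M_i = 30%, beta = 50%.
    print(approximation_rate(0.75, 0.30, 0.5))
```

With these illustrative values the mismatching rate is discounted by half before subtraction, so a candidate is penalized for mismatches less heavily than it is rewarded for matches; a larger beta makes the ranking stricter.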
8. The method according to claim 5, wherein the step of confirming the stroke order-shaped near word with the image feature approximation rate meeting the application requirement as the shape-shaped near word of the input element further comprises the steps of:
selecting the stroke order-shaped near characters of which the minimum unit matching rate of the image feature descriptors is greater than a matching rate threshold value and the minimum unit mismatching rate of the image feature descriptors is less than a mismatching rate threshold value;
the step of confirming the stroke order shape near word with the image feature approximation rate meeting the application requirement as the shape near word of the input element comprises the following steps:
and sequencing each stroke order type near character according to the image characteristic approximation rate, and determining the character corresponding to the stroke order type near character meeting the preset sequencing ranking as the type near character of the input element.
9. The method according to claim 8, wherein the matching rate threshold is 30%; the mismatch rate threshold is 70%; the preset ranking is less than 300.
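The screening and ranking of claims 8 and 9 could be sketched as below. The `Candidate` type, its field names, and the function name are assumptions; the thresholds default to the values stated in claim 9, and "ranking less than 300" is read here as keeping 1-based ranks 1 through 299, which is one possible interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    char: str
    ma: float  # matching rate M_a as a fraction
    mi: float  # mismatching rate M_i as a fraction
    m: float   # image feature approximation rate M as a fraction


def select_near_words(candidates, ma_threshold=0.30, mi_threshold=0.70, rank_limit=300):
    """Filter by both thresholds, sort by M descending, keep ranks below rank_limit."""
    kept = [c for c in candidates if c.ma > ma_threshold and c.mi < mi_threshold]
    kept.sort(key=lambda c: c.m, reverse=True)
    # Ranks are 1-based, so "rank < rank_limit" keeps at most rank_limit - 1 items.
    return [c.char for c in kept[: rank_limit - 1]]


if __name__ == "__main__":
    pool = [
        Candidate("王", 0.60, 0.30, 0.45),
        Candidate("土", 0.90, 0.10, 0.85),
        Candidate("工", 0.20, 0.10, 0.15),  # dropped: matching rate too low
        Candidate("五", 0.50, 0.80, 0.10),  # dropped: mismatching rate too high
    ]
    print(select_near_words(pool))
```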
10. A device for recognizing and determining a shape near word, comprising:
the identification extraction module is used for acquiring input elements; the input element is an input image or an input character; under the condition that the input element is an input character, performing image conversion on the input character according to a preset writing font to obtain an input image; identifying the input elements, extracting the associated information of the input elements, and acquiring characters corresponding to the input elements; the associated information of the input elements comprises image feature descriptors, minimum units of the image feature descriptors and combination unit data; the image feature descriptor is a set of one or more groups of character strings obtained by describing image features; the image feature descriptor minimum unit is each or a plurality of character strings corresponding to each image feature point of the image feature descriptor; the combination unit data is a combination unit obtained by combining all minimum units according to a preset minimum unit combination rule;
the coding module is used for coding the stroke order of the characters corresponding to the input elements to obtain an integral stroke order combination unit;
the word segmentation module is used for carrying out stroke order word segmentation processing on the whole stroke order combination unit to obtain a local stroke order combination unit; the whole stroke order combination unit is a stroke order character string; the stroke order word segmentation processing comprises the steps of carrying out segmentation processing on strokes of the minimum continuous stroke unit on the stroke order character string and carrying out combination processing on stroke codes of the minimum continuous stroke unit;
the retrieval module is used for retrieving a sample image database by taking the whole stroke order combination unit and the local stroke order combination unit as keywords, obtaining matched sample characters and acquiring associated information of a sample image corresponding to the sample characters; the associated information of the sample image comprises an image feature descriptor, an image feature descriptor minimum unit and combination unit data of the sample image;
the image characteristic approximation rate acquisition module is used for confirming the sample characters as stroke order-shaped near characters of the input elements and confirming the associated information of the sample images as the associated information of the stroke order-shaped near characters; comparing the image characteristics of the input element and the sample image according to the associated information of the input element and the associated information of the stroke order shape near word to obtain an image characteristic approximation rate of the stroke order shape near word;
and the selecting module is used for confirming the stroke order shape near character with the image characteristic approximation rate meeting the application requirement as the shape near character of the input element.
11. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 9.
CN201810834750.7A 2018-07-26 2018-07-26 Shape-near word recognition determination method, device, computer device and storage medium Active CN109190615B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810834750.7A CN109190615B (en) 2018-07-26 2018-07-26 Shape-near word recognition determination method, device, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN109190615A (en) 2019-01-11
CN109190615B (en) 2021-12-03

Family

ID=64937606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810834750.7A Active CN109190615B (en) 2018-07-26 2018-07-26 Shape-near word recognition determination method, device, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN109190615B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097002B (en) * 2019-04-30 2020-12-11 北京达佳互联信息技术有限公司 Shape and proximity word determining method and device, computer equipment and storage medium
CN110287286B (en) * 2019-06-13 2022-03-08 北京百度网讯科技有限公司 Method and device for determining similarity of short texts and storage medium
CN113743105B (en) * 2021-09-07 2022-05-24 深圳海域信息技术有限公司 Character similarity retrieval analysis method based on big data feature recognition

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101017531A (en) * 2006-02-10 2007-08-15 富士通株式会社 Character searches device
EP3048561A1 (en) * 2015-01-21 2016-07-27 Xerox Corporation Method and system to perform text-to-image queries with wildcards
CN106874947A (en) * 2017-02-07 2017-06-20 第四范式(北京)技术有限公司 Method and apparatus for determining word shape recency
CN108154167A (en) * 2017-12-04 2018-06-12 昆明理工大学 A kind of Chinese character pattern similarity calculating method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9495620B2 (en) * 2013-06-09 2016-11-15 Apple Inc. Multi-script handwriting recognition using a universal recognizer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An Improved Method for Similar Handwritten Chinese Character Recognition;Fang Yang等;《2010 Third International Symposium on Intelligent Information Technology and Security Informatics》;20100422;1-7 *
基于形近字识别的互联网搜索关键字校验;王逍翔等;《第六届云南省科协学术年会暨红河流域发展论坛论文集——专题二:滇南中心智慧城市建设》;20160906;419-422 *

Also Published As

Publication number Publication date
CN109190615A (en) 2019-01-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 528000 room 2002, block A, 33 Jihua five road, Chancheng District, Foshan, Guangdong.

Patentee after: Xu Qing

Patentee after: Foshan Guofang Identification Technology Co.,Ltd.

Patentee after: Foshan Guofang Software Technology Co.,Ltd.

Address before: 528000 room 2002, block A, 33 Jihua five road, Chancheng District, Foshan, Guangdong.

Patentee before: Xu Qing

Patentee before: FOSHAN GUOFANG TRADEMARK SERVICE Co.,Ltd.

Patentee before: FOSHAN GUOFANG TRADEMARK IDENTIFICATION TECHNOLOGY Co.,Ltd.