CN116227432A - Text processing and heteromorphic code word determining method, device and equipment - Google Patents

Text processing and heteromorphic code word determining method, device and equipment

Info

Publication number
CN116227432A
Authority
CN
China
Prior art keywords
character
target
text
character set
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310199456.4A
Other languages
Chinese (zh)
Inventor
马诗涵
黄文亢
石秋慧
王洪彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310199456.4A
Publication of CN116227432A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures

Abstract

The embodiments of this specification disclose a text processing method and a heteromorphic codeword determining method, together with corresponding devices and equipment. In the text processing method, after a text to be processed is acquired, it is determined, for a first character in the text, whether a target character set contains the first character, where the first character is any character in the text to be processed and the characters in the target character set can be correctly processed by a subsequent text processing task. If the target character set does not contain the first character, it is determined whether a second character exists in the target character set, where the second character is a heteromorphic codeword of the first character in the target character set. If the second character exists in the target character set, the first character in the text to be processed is replaced with the second character.

Description

Text processing and heteromorphic code word determining method, device and equipment
Technical Field
This document relates to the field of computer technologies, and in particular to a method, an apparatus, and a device for text processing and heteromorphic codeword determination.
Background
An internationally used universal character set (e.g., Unicode) contains some redundant codes because it must be compatible with the characters and symbols of many countries. For example, two countries with different languages may each use a character that looks identical to the other's, yet the character corresponds to two different codes in the character set. A computer processes a character by its code, so two characters that look identical are still different characters to the computer. Two characters that are identical in glyph but different in code are referred to as heteromorphic codewords (commonly known as homoglyphs).
Heteromorphic codewords create hidden risks for subsequent text processing tasks. For example, in a retrieval task they can cause retrieval results to differ from what is expected; in machine learning a heteromorphic codeword may fail to be encoded; and software that relies on optical character recognition (OCR) to recognize text may misrecognize characters to some degree.
Disclosure of Invention
The embodiments of this specification provide a method, a device and equipment for text processing and heteromorphic codeword determination, so as to avoid the hidden risks that heteromorphic codewords pose to subsequent text processing tasks.
In order to solve the above technical problems, the embodiments of the present specification are implemented as follows:
in a first aspect, a text processing method is provided, including:
acquiring a text to be processed;
for a first character in the text to be processed, determining whether a target character set contains the first character, wherein the first character is any character in the text to be processed, and the characters in the target character set can be correctly processed by a subsequent text processing task;
determining whether a second character exists in the target character set under the condition that the first character is not contained in the target character set, wherein the second character is a heteromorphic codeword of the first character in the target character set;
and replacing the first character in the text to be processed with the second character under the condition that the second character exists in the target character set.
In a second aspect, a method for determining a heteromorphic codeword is provided, including:
acquiring the code of a fifth character and the code of a sixth character;
rendering, according to the code of the fifth character, a fifth picture displaying the fifth character;
rendering, according to the code of the sixth character, a sixth picture displaying the sixth character;
determining the similarity of the fifth picture and the sixth picture;
and determining the fifth character and the sixth character as heteromorphic codewords in the case that the similarity is greater than or equal to a preset threshold.
In a third aspect, a text processing apparatus is provided, including:
a text acquisition module, configured to acquire a text to be processed;
a first determining module, configured to determine, for a first character in the text to be processed, whether a target character set contains the first character, wherein the first character is any character in the text to be processed, and the characters in the target character set can be correctly processed by a subsequent text processing task;
a second determining module, configured to determine, if the target character set does not include the first character, whether a second character exists in the target character set, where the second character is a heteromorphic codeword of the first character in the target character set;
and a character replacement module, configured to replace the first character in the text to be processed with the second character in the case that the second character exists in the target character set.
In a fourth aspect, a heteromorphic codeword determination device is provided, comprising:
a code acquisition module, configured to acquire the code of a fifth character and the code of a sixth character;
a first rendering module, configured to render, according to the code of the fifth character, a fifth picture displaying the fifth character;
a second rendering module, configured to render, according to the code of the sixth character, a sixth picture displaying the sixth character;
a third determining module, configured to determine the similarity of the fifth picture and the sixth picture;
and a fourth determining module, configured to determine the fifth character and the sixth character as heteromorphic codewords when the similarity is greater than or equal to a preset threshold.
In a fifth aspect, an electronic device is provided, including:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring a text to be processed;
for a first character in the text to be processed, determining whether a target character set contains the first character, wherein the first character is any character in the text to be processed, and the characters in the target character set can be correctly processed by a subsequent text processing task;
determining whether a second character exists in the target character set under the condition that the first character is not contained in the target character set, wherein the second character is a heteromorphic codeword of the first character in the target character set;
and replacing the first character in the text to be processed with the second character under the condition that the second character exists in the target character set.
In a sixth aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing one or more programs that, when executed by an electronic device that includes a plurality of application programs, cause the electronic device to:
acquiring a text to be processed;
for a first character in the text to be processed, determining whether a target character set contains the first character, wherein the first character is any character in the text to be processed, and the characters in the target character set can be correctly processed by a subsequent text processing task;
determining whether a second character exists in the target character set under the condition that the first character is not contained in the target character set, wherein the second character is a heteromorphic codeword of the first character in the target character set;
and replacing the first character in the text to be processed with the second character under the condition that the second character exists in the target character set.
In a seventh aspect, an electronic device is provided, including:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring the code of a fifth character and the code of a sixth character;
rendering, according to the code of the fifth character, a fifth picture displaying the fifth character;
rendering, according to the code of the sixth character, a sixth picture displaying the sixth character;
determining the similarity of the fifth picture and the sixth picture;
and determining the fifth character and the sixth character as heteromorphic codewords in the case that the similarity is greater than or equal to a preset threshold.
In an eighth aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing one or more programs that, when executed by an electronic device that includes a plurality of application programs, cause the electronic device to:
acquiring the code of a fifth character and the code of a sixth character;
rendering, according to the code of the fifth character, a fifth picture displaying the fifth character;
rendering, according to the code of the sixth character, a sixth picture displaying the sixth character;
determining the similarity of the fifth picture and the sixth picture;
and determining the fifth character and the sixth character as heteromorphic codewords in the case that the similarity is greater than or equal to a preset threshold.
According to at least one technical solution provided by the embodiments of this specification, a first character that is contained in the text to be processed but is not in the target character set can be replaced with its heteromorphic codeword in the target character set. Because the characters in the target character set can be correctly processed by the subsequent text processing task, the text to be processed no longer contains heteromorphic codewords that lie outside the target character set. This avoids the hidden risks that heteromorphic codewords pose to subsequent text processing tasks, such as the subsequent task being unable to process the text correctly.
Drawings
The accompanying drawings are included to provide a further understanding of the specification. The exemplary embodiments of the specification and their description serve to explain the specification and are not intended to limit it unduly. In the drawings:
Fig. 1 is a schematic flow chart of a text processing method according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a text processing method according to an embodiment of the present disclosure.
Fig. 3 is a schematic flow chart of determining a heteromorphic codeword mapping table corresponding to a target character set according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram of determining a heteromorphic codeword mapping table corresponding to a target character set according to an embodiment of the present disclosure.
Fig. 5 is a flowchart of a method for determining heteromorphic codewords according to an embodiment of the present disclosure.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Fig. 7 is a schematic structural diagram of a text processing device according to an embodiment of the present disclosure.
Fig. 8 is a schematic structural diagram of a heteromorphic codeword determining device according to an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
Because a universal character set such as Unicode was designed from the outset to cover the characters and symbols of many countries, some characters used in different countries are similar or even identical in shape but have different codes. For example, the Chinese, Japanese and Korean characters used in Asia repeat one another's shapes to some extent. This creates hidden risks for text processing tasks. For example, software that relies on optical character recognition (OCR) to recognize text may misrecognize characters. As another example, the Chinese character for "you" and the Japanese character for "you" look the same in Unicode but have different codes, so Chinese processing software cannot find the Japanese character by searching with the code of the Chinese character.
To avoid the hidden risks that heteromorphic codewords pose to subsequent text processing tasks, the embodiments of this specification propose a text processing method and apparatus. The method can be executed by an electronic device, or by software or hardware installed in the electronic device. The electronic device includes, but is not limited to, a terminal device or a server. The terminal device includes, but is not limited to, any smart terminal device such as a smartphone, a personal computer (PC), a notebook computer, a tablet computer, an electronic reader, a web television or a wearable device. The server includes, but is not limited to, any of a single server, a plurality of servers, a server cluster and a cloud server.
A text processing method provided in the embodiments of the present specification is first described below.
As shown in fig. 1, a text processing method according to an embodiment of the present disclosure may include:
step 102, acquiring a text to be processed.
The text to be processed may be text to be processed in any text processing environment, for example, the text to be processed may be text input in a retrieval task, or the text to be processed may be text to be encoded in machine learning, or the text to be processed may be initial text identified from a picture by using OCR technology, and so on.
Step 104, for a first character in the text to be processed, determining whether a target character set contains the first character, wherein the first character is any character in the text to be processed, and the characters in the target character set can be correctly processed by a subsequent text processing task.
The target character set is the conversion target for heteromorphic codewords outside it: a character outside the target character set is converted into its heteromorphic codeword inside the target character set, so that it can be correctly processed by the subsequent text processing task.
As an example, the target character set is determined according to the application environment of the text to be processed, in particular selected from the original character set according to the application environment of the text to be processed. The original character set may be any character set, and as an example, the original character set may be specifically an international universal character set such as Unicode. When the original character set is an international general character set, if the application environment of the text to be processed is a Chinese environment (or the text to be processed is a Chinese text), the target character set is a Chinese character set selected from the international general character set; similarly, if the application environment of the text to be processed is a japanese environment (or the text to be processed is japanese text), the target character set is a japanese character set selected from the international character set, and so on.
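As a rough illustration of selecting a target character set from Unicode, the sketch below builds the set from code point ranges. The helper name `build_target_set` and the choice of ranges are assumptions made for illustration only; the specification does not prescribe how the selection is made. The two ranges shown are the Unicode Basic Latin and CJK Unified Ideographs blocks.

```python
# Hypothetical helper: the specification does not define how the target
# character set is selected; code point ranges are one simple option.
BASIC_LATIN = (0x0000, 0x007F)             # Unicode Basic Latin block
CJK_UNIFIED_IDEOGRAPHS = (0x4E00, 0x9FFF)  # Unicode CJK Unified Ideographs block


def build_target_set(ranges):
    """Return the set of code points covered by the inclusive ranges."""
    target = set()
    for start, end in ranges:
        target.update(range(start, end + 1))
    return target


# An assumed target set for a Chinese-language environment.
chinese_target = build_target_set([BASIC_LATIN, CJK_UNIFIED_IDEOGRAPHS])
```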
In general, whether the target character set contains the first character can be determined by checking whether the target character set contains the code of the first character. If it does, the target character set is determined to contain the first character; otherwise it is determined not to contain the first character. Because the code of a character is unique, using the code of the first character gives a more accurate determination.
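The code-based check can be sketched as follows. The function name `in_target_set` and the three-letter toy target set are illustrative assumptions; the point is only that two characters that look alike are told apart by their codes.

```python
def in_target_set(char, target_codes):
    """Membership is decided on the character's code, not its appearance."""
    return ord(char) in target_codes


target_codes = {ord(c) for c in "abc"}  # toy target set: Latin a, b, c
latin_a = "\u0061"     # LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # CYRILLIC SMALL LETTER A, usually drawn like Latin "a"

print(in_target_set(latin_a, target_codes))
print(in_target_set(cyrillic_a, target_codes))
```

Here the Latin letter is inside the toy target set and the Cyrillic look-alike is outside it, even though most fonts draw them the same way.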
Through step 104, at least some (generally all) of the characters in the text to be processed can be checked one by one to confirm whether each is inside or outside the target character set. If a character is outside the target character set, it must further be determined whether the character has a heteromorphic codeword in the target character set, and if it does, the character must be replaced (see steps 106 and 108 below). In this way the text to be processed no longer contains heteromorphic codewords that lie outside the target character set, which avoids the hidden risks that heteromorphic codewords pose to subsequent text processing tasks, such as the subsequent text processing task being unable to process the text correctly.
In a specific implementation, the characters in the text to be processed can be traversed one by one, with the currently traversed character taken as the first character on which "determining whether the target character set contains the first character" in step 104 is performed. If the target character set contains it, the currently traversed character is skipped, the next character in the text to be processed is taken as the first character, and that determination in step 104 is performed again; if not, "determining whether a second character exists in the target character set" in step 106, described below, is performed on the first character.
Optionally, in the case that the target character set includes the first character, the first character is not processed, that is, the first character is skipped, the next character in the text to be processed is taken as the first character, and "determining whether the target character set includes the first character" in step 104 is performed, so that the process is continuously circulated until the traversal is completed.
Step 106, determining whether a second character exists in the target character set in the case that the first character is not contained in the target character set, wherein the second character is a heteromorphic codeword of the first character in the target character set.
Step 108, replacing the first character in the text to be processed with the second character in the case that the second character exists in the target character set.
Optionally, in the case that the second character does not exist in the target character set, taking the next character in the text to be processed as the first character and performing "determining whether the target character set includes the first character" in step 104, and subsequent steps 106 and 108, so continuously circulate until the traversal is ended.
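Steps 102 to 108 and the traversal described above can be sketched as a single loop. This is a minimal sketch under stated assumptions, not the implementation of the specification: `replace_homoglyphs` and the `find_second_char` callback are hypothetical names, and how the second character is actually found (step 106) is left to the caller.

```python
def replace_homoglyphs(text, target_codes, find_second_char):
    """Replace out-of-set characters that have a look-alike in the target set.

    find_second_char(char) is a hypothetical callback that returns the
    look-alike character inside the target set, or None if there is none.
    """
    out = []
    for first_char in text:                     # traverse characters one by one
        if ord(first_char) in target_codes:     # step 104: already in the set
            out.append(first_char)
            continue
        second_char = find_second_char(first_char)  # step 106
        if second_char is None:
            out.append(first_char)              # no look-alike: leave unchanged
        else:
            out.append(second_char)             # step 108: replace
    return "".join(out)


# Toy usage: the Cyrillic "a" (U+0430) is mapped to the Latin "a".
toy_target = {ord(c) for c in "pay"}
toy_lookup = {"\u0430": "a"}.get
cleaned = replace_homoglyphs("p\u0430y!", toy_target, toy_lookup)
```

In this toy run the exclamation mark is outside the target set and has no look-alike, so it is left as it is, which matches the optional branch described above.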
There are several ways to determine, in step 106, whether a second character exists in the target character set. Two of them are described below.
First mode
The determining whether a second character exists in the target character set may include: acquiring the code of the first character, and rendering a first picture according to the code of the first character; acquiring codes of characters in the target character set, and rendering a plurality of second pictures according to the codes of the characters in the target character set, wherein one second picture correspondingly displays one character in the target character set; then, the similarity of the first picture and the second pictures is respectively determined, and a plurality of similarities are obtained; then determining whether target similarity greater than or equal to a preset threshold exists in the plurality of similarities; if so, determining the character displayed by the second picture corresponding to the target similarity as the second character; if not, determining that the second character does not exist in the target character set.
The first mode differs from the second mode in that no heteromorphic codeword mapping table corresponding to the target character set is determined in advance; instead, whether a heteromorphic codeword of the first character, namely the second character, exists in the target character set is determined directly by rendering and comparing pictures.
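The first mode can be sketched as below. The renderer and the similarity function are passed in as parameters because the specification does not name a particular rendering system or similarity measure; `render`, `similarity` and `find_second_char` are hypothetical names. When several candidates reach the threshold, this sketch returns the most similar one, which is a design choice of the sketch rather than something the specification requires.

```python
def find_second_char(first_char, target_chars, render, similarity, threshold):
    """Return a target-set character whose picture matches first_char's, or None."""
    first_picture = render(first_char)           # the "first picture"
    best_char, best_score = None, threshold
    for candidate in target_chars:
        second_picture = render(candidate)       # one "second picture" per character
        score = similarity(first_picture, second_picture)
        if score >= best_score:                  # target similarity found
            best_char, best_score = candidate, score
    return best_char


# Toy stand-ins for a real renderer: each character maps to a tiny bitmap.
toy_glyphs = {"\u0430": (1, 1, 0, 1), "a": (1, 1, 0, 1), "b": (1, 0, 0, 0)}
toy_render = toy_glyphs.__getitem__


def toy_similarity(p, q):
    """Fraction of pixel positions at which the two bitmaps agree."""
    return sum(x == y for x, y in zip(p, q)) / len(p)
```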
Second mode
The determining whether a second character exists in the target character set may include: first, determining a heteromorphic codeword mapping table corresponding to the target character set; then determining whether the heteromorphic codeword mapping table contains the first character; and finally, in the case that the heteromorphic codeword mapping table contains the first character, determining the second character according to the heteromorphic codeword mapping table.
The heteromorphic codeword mapping table corresponding to the target character set is a heteromorphic codeword mapping table that uses the characters in the target character set as primary keys; that is, each character in the target character set uniquely identifies one record in the table, as shown in Table 1 below.
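Table 1: Example of a heteromorphic codeword mapping table corresponding to a target character set
[Table 1 is an image in the original publication (figure BDA0004108946370000071) and is not reproduced here.]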
In Table 1, the heteromorphic codewords of a character in the target character set generally lie outside the target character set. For example, the heteromorphic codewords corresponding to character 1, namely character X and character Y, lie outside the target character set, and so on.
It can be appreciated that after the heteromorphic codeword mapping table corresponding to the target character set is available, whether the second character exists in the target character set can be quickly determined by querying the heteromorphic codeword mapping table.
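One way to make the table query fast is to invert the table once, so that an out-of-set character leads directly to its primary key. The sketch below assumes the table is held as a dictionary from a target-set character to a list of its look-alikes outside the set; this representation and the name `build_reverse_index` are assumptions, not part of the specification.

```python
def build_reverse_index(mapping_table):
    """Invert {target_char: [outside look-alikes]} for direct lookup."""
    reverse = {}
    for target_char, lookalikes in mapping_table.items():
        for outside_char in lookalikes:
            # If an outside character is listed under several keys, keep the first.
            reverse.setdefault(outside_char, target_char)
    return reverse


# Toy table: Latin "a" and "o" as primary keys, with Cyrillic and Greek look-alikes.
mapping_table = {"a": ["\u0430"], "o": ["\u03bf", "\u043e"]}
reverse_index = build_reverse_index(mapping_table)
second_char = reverse_index.get("\u043e")  # Cyrillic "o" as the first character
```

A first character that is absent from the reverse index has no second character in the target character set.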
The heteromorphic codeword mapping table corresponding to the target character set can also be determined in several ways. Two examples are described below.
In a first example, as shown in fig. 3, the determining the heteromorphic codeword mapping table corresponding to the target character set includes:
step 302, selecting the target character set from the original character set.
As described above, the original character set may be any character set, and as an example, the original character set may be specifically an international universal character set, such as Unicode. When the original character set is an international general character set, if the application environment of the text to be processed is a Chinese environment (or the text to be processed is a Chinese text), the target character set is a Chinese character set selected from the international general character set; similarly, if the application environment of the text to be processed is a japanese environment (or the text to be processed is japanese text), the target character set is a japanese character set selected from the international character set, and so on.
Step 304, for a third character in the target character set, determining whether a fourth character with the same shape as the third character exists in a remaining character set, wherein the third character is any character in the target character set, and the remaining character set is a character set formed by all or some of the characters in the original character set other than those in the target character set.
In a specific implementation, the characters in the target character set can be traversed one by one, with the currently traversed character taken as the third character on which "determining whether a fourth character with the same shape as the third character exists in the remaining character set" in step 304 is performed. If no such character exists, the currently traversed character is skipped, the next character in the target character set is taken as the third character, and that determination in step 304 is performed again; if such a character exists, "recording the third character and the fourth character in correspondence" in step 306, described below, is performed on the third character. This continues until the traversal ends.
More specifically, the determining whether a fourth character with the same shape as the third character exists in the remaining character set may include: rendering, according to the code of the third character, a third picture displaying the third character; rendering, according to the codes of the characters in the remaining character set, a plurality of fourth pictures, each of which displays one character in the remaining character set; determining the similarity between the third picture and each of the fourth pictures to obtain a plurality of similarities; determining whether a target similarity greater than or equal to a preset threshold exists among the plurality of similarities; and, in the case that the target similarity exists, determining that a fourth character with the same shape as the third character exists in the remaining character set, the fourth character being the character displayed by the fourth picture corresponding to the target similarity.
Optionally, the determining whether the fourth character having the same shape as the third character exists in the remaining character set further includes: in the case that the target similarity does not exist in the plurality of similarities, it is determined that a fourth character having the same shape as the third character does not exist in the remaining character set.
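The specification does not say how the similarity of two pictures is computed. As one possible measure, assumed here purely for illustration, the sketch below treats each picture as a same-sized grid of pixel values and returns the fraction of positions at which the two grids agree; `pixel_similarity` is a hypothetical name.

```python
def pixel_similarity(picture_a, picture_b):
    """Fraction of positions at which two same-sized bitmaps agree (0.0 to 1.0)."""
    flat_a = [px for row in picture_a for px in row]
    flat_b = [px for row in picture_b for px in row]
    if len(flat_a) != len(flat_b) or not flat_a:
        raise ValueError("pictures must be non-empty and the same size")
    matches = sum(1 for a, b in zip(flat_a, flat_b) if a == b)
    return matches / len(flat_a)


identical = pixel_similarity([[1, 0], [0, 1]], [[1, 0], [0, 1]])
one_pixel_off = pixel_similarity([[1, 0], [0, 1]], [[1, 0], [1, 1]])
```

With a measure of this kind, a preset threshold close to 1.0 accepts only pictures that are nearly pixel-identical, while a lower threshold also accepts characters that are merely similar in shape.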
Of course, besides calculating picture similarity as described above, whether a fourth character with the same shape as the third character exists in the remaining character set can also be determined by querying an existing heteromorphic codeword table of the original character set.
Step 306, in the case that a fourth character with the same shape as the third character exists in the remaining character set, recording the third character and the fourth character in correspondence to obtain the heteromorphic codeword mapping table corresponding to the target character set.
Specific implementations of step 304 and step 306 are illustrated below with reference to fig. 4, a schematic diagram of determining the heteromorphic codeword mapping table corresponding to the target character set.
As shown in fig. 4, assume that the target character set is E and the code of the i-th character in it is e_i, and that the remaining character set (the character set formed by characters outside the target character set) is F and the code of the j-th character in it is f_j, where i and j are integers greater than or equal to zero. A rendering system (assumed to be G) is selected; G receives a code e and renders a picture whose display content is the character corresponding to e.
On this basis, the codes e_i in the target character set E are traversed, and for each a corresponding picture p_i is rendered through G and used as the base comparison picture. For the code e_i, the codes f_j in the remaining character set F are traversed, and a corresponding picture p_j is rendered through G for each. The similarity s of picture p_i and picture p_j is then calculated by a picture similarity function (such as Sim). If s is greater than or equal to a preset threshold, the characters corresponding to the codes e_i and f_j are considered heteromorphic codewords and are recorded in a heteromorphic codeword mapping table M; if s is smaller than the preset threshold, the pair is skipped. Once the comparisons for e_i are finished, the same steps are performed on the next code e_(i+1), until the traversal ends. Through these steps, the heteromorphic codeword mapping table M for the characters corresponding to all codes in the target character set E is obtained.
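The traversal over E and F can be sketched as a nested loop. The parameters `render` and `sim` stand in for the rendering system G and the similarity function Sim, neither of which the specification pins down; `build_mapping_table` is a hypothetical name. Rendering every code in F once up front is an optimization added in this sketch and is not described in the specification.

```python
def build_mapping_table(target_codes, remaining_codes, render, sim, threshold):
    """Map each target-set code e_i to the remaining-set codes f_j that look alike."""
    # Render every f_j once so it is not re-rendered for each e_i.
    remaining_pictures = {f: render(f) for f in remaining_codes}
    table = {}                                   # the mapping table M
    for e in target_codes:
        p_i = render(e)                          # base comparison picture
        matches = [f for f, p_j in remaining_pictures.items()
                   if sim(p_i, p_j) >= threshold]
        if matches:
            table[e] = matches                   # record e_i with its look-alikes
    return table


# Toy stand-ins for G and Sim: each code maps to a tiny bitmap.
toy_glyphs = {0x61: (1, 1, 0, 1), 0x430: (1, 1, 0, 1),
              0x62: (1, 0, 0, 0), 0x44F: (0, 0, 1, 0)}


def toy_sim(p, q):
    """Fraction of pixel positions at which the two bitmaps agree."""
    return sum(x == y for x, y in zip(p, q)) / len(p)


M = build_mapping_table([0x61, 0x62], [0x430, 0x44F],
                        toy_glyphs.__getitem__, toy_sim, 0.9)
```

The cost of this procedure grows with the product of the sizes of E and F, which is one reason to compute the table once and reuse it.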
It should be noted that rendering, through the rendering system, two pictures that display characters with different codes and then calculating the similarity of the two pictures to confirm whether the two characters are similar in shape is an automatic way of determining heteromorphic codewords. It requires no manual participation and is therefore efficient. In addition, it is not prone to misjudgment, so its accuracy is high, and it can find all heteromorphic codewords in a character set without omission, which makes it well worth adopting.
In a second example, determining the heteromorphic codeword mapping table corresponding to the target character set may include: selecting the target character set from the original character set; for a third character in the target character set, determining, by querying an existing homoglyph table of the original character set, whether a fourth character having the same shape as the third character exists in a remaining character set, wherein the third character is any character in the target character set, and the remaining character set is a character set formed by all or part of the characters in the original character set other than the target character set; and, in the case that a fourth character having the same shape as the third character exists in the remaining character set, recording the third character and the fourth character in correspondence to obtain the heteromorphic codeword mapping table corresponding to the target character set.
It should be noted that, through the second example, the heteromorphic codeword mapping table corresponding to the target character set may also be determined, but compared with the first example, the heteromorphic codeword mapping table corresponding to the target character set determined through the second example may not be comprehensive and accurate enough.
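As an illustration of the second example, the sketch below derives the mapping table by filtering an existing table instead of rendering pictures. The dictionary existing_table is a hypothetical, hand-written stand-in for "an existing homoglyph table of the original character set"; its format and contents are assumptions made only for this example.

```python
from typing import Dict, Iterable, List, Set


def mapping_from_existing_table(
    E: Set[str],
    F: Set[str],
    existing: Dict[str, Iterable[str]],
) -> Dict[str, List[str]]:
    """Keep only pairs (third, fourth) with third in E and fourth in F."""
    M: Dict[str, List[str]] = {}
    for third in sorted(E):
        for fourth in existing.get(third, ()):
            if fourth in F:
                M.setdefault(third, []).append(fourth)
    return M


# Hypothetical existing table: character -> characters listed as look-alikes.
existing_table = {
    "A": ["\u0391", "\u0410"],  # GREEK CAPITAL ALPHA, CYRILLIC CAPITAL A
    "B": ["\u0392"],            # GREEK CAPITAL BETA
}
E = {"A", "B", "C"}        # target character set
F = {"\u0391", "\u0392"}   # remaining set: here only part of the characters outside E
M = mapping_from_existing_table(E, F, existing_table)
```

Because the result can contain only what the existing table already lists, this sketch also shows why the second example may be less comprehensive than the rendering-based first example.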
It can be understood that if the heteromorphic codeword mapping table corresponding to the target character set is predetermined, whether the second character exists in the target character set can be quickly determined by querying the heteromorphic codeword mapping table, and rendering and comparing the first character and the characters in the target character set again are not needed, so that the efficiency of the second mode is higher than that of the first mode.
A specific implementation procedure of the text processing method according to the embodiment of the present disclosure is described below with reference to the schematic diagram shown in fig. 2.
As shown in fig. 2, assume that the text to be processed is S, the i-th character in the text to be processed is s_i, and the target character set is E. In the text processing method provided in the embodiment of the present disclosure, the characters in the text to be processed S may be traversed. For the current character s_i, it is determined whether s_i is contained in the target character set E. If so, the current character s_i is skipped, the next character in S is taken as the current character, and the step of "determining whether the current character s_i is contained in the target character set E" is performed again, until the traversal is finished. If not, it is determined whether s_i is recorded in the heteromorphic codeword mapping table M corresponding to the target character set E. If s_i is not recorded in M, the current character s_i is skipped, the next character in S is taken as the current character, and the method returns to the step of determining whether the current character is contained in E, until the traversal is finished. If s_i is recorded in M, the heteromorphic codeword m_k of s_i in the target character set E is read from M, and the current character s_i in S is replaced by m_k.
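The traversal of fig. 2 can be sketched as a single pass over the text. In this minimal example the mapping table M is assumed to be stored as a dictionary indexed by the character outside the target set (so that the lookup of m_k is direct); the function name normalize and the sample data are illustrative and do not come from the original text.

```python
from typing import Dict, Set


def normalize(S: str, E: Set[str], M: Dict[str, str]) -> str:
    """Replace characters outside E by their look-alike inside E, if recorded in M."""
    out = []
    for s_i in S:
        if s_i in E:
            out.append(s_i)        # already in the target set: skip
        elif s_i in M:
            out.append(M[s_i])     # recorded in M: replace by m_k
        else:
            out.append(s_i)        # not recorded: leave unchanged
    return "".join(out)


E = set("abcdefghijklmnopqrstuvwxyz ")
M = {"\u0430": "a", "\u043e": "o"}  # CYRILLIC SMALL A / O -> Latin a / o
S = "p\u0430yp\u0430l l\u043egin"   # visually reads as "paypal login"
result = normalize(S, E, M)
```

After the pass, result consists only of characters from E, so a later retrieval or matching step compares it equal to the plain Latin string.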
According to the text processing method provided by the embodiment of the specification, a first character that is contained in the text to be processed but is not in the target character set can be replaced by the heteromorphic codeword of the first character in the target character set. Since the characters in the target character set can be correctly processed by the subsequent text processing task, the text to be processed can be prevented from containing heteromorphic codewords that are not in the target character set, which avoids the hidden dangers that such codewords pose to subsequent text-processing-related tasks, for example the problem that the subsequent text processing task cannot correctly process the text to be processed.
That is, according to the text processing method provided in the embodiments of the present disclosure, heteromorphic codewords outside the target character set can be mapped back (or normalized) to the target character set, so that the text handed to the subsequent text processing task contains only the forms that the task can process correctly.
Optionally, the text processing method shown in fig. 1 may further include: executing a task related to text processing according to the replaced text to be processed. For example, the task related to text processing may include, but is not limited to, one or more of the following:
a retrieval task;
an input task;
a machine learning task;
a task of text recognition using optical character recognition (OCR), and so on.
Optionally, based on the text processing method, the embodiment of the present disclosure further provides a method for determining a heteromorphic codeword, as shown in fig. 5, where the method may include:
Step 502, obtaining the code of the fifth character and the code of the sixth character.
Step 504, rendering a fifth picture displaying the fifth character according to the code of the fifth character.
Step 506, rendering a sixth picture displaying the sixth character according to the code of the sixth character.
Step 508, determining the similarity of the fifth picture and the sixth picture.
In step 508, the similarity of the fifth picture and the sixth picture may be calculated according to any picture similarity algorithm.
Step 510, determining the fifth character and the sixth character as heteromorphic codewords in the case that the similarity is greater than or equal to a preset threshold.
Optionally, in the case that the similarity is smaller than the preset threshold, it is determined that the fifth character and the sixth character are not heteromorphic codewords.
It should be noted that two pictures displaying characters with different codes are rendered by the rendering system, and whether the shapes of the two characters are similar is then confirmed by calculating the similarity of the two pictures. This is an automatic way of determining heteromorphic codewords that requires no manual participation and is therefore more efficient. In addition, this way is not prone to misjudgment, so its accuracy is high; and it can find all the heteromorphic codewords in the character set without omission, so it is well worth popularizing.
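Steps 502 to 510 can be sketched as one function over a pair of codes. The text leaves the picture similarity algorithm open, so this example assumes a Jaccard (intersection over union) measure on the sets of inked pixels; TOY_FONT is a hypothetical stand-in for the rendering system, and the threshold 0.9 is an assumed value.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Pixels = FrozenSet[Tuple[int, int]]  # coordinates of inked pixels


def to_pixels(rows: List[str]) -> Pixels:
    """Convert rows of '#'/'.' into a set of (row, column) ink coordinates."""
    return frozenset(
        (r, c) for r, row in enumerate(rows) for c, ch in enumerate(row) if ch == "#"
    )


# Hypothetical pictures; a real system would rasterize glyphs from a font.
TOY_FONT: Dict[int, Pixels] = {
    0x004F: to_pixels([".##.", "#..#", "#..#", ".##."]),  # LATIN CAPITAL LETTER O
    0x039F: to_pixels([".##.", "#..#", "#..#", ".##."]),  # GREEK CAPITAL LETTER OMICRON
    0x0051: to_pixels([".##.", "#..#", "#.##", ".###"]),  # LATIN CAPITAL LETTER Q
}


def jaccard(p: Pixels, q: Pixels) -> float:
    """Intersection over union of two ink sets (1.0 for two empty pictures)."""
    union = p | q
    return len(p & q) / len(union) if union else 1.0


def is_heteromorphic(
    code5: int,
    code6: int,
    render: Callable[[int], Pixels],
    similarity: Callable[[Pixels, Pixels], float],
    threshold: float,
) -> bool:
    pic5 = render(code5)                           # step 504
    pic6 = render(code6)                           # step 506
    return similarity(pic5, pic6) >= threshold     # steps 508 and 510


render = TOY_FONT.__getitem__
same_shape = is_heteromorphic(0x004F, 0x039F, render, jaccard, 0.9)
different_shape = is_heteromorphic(0x004F, 0x0051, render, jaccard, 0.9)
```

In this toy font the Latin O and Greek Omicron share every inked pixel, while O and Q share 8 of 10, which falls below the assumed threshold.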
The method provided by the present specification is described above, and the electronic device provided by the present specification is described below.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. Referring to fig. 6, at the hardware level, the electronic device includes a processor, and optionally an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be interconnected by an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The buses may be classified as address buses, data buses, control buses, and so on. For ease of illustration, only one bi-directional arrow is shown in fig. 6, but this does not mean that there is only one bus or one type of bus.
The memory is used for storing a program. In particular, the program may include program code, and the program code includes computer operating instructions. The memory may include an internal memory and a non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs to form the text processing device on a logic level. The processor is used for executing the programs stored in the memory and is specifically used for executing the following operations:
acquiring a text to be processed;
for a first character in the text to be processed, determining whether a target character set contains the first character, wherein the first character is any character in the text to be processed, and the characters in the target character set can be correctly processed by a subsequent text processing task;
determining whether a second character exists in the target character set under the condition that the first character is not contained in the target character set, wherein the second character is a heteromorphic codeword of the first character in the target character set;
and replacing the first character in the text to be processed with the second character under the condition that the second character exists in the target character set.
Alternatively, the processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming the heteromorphic codeword determining apparatus at the logical level. The processor is used for executing the program stored in the memory, and is specifically used for executing the following operations:
acquiring the code of the fifth character and the code of the sixth character;
rendering a fifth picture displayed with the fifth character according to the code of the fifth character;
rendering a sixth picture displayed with the sixth character according to the coding of the sixth character;
determining the similarity of the fifth picture and the sixth picture;
and determining the fifth character and the sixth character as heteromorphic codewords under the condition that the similarity is greater than or equal to a preset threshold value.
The method disclosed in the embodiment shown in fig. 1 or fig. 5 of the present specification may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. The various methods, steps, and logic blocks disclosed in one or more embodiments of the present description may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with one or more embodiments of the present disclosure may be embodied directly in a hardware decoding processor or in a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
The electronic device may also execute the method provided in the embodiment shown in fig. 1 or fig. 5, which is not described in detail herein.
Of course, in addition to the software implementation, the electronic device in this specification does not exclude other implementations, such as a logic device or a combination of software and hardware, that is, the execution subject of the following process is not limited to each logic unit, but may also be hardware or a logic device.
The present description also proposes a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the method of the embodiment of fig. 1, and in particular to perform the operations of:
acquiring a text to be processed;
for a first character in the text to be processed, determining whether a target character set contains the first character, wherein the first character is any character in the text to be processed, and the characters in the target character set can be correctly processed by a subsequent text processing task;
determining whether a second character exists in the target character set under the condition that the first character is not contained in the target character set, wherein the second character is a heteromorphic codeword of the first character in the target character set;
and replacing the first character in the text to be processed with the second character under the condition that the second character exists in the target character set.
The present description also proposes a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the method of the embodiment of fig. 5, and in particular to perform the operations of:
acquiring the code of the fifth character and the code of the sixth character;
rendering a fifth picture displayed with the fifth character according to the code of the fifth character;
rendering a sixth picture displayed with the sixth character according to the coding of the sixth character;
determining the similarity of the fifth picture and the sixth picture;
and determining the fifth character and the sixth character as heteromorphic codewords under the condition that the similarity is greater than or equal to a preset threshold value.
The apparatus provided in the embodiments of the present specification will be described below.
As shown in fig. 7, an embodiment of the present specification provides a text processing apparatus 700, and in a software implementation, the apparatus 700 may include: a text acquisition module 701, a first determination module 702, a second determination module 703, and a character replacement module 704.
The text acquisition module 701 acquires a text to be processed.
The text to be processed may be text to be processed in any text processing environment, for example, the text to be processed may be text input in a retrieval task, or the text to be processed may be text to be encoded in machine learning, or the text to be processed may be initial text identified from a picture by using OCR technology, and so on.
The first determining module 702 determines, for a first character in the text to be processed, whether a target character set includes the first character, where the first character is any character in the text to be processed, and the characters in the target character set can be correctly processed by a subsequent text processing task.
The target character set is the conversion target for heteromorphic codewords outside the target character set: characters outside the target character set are converted into their heteromorphic codewords inside the target character set, so that they can be correctly processed by a subsequent text processing task.
As an example, the target character set is determined according to the application environment of the text to be processed, and in particular is selected from the original character set according to that application environment. The original character set may be any character set; as an example, it may be an international universal character set such as Unicode. When the original character set is an international universal character set, if the application environment of the text to be processed is a Chinese environment (or the text to be processed is a Chinese text), the target character set is a Chinese character set selected from the international universal character set; similarly, if the application environment is a Japanese environment (or the text to be processed is a Japanese text), the target character set is a Japanese character set selected from the international universal character set, and so on.
In general, when determining whether the target character set includes the first character, the method may specifically determine whether the target character set includes the code of the first character, if the target character set includes the code of the first character, it is determined that the target character set includes the first character, otherwise, it does not include the first character. Because the encoding of the character is uniquely determined, the encoding of the first character is used to determine whether the target character set contains the first character with greater accuracy.
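The selection of a target character set by application environment, and the membership test on the code of the first character, can be illustrated with Unicode code point ranges. The ranges below are the CJK Unified Ideographs, Hiragana and Katakana blocks; treating exactly these blocks as the "Chinese" or "Japanese" target set is an assumption made for this sketch, and a deployed system would likely also admit digits, punctuation and further blocks.

```python
from typing import Dict, List, Tuple

# Assumed target sets, expressed as inclusive code point ranges.
ENV_RANGES: Dict[str, List[Tuple[int, int]]] = {
    "zh": [(0x4E00, 0x9FFF)],                    # CJK Unified Ideographs
    "ja": [(0x3040, 0x309F),                     # Hiragana
           (0x30A0, 0x30FF),                     # Katakana
           (0x4E00, 0x9FFF)],                    # CJK Unified Ideographs
}


def in_target_set(ch: str, env: str) -> bool:
    """Decide membership by the character's code, not by its appearance."""
    code = ord(ch)
    return any(lo <= code <= hi for lo, hi in ENV_RANGES[env])


checks = (
    in_target_set("\u4e2d", "zh"),  # U+4E2D, a CJK ideograph
    in_target_set("\u3042", "zh"),  # U+3042, HIRAGANA LETTER A
    in_target_set("\u3042", "ja"),
    in_target_set("A", "ja"),
)
```

Testing the code rather than the rendered shape is what keeps this step exact: two characters that look alike still have different codes and fall on different sides of the test.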
In specific implementation, the characters in the text to be processed can be traversed one by one, with the currently traversed character taken as the first character to trigger the first determining module 702 to determine whether the target character set contains the first character. If so, the currently traversed character is skipped, the next character in the text to be processed is taken as the first character, and the first determining module 702 is triggered again; if not, the second determining module 703, described below, is triggered for the first character to determine whether a second character exists in the target character set.
Optionally, in the case that the target character set includes the first character, the first character is not processed, that is, the first character is skipped, the next character in the text to be processed is taken as the first character, and the first determining module 702 is triggered to determine whether the target character set includes the first character, so that the process is continuously circulated until the traversal is finished.
The second determining module 703 determines, if the target character set does not include the first character, whether a second character exists in the target character set, where the second character is a heteromorphic codeword of the first character in the target character set.
The character replacement module 704 replaces the first character in the text to be processed with the second character in the case that the second character exists in the target character set.
Optionally, in the case that the second character does not exist in the target character set, taking the next character in the text to be processed as the first character and triggering the first determining module 702 to determine whether the target character set contains the first character, so that the process is continuously circulated until the traversal is finished.
The second determining module 703 may determine whether the second character exists in the target character set in a variety of manners, and two manners are described below.
In the first manner, the second determining module 703 may obtain the code of the first character, and render a first picture according to the code of the first character; acquiring codes of characters in the target character set, and rendering a plurality of second pictures according to the codes of the characters in the target character set, wherein one second picture correspondingly displays one character in the target character set; then, the similarity of the first picture and the second pictures is respectively determined, and a plurality of similarities are obtained; then determining whether target similarity greater than or equal to a preset threshold exists in the plurality of similarities; if so, determining the character displayed by the second picture corresponding to the target similarity as the second character; if not, determining that the second character does not exist in the target character set.
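The first manner can be sketched as an on-the-fly search over the target character set. The flattened 3x3 bitmaps in TOY_GLYPHS are a hypothetical stand-in for rendered pictures, the threshold 0.9 is an assumed value, and returning the candidate with the highest similarity when several reach the threshold is a design choice of this sketch that the text does not prescribe.

```python
from typing import Dict, Iterable, Optional

# Hypothetical 3x3 pictures flattened to strings of '0'/'1'.
TOY_GLYPHS: Dict[str, str] = {
    "a": "111101111",
    "o": "010101010",
    "e": "111110111",
    "\u0430": "111101111",  # CYRILLIC SMALL LETTER A, drawn like Latin "a"
    "\u00b7": "000010000",  # MIDDLE DOT, resembles none of the above
}


def render(ch: str) -> str:
    return TOY_GLYPHS[ch]


def sim(p: str, q: str) -> float:
    """Fraction of positions on which two equal-length pictures agree."""
    return sum(a == b for a, b in zip(p, q)) / len(p)


def find_second_character(
    first: str, target_set: Iterable[str], threshold: float = 0.9
) -> Optional[str]:
    """Return the best look-alike of `first` in the target set, or None."""
    first_pic = render(first)                      # first picture
    best, best_score = None, -1.0
    for candidate in target_set:
        score = sim(first_pic, render(candidate))  # one second picture per candidate
        if score >= threshold and score > best_score:
            best, best_score = candidate, score
    return best


found = find_second_character("\u0430", "aoe")
missing = find_second_character("\u00b7", "aoe")
```

Every call renders and compares against the whole target set, which is why the text treats a precomputed mapping table (the second manner) as the more efficient option.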
In the second manner, the second determining module 703 may determine the isomorphic codeword mapping table corresponding to the target character set; then determining whether the heteromorphic codeword mapping table contains the first character; and finally, under the condition that the heteromorphic codeword mapping table contains the first character, determining the second character according to the heteromorphic codeword mapping table.
The heteromorphic codeword mapping table corresponding to the target character set refers to a heteromorphic codeword mapping table that takes the characters in the target character set as the primary key; that is, in the heteromorphic codeword mapping table, a character in the target character set uniquely identifies a record in the table. It can be appreciated that once the heteromorphic codeword mapping table corresponding to the target character set is available, whether the second character exists in the target character set can be quickly determined by querying the heteromorphic codeword mapping table.
The heteromorphic codeword mapping table corresponding to the target character set may also be determined in a variety of ways; two examples are given below.
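A table keyed by target-set characters can be illustrated as a dictionary, together with an inverted index so that the lookup for an arbitrary first character is a single dictionary access. The inverted index, the function names and the sample records are assumptions of this sketch; the text only specifies that target-set characters act as the primary key.

```python
from typing import Dict, List, Optional

# One record per target-set character (the primary key).
M: Dict[str, List[str]] = {
    "A": ["\u0391", "\u0410"],  # GREEK CAPITAL ALPHA, CYRILLIC CAPITAL A
    "O": ["\u039f", "\u041e"],  # GREEK CAPITAL OMICRON, CYRILLIC CAPITAL O
}


def invert(table: Dict[str, List[str]]) -> Dict[str, str]:
    """Build outside-character -> target-character for direct lookup."""
    index: Dict[str, str] = {}
    for target_char, lookalikes in table.items():
        for outside in lookalikes:
            index.setdefault(outside, target_char)  # keep the first record found
    return index


def lookup_second_character(first: str, index: Dict[str, str]) -> Optional[str]:
    """Return the second character for `first`, or None if none is recorded."""
    return index.get(first)


index = invert(M)
second_for_cyrillic_a = lookup_second_character("\u0410", index)
second_for_cyrillic_o = lookup_second_character("\u041e", index)
second_for_z = lookup_second_character("Z", index)
```

The index is built once, after which each query involves no rendering or picture comparison.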
For the first example, the second determining module 703 may select the target character set from the original character set; for a third character in the target character set, determine whether a fourth character having the same shape as the third character exists in a remaining character set, wherein the third character is any character in the target character set, and the remaining character set is a character set formed by all or part of the characters in the original character set other than the target character set; and, in the case that a fourth character having the same shape as the third character exists in the remaining character set, record the third character and the fourth character in correspondence to obtain the heteromorphic codeword mapping table corresponding to the target character set.
In specific implementation, the second determining module 703 may traverse the characters in the target character set one by one, taking the currently traversed character as the third character and performing the step of "determining whether a fourth character having the same shape as the third character exists in the remaining character set". If no such character exists, the currently traversed character is skipped, the next character in the target character set is taken as the third character, and the same step is performed again; if such a character exists, the step of "recording the third character and the fourth character in correspondence" is performed for the third character. This continues until the traversal is finished.
More specifically, the second determining module 703 may determine whether a fourth character having the same shape as the third character exists in the remaining character set through the following process: rendering a third picture displaying the third character according to the code of the third character; rendering, according to the codes of the characters in the remaining character set, a plurality of fourth pictures displaying the characters in the remaining character set, wherein each fourth picture displays one character in the remaining character set; determining the similarity of the third picture and each of the fourth pictures to obtain a plurality of similarities; determining whether a target similarity greater than or equal to a preset threshold exists among the plurality of similarities; and, in the case that the target similarity exists, determining that a fourth character having the same shape as the third character exists in the remaining character set, wherein the fourth character is the character displayed by the fourth picture corresponding to the target similarity.
Alternatively, in the case that the target similarity does not exist among the plurality of similarities, the second determining module 703 may determine that no fourth character having the same shape as the third character exists in the remaining character set.
It should be noted that two pictures displaying characters with different codes are rendered by the rendering system, and whether the shapes of the two characters are similar is then confirmed by calculating the similarity of the two pictures. This is an automatic way of determining heteromorphic codewords that requires no manual participation and is therefore more efficient. In addition, this way is not prone to misjudgment, so its accuracy is high; and it can find all the heteromorphic codewords in the character set without omission, so it is well worth popularizing.
For the second example, the second determining module 703 may select the target character set from the original character set; for a third character in the target character set, determine, by querying an existing homoglyph table of the original character set, whether a fourth character having the same shape as the third character exists in a remaining character set, wherein the third character is any character in the target character set, and the remaining character set is a character set formed by all or part of the characters in the original character set other than the target character set; and, in the case that a fourth character having the same shape as the third character exists in the remaining character set, record the third character and the fourth character in correspondence to obtain the heteromorphic codeword mapping table corresponding to the target character set.
According to the text processing device 700 provided in the embodiment of the present disclosure, a first character that is contained in the text to be processed but is not in the target character set can be replaced by the heteromorphic codeword of the first character in the target character set. Since the characters in the target character set can be correctly processed by the subsequent text processing task, the text to be processed can be prevented from containing heteromorphic codewords that are not in the target character set, which avoids the hidden dangers that such codewords pose to subsequent text-processing-related tasks, for example the problem that the subsequent text processing task cannot correctly process the text to be processed.
Optionally, the text processing device 700 shown in fig. 7 may further include: a task execution module, configured to execute a task related to text processing according to the replaced text to be processed. For example, the task related to text processing may include, but is not limited to, one or more of the following:
a retrieval task;
an input task;
a machine learning task;
a task of text recognition using optical character recognition (OCR), and so on.
It should be noted that, the text processing device 700 can implement a text processing method provided in fig. 1, and achieve the same technical effects, and the detailed contents may refer to the description of the method embodiment section above, and will not be repeated.
Alternatively, as shown in fig. 8, an embodiment of the present disclosure provides a heteromorphic codeword determining apparatus 800, and in a software implementation, the apparatus 800 may include: a code acquisition module 801, a first rendering module 802, a second rendering module 803, a third determination module 804, and a fourth determination module 805.
The code acquisition module 801 acquires the code of the fifth character and the code of the sixth character.
The first rendering module 802 renders a fifth picture displaying the fifth character according to the code of the fifth character.
The second rendering module 803 renders a sixth picture displaying the sixth character according to the code of the sixth character.
The third determining module 804 determines a similarity between the fifth picture and the sixth picture.
A fourth determining module 805 determines the fifth character and the sixth character as heteromorphic codewords if the similarity is greater than or equal to a preset threshold.
Optionally, the apparatus 800 may further include: a fifth determining module, configured to determine that the fifth character and the sixth character are not heteromorphic codewords when the similarity is smaller than the preset threshold.
It should be noted that two pictures displaying characters with different codes are rendered by the rendering system, and whether the shapes of the two characters are similar is then confirmed by calculating the similarity of the two pictures. This is an automatic way of determining heteromorphic codewords that requires no manual participation and is therefore more efficient. In addition, this way is not prone to misjudgment, so its accuracy is high; and it can find all the heteromorphic codewords in the character set without omission, so it is well worth popularizing.
It should be noted that, the heteromorphic codeword determining device 800 can implement a heteromorphic codeword determining method provided in fig. 5, and achieve the same technical effects, and the detailed description of the method embodiment section will be referred to above and will not be repeated.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.
In summary, the foregoing description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, or the like, which is within the spirit and principles of one or more embodiments of the present disclosure, is intended to be included within the scope of one or more embodiments of the present disclosure.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should be noted that the terms "first," "second," and the like in the description and in the claims are used to distinguish between similar objects and not necessarily to describe a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances, such that the embodiments of the application are capable of operation in sequences other than those illustrated or otherwise described herein. Objects distinguished by "first" and "second" are generally of one type, and their number is not limited; for example, there may be one or more first characters.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the description of the method embodiments.

Claims (19)

1. A text processing method, comprising:
acquiring a text to be processed;
for a first character in the text to be processed, determining whether a target character set contains the first character, wherein the first character is any character in the text to be processed, and the characters in the target character set can be correctly processed by a subsequent text processing task;
determining whether a second character exists in the target character set under the condition that the first character is not contained in the target character set, wherein the second character is a heteromorphic codeword of the first character in the target character set;
and replacing the first character in the text to be processed with the second character under the condition that the second character exists in the target character set.
2. The method of claim 1, the determining whether a target character set contains the first character comprising:
determining whether the target character set contains the first character according to the code of the first character.
3. The method of claim 1, the determining whether a second character exists in the target character set, comprising:
acquiring the code of the first character, and rendering a first picture according to the code of the first character;
acquiring codes of characters in the target character set, and rendering a plurality of second pictures according to the codes of the characters in the target character set, wherein one second picture correspondingly displays one character in the target character set;
respectively determining the similarity of the first picture and the plurality of second pictures to obtain a plurality of similarities;
determining whether target similarity greater than or equal to a preset threshold exists in the plurality of similarities;
and under the condition that the target similarity exists in the plurality of similarities, determining the character displayed by the second picture corresponding to the target similarity as the second character.
4. The method of claim 3, the determining whether a second character exists in the target character set, further comprising:
in the case that the target similarity does not exist among the plurality of similarities, determining that the second character does not exist in the target character set.
5. The method of claim 1, the determining whether a second character exists in the target character set, comprising:
determining a heteromorphic codeword mapping table corresponding to the target character set;
determining whether the heteromorphic codeword mapping table contains the first character;
and under the condition that the heteromorphic codeword mapping table contains the first character, determining the second character according to the heteromorphic codeword mapping table.
6. The method of claim 5, the determining a heteromorphic codeword mapping table corresponding to the target character set comprising:
selecting the target character set from an original character set;
for a third character in the target character set, determining whether a fourth character with the same shape as the third character exists in a remaining character set, wherein the third character is any character in the target character set, and the remaining character set is a character set formed by all or some of the characters in the original character set other than those in the target character set;
and under the condition that a fourth character with the same shape as the third character exists in the remaining character set, recording the third character and the fourth character in correspondence with each other, to obtain the heteromorphic codeword mapping table corresponding to the target character set.
7. The method of claim 6, the determining whether a fourth character of the same shape as the third character exists in the remaining character set, comprising:
rendering a third picture displaying the third character according to the code of the third character;
rendering, according to the codes of the characters in the remaining character set, a plurality of fourth pictures displaying the characters in the remaining character set, wherein each fourth picture displays one character in the remaining character set;
respectively determining the similarity of the third picture and the fourth pictures to obtain a plurality of similarities;
determining whether target similarity greater than or equal to a preset threshold exists in the plurality of similarities;
and under the condition that the target similarity exists in the plurality of similarities, determining that a fourth character with the same shape as the third character exists in the remaining character set, wherein the fourth character is the character displayed by the fourth picture corresponding to the target similarity.
8. The method of claim 7, the determining whether a fourth character of the same shape as the third character exists in the remaining character set, further comprising:
in the case that the target similarity does not exist in the plurality of similarities, determining that a fourth character having the same shape as the third character does not exist in the remaining character set.
9. The method according to any one of claims 6 to 8, wherein
the original character set is an arbitrary character set.
10. The method according to claim 9, wherein
the original character set is an international universal character set.
11. The method according to any one of claims 1 to 8 and 10, wherein
the target character set is determined according to the application environment of the text to be processed.
12. The method of any one of claims 1 to 8 and 10, further comprising:
executing a text processing task on the text to be processed after the replacement.
13. A method of heteromorphic codeword determination, comprising:
acquiring a code of a fifth character and a code of a sixth character;
rendering a fifth picture displaying the fifth character according to the code of the fifth character;
rendering a sixth picture displaying the sixth character according to the code of the sixth character;
determining the similarity of the fifth picture and the sixth picture;
and determining the fifth character and the sixth character as heteromorphic codewords under the condition that the similarity is greater than or equal to a preset threshold value.
14. A text processing apparatus, comprising:
a text acquisition module, configured to acquire a text to be processed;
a first determining module, configured to determine, for a first character in the text to be processed, whether a target character set contains the first character, wherein the first character is any character in the text to be processed, and the characters in the target character set can be correctly processed by a subsequent text processing task;
a second determining module, configured to determine, if the target character set does not include the first character, whether a second character exists in the target character set, where the second character is a heteromorphic codeword of the first character in the target character set;
and a character replacement module, configured to replace the first character in the text to be processed with the second character under the condition that the second character exists in the target character set.
15. A heteromorphic codeword determining apparatus, comprising:
a code acquisition module, configured to acquire a code of a fifth character and a code of a sixth character;
a first rendering module, configured to render a fifth picture displaying the fifth character according to the code of the fifth character;
a second rendering module, configured to render a sixth picture displaying the sixth character according to the code of the sixth character;
a third determining module, configured to determine the similarity of the fifth picture and the sixth picture;
and a fourth determining module, configured to determine the fifth character and the sixth character as heteromorphic codewords when the similarity is greater than or equal to a preset threshold.
16. An electronic device, comprising:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the following operations:
acquiring a text to be processed;
for a first character in the text to be processed, determining whether a target character set contains the first character, wherein the first character is any character in the text to be processed, and the characters in the target character set can be correctly processed by a subsequent text processing task;
determining whether a second character exists in the target character set under the condition that the first character is not contained in the target character set, wherein the second character is a heteromorphic codeword of the first character in the target character set;
and replacing the first character in the text to be processed with the second character under the condition that the second character exists in the target character set.
17. A computer-readable storage medium storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the following operations:
acquiring a text to be processed;
for a first character in the text to be processed, determining whether a target character set contains the first character, wherein the first character is any character in the text to be processed, and the characters in the target character set can be correctly processed by a subsequent text processing task;
determining whether a second character exists in the target character set under the condition that the first character is not contained in the target character set, wherein the second character is a heteromorphic codeword of the first character in the target character set;
and replacing the first character in the text to be processed with the second character under the condition that the second character exists in the target character set.
18. An electronic device, comprising:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the following operations:
acquiring a code of a fifth character and a code of a sixth character;
rendering a fifth picture displaying the fifth character according to the code of the fifth character;
rendering a sixth picture displaying the sixth character according to the code of the sixth character;
determining the similarity of the fifth picture and the sixth picture;
and determining the fifth character and the sixth character as heteromorphic codewords under the condition that the similarity is greater than or equal to a preset threshold value.
19. A computer-readable storage medium storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the following operations:
acquiring a code of a fifth character and a code of a sixth character;
rendering a fifth picture displaying the fifth character according to the code of the fifth character;
rendering a sixth picture displaying the sixth character according to the code of the sixth character;
determining the similarity of the fifth picture and the sixth picture;
and determining the fifth character and the sixth character as heteromorphic codewords under the condition that the similarity is greater than or equal to a preset threshold value.
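
The text processing flow recited in claims 1, 5 and 6 can be illustrated with the following minimal sketch. It is offered under stated assumptions and is not the claimed implementation: the function names, the `same_shape` callback (which stands in for whatever shape comparison is used, for example the picture similarity of claim 7), and the choice to leave a character unchanged when no second character exists are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Set


def build_mapping_table(target: Set[str],
                        remaining: Iterable[str],
                        same_shape: Callable[[str, str], bool]) -> Dict[str, str]:
    """Record, for each character outside the target set, a same-shaped
    character inside the target set (third/fourth characters of claim 6)."""
    remaining_list = [ch for ch in remaining if ch not in target]
    table: Dict[str, str] = {}
    for third in sorted(target):            # sorted only for deterministic results
        for fourth in remaining_list:
            if same_shape(third, fourth):
                table.setdefault(fourth, third)  # keep the first match found
    return table


def normalize(text: str, target: Set[str], table: Dict[str, str]) -> str:
    """Replace each first character that is outside the target set with its
    second character, when the mapping table provides one (claims 1 and 5)."""
    out = []
    for first in text:
        if first in target:
            out.append(first)                    # already processable as-is
        else:
            # Assumption: a character with no second character is kept unchanged.
            out.append(table.get(first, first))
    return "".join(out)
```

For instance, with a target character set containing the Latin letters "A" and "B", and a remaining character set containing the Greek capital letters alpha and beta, the table would map each Greek letter to its Latin look-alike, and `normalize` would rewrite a text containing the Greek letters so that a downstream task limited to the target character set can process it. Building the table once in advance avoids rendering pictures for every character at processing time.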
CN202310199456.4A 2023-02-24 2023-02-24 Text processing and heteromorphic code word determining method, device and equipment Pending CN116227432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310199456.4A CN116227432A (en) 2023-02-24 2023-02-24 Text processing and heteromorphic code word determining method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310199456.4A CN116227432A (en) 2023-02-24 2023-02-24 Text processing and heteromorphic code word determining method, device and equipment

Publications (1)

Publication Number Publication Date
CN116227432A (en) 2023-06-06

Family

ID=86578345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310199456.4A Pending CN116227432A (en) 2023-02-24 2023-02-24 Text processing and heteromorphic code word determining method, device and equipment

Country Status (1)

Country Link
CN (1) CN116227432A (en)

Similar Documents

Publication Publication Date Title
CN110362370B (en) Webpage language switching method and device and terminal equipment
CN109271611B (en) Data verification method and device and electronic equipment
CN105843800A (en) DOI-based language information display method and device
CN111898380A (en) Text matching method and device, electronic equipment and storage medium
CN110263050B (en) Data processing method, device, equipment and storage medium
CN112347512A (en) Image processing method, device, equipment and storage medium
CN112613513A (en) Image recognition method, device and system
CN111368902A (en) Data labeling method and device
CN116108150A (en) Intelligent question-answering method, device, system and electronic equipment
JP2019522847A (en) Method, device and terminal device for extracting data
CN107239209B (en) Photographing search method, device, terminal and storage medium
CN113064556A (en) BIOS data storage method, device, equipment and storage medium
CN115221523B (en) Data processing method, device and equipment
CN116227432A (en) Text processing and heteromorphic code word determining method, device and equipment
CN109388685B (en) Method and device for warehousing spatial data used by planning industry
CN110866085A (en) Data feedback method and device
CN110059563B (en) Text processing method and device
US20160350318A1 (en) Method, system for classifying comment record and webpage management device
CN110362790B (en) Font file processing method and device, electronic equipment and readable storage medium
CN114463068A (en) Data processing method and device
CN110018844B (en) Management method and device of decision triggering scheme and electronic equipment
CN114356912A (en) Method for writing data into database and computer equipment
CN110083576B (en) Cache directory identification method and device
CN108734149B (en) Text data scanning method and device
CN109190352B (en) Method and device for verifying accuracy of authorization text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination