CN114638218A - Symbol processing method, device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN114638218A
Authority
CN
China
Prior art keywords
symbol
character string
recognized
target coding
candidate
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210293122.9A
Other languages
Chinese (zh)
Inventor
许锴霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202210293122.9A
Publication of CN114638218A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The present disclosure relates to the field of computer technologies, and in particular to a symbol processing method and apparatus, an electronic device, and a storage medium. The symbol processing method provided by embodiments of the disclosure comprises: acquiring a character string to be recognized; determining the first target coding unit corresponding to the character string to be recognized and the number of target coding units corresponding to the character string to be recognized; determining a candidate symbol set from a preset dictionary, wherein the candidate symbol set comprises at least one candidate symbol, the first target coding unit corresponding to each candidate symbol is the same as the first target coding unit corresponding to the character string to be recognized, and the number of target coding units corresponding to each candidate symbol is not more than the number of target coding units corresponding to the character string to be recognized; and matching the candidate symbols in the candidate symbol set against the character string to be recognized, and determining a target symbol from the character string to be recognized based on the matching result.

Description

Symbol processing method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a symbol processing method and apparatus, an electronic device, and a storage medium.
Background
In related web development technologies, canvas rendering is used to render scenes with complex elements in order to improve page rendering performance, and the elements of a page must be typeset when they are rendered on a canvas.
When text is rendered on a canvas, operations such as typesetting and line breaking must take into account the font and the width of each character. In a web page, every character has a unique corresponding Unicode code point. However, some text, such as certain composite emoticons (emoji), may be composed of multiple individual characters, so a technique for recognizing such composite symbols is required. Accurate text layout on a canvas depends on correctly recognizing such a run of characters as one independent symbol; a recognition error leads to an erroneous layout result.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, according to one or more embodiments of the present disclosure, there is provided a symbol processing method, including:
acquiring a character string to be recognized, wherein the character string to be recognized corresponds to at least one target coding unit;
determining the first target coding unit corresponding to the character string to be recognized and the number of target coding units corresponding to the character string to be recognized;
determining a candidate symbol set from a preset dictionary, wherein the candidate symbol set comprises at least one candidate symbol; the first target coding unit corresponding to the candidate symbol is the same as the first target coding unit corresponding to the character string to be recognized, and the number of target coding units corresponding to the candidate symbol is not more than the number of target coding units corresponding to the character string to be recognized;
and matching the candidate symbols in the candidate symbol set against the character string to be recognized, and determining a target symbol from the character string to be recognized based on the matching result.
In a second aspect, according to one or more embodiments of the present disclosure, there is provided a symbol processing apparatus including:
a to-be-recognized character acquisition unit, configured to acquire a character string to be recognized, wherein the character string to be recognized corresponds to at least one target coding unit;
a to-be-recognized character determining unit, configured to determine the first target coding unit corresponding to the character string to be recognized and the number of target coding units corresponding to the character string to be recognized;
a candidate symbol determining unit, configured to determine a candidate symbol set from a preset dictionary, wherein the candidate symbol set comprises at least one candidate symbol; the first target coding unit corresponding to the candidate symbol is the same as the first target coding unit corresponding to the character string to be recognized, and the number of target coding units corresponding to the candidate symbol is not more than the number of target coding units corresponding to the character string to be recognized;
and a matching unit, configured to match the candidate symbols in the candidate symbol set against the character string to be recognized and to determine a target symbol from the character string to be recognized based on the matching result.
In a third aspect, according to one or more embodiments of the present disclosure, there is provided an electronic device including: at least one memory and at least one processor; wherein the memory is configured to store program code, and the processor is configured to call the program code stored in the memory to cause the electronic device to execute a symbol processing method provided according to one or more embodiments of the present disclosure.
In a fourth aspect, according to one or more embodiments of the present disclosure, there is provided a non-transitory computer storage medium storing program code which, when executed by a computer device, causes the computer device to perform a symbol processing method provided according to one or more embodiments of the present disclosure.
According to one or more embodiments of the disclosure, the first target coding unit corresponding to a character string to be recognized and the number of target coding units corresponding to the character string to be recognized are determined; candidate symbols whose first target coding unit is the same as that of the character string to be recognized and whose number of target coding units is not more than that of the character string to be recognized are determined from a preset dictionary; and each candidate symbol is matched against the character string to be recognized, with the target symbol determined from the character string to be recognized based on the matching result. In this way, the target symbol can be recognized from the character string efficiently and accurately, and problems such as abnormal web page layout and cursor position rendering errors caused by target symbol recognition errors can be avoided.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
FIG. 1 is a schematic view of combined emoticons;
FIG. 2 is a flowchart of a symbol processing method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a character string to be recognized, its first character, and a candidate symbol set according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a symbol recognition method according to another embodiment of the present disclosure;
FIG. 5 is a flowchart of a method for generating a preset dictionary according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a symbol processing apparatus according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more complete and thorough understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the steps recited in the embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Moreover, embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". The term "responsive to" and related terms mean that one signal or event is affected to some extent, but not necessarily completely or directly, by another signal or event. If an event x occurs "in response" to an event y, x may respond directly or indirectly to y. For example, the occurrence of y may ultimately result in the occurrence of x, but other intermediate events and/or conditions may exist. In other cases, y may not necessarily result in the occurrence of x, and x may occur even though y has not already occurred. Furthermore, the term "responsive to" may also mean "at least partially responsive to".
The term "determining" broadly encompasses a wide variety of actions, which can include obtaining, calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like, and can also include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like, as well as resolving, selecting, choosing, establishing and the like. Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art will understand that they mean "one or more" unless the context clearly dictates otherwise.
For the purposes of this disclosure, the phrase "A and/or B" means (A), (B), or (A and B).
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
To assist in understanding the various embodiments of the present disclosure, the following noun explanations are provided. Other terms and explanations may be given in light of the context of the present disclosure.
(1) Unicode: Unicode, also known as the universal code or unified code, is an industry standard in the field of computer science that provides a character coding scheme able to accommodate all the characters and symbols in the world.
(2) Unicode code point (Code point): the value or position of a character in the coded character set; any value in the Unicode code space, i.e., an integer in the range from 0 to 10FFFF (hexadecimal).
(3) Unicode scalar (Unicode Scalar): a Unicode scalar is a basic coding unit of Unicode, and a Unicode scalar value (Unicode Scalar Value) is any Unicode code point other than the high-surrogate and low-surrogate code points, i.e., an integer in the range 0 to D7FF or E000 to 10FFFF (hexadecimal, inclusive).
(4) Character (Character): basic coding unit of Unicode character coding.
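The ranges given in definitions (2) and (3) can be written as two simple checks. The following is only a minimal sketch; the function names `is_code_point` and `is_unicode_scalar` are hypothetical and do not appear in the disclosure.

```python
def is_code_point(value: int) -> bool:
    """Return True if value lies in the Unicode code space 0..10FFFF (hex)."""
    return 0x0 <= value <= 0x10FFFF


def is_unicode_scalar(value: int) -> bool:
    """Return True if value is a code point outside the surrogate range.

    The surrogate code points D800..DFFF (hex) are the only code points
    that are not scalar values, which leaves 0..D7FF and E000..10FFFF.
    """
    return is_code_point(value) and not (0xD800 <= value <= 0xDFFF)
```

For instance, the code point 1F469 quoted later in the description is a scalar value, whereas D800 is a code point but not a scalar value.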
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure.
An emoticon may be a single character (i.e., correspond to only one Unicode scalar) or may be composed of multiple characters (i.e., correspond to multiple Unicode scalars). Referring to FIG. 1, FIG. 1 shows a schematic view of several combined emoticons. The combined emoticon 11 is an emoji zero-width joiner sequence (Emoji ZWJ Sequence), i.e., a plurality of emoticons connected by zero-width joiners (ZWJ), which is displayed as a single emoticon on supporting platforms. The combined emoticon 11 is composed, in order, of a male emoticon, a zero-width joiner, a female emoticon, a zero-width joiner, a girl emoticon, a zero-width joiner, and a girl emoticon, and corresponds to 7 Unicode scalars in total. The zero-width joiner (ZWJ) is an invisible Unicode character that acts as glue, joining other characters together to create a new emoticon.
The combined emoticon 12 is also an emoji zero-width joiner sequence; it is composed, in order, of a female emoticon, a zero-width joiner, and a notebook computer emoticon, and corresponds to 3 Unicode scalars in total.
The combined emoticon 13 is an emoji modifier sequence (Emoji Modifier Sequence) composed, in order, of an OK gesture emoticon and a skin tone modifier, and corresponds to 2 Unicode scalars. When a modifier is placed after a supported base emoticon character, an emoji modifier sequence is formed automatically, producing a single emoticon with the given skin tone.
The combined emoticon 14 is composed, in order, of the digit character "1", Variation Selector-16, and the combining enclosing keycap character, and corresponds to 3 Unicode scalars. The variation selector is an invisible code point specifying that the preceding character should be displayed in emoticon form.
The combined emoticons 11, 12, 13, and 14 are each displayed as a single emoticon on supporting platforms.
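The scalar counts stated for the four combined emoticons can be checked with a short sketch. Only the sequence [1F469, 200D, 1F4BB] is given in the description; the other code points below, and in particular the choice of skin tone modifier 1F3FD, are assumptions made for illustration. The names `COMBINED` and `scalar_count` are hypothetical.

```python
ZWJ = "\u200D"  # zero-width joiner

# Keys are the reference numerals used for FIG. 1.
COMBINED = {
    # male, ZWJ, female, ZWJ, girl, ZWJ, girl (code points assumed)
    "11": "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467" + ZWJ + "\U0001F467",
    # female, ZWJ, notebook computer (sequence given in the description)
    "12": "\U0001F469" + ZWJ + "\U0001F4BB",
    # OK gesture + skin tone modifier (which modifier is an assumption)
    "13": "\U0001F44C" + "\U0001F3FD",
    # digit "1", Variation Selector-16, combining enclosing keycap
    "14": "1" + "\uFE0F" + "\u20E3",
}


def scalar_count(symbol: str) -> int:
    """Count Unicode scalars; a Python str is indexed by code point."""
    return len(symbol)


for numeral, symbol in COMBINED.items():
    print(numeral, scalar_count(symbol))
```

Each entry renders as one emoticon on a supporting platform, yet its scalar count is 7, 3, 2, and 3 respectively, which is exactly the mismatch the method has to resolve.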
It can be seen that a single combined symbol is actually formed by combining a plurality of independent symbols. If the combined symbol is not correctly recognized as a single symbol, an abnormal character width is obtained when the canvas is rendered and typeset, so that the page layout becomes disordered and unattractive, and large blank areas or overlapping characters may even appear. In addition, when a canvas is used to render content, the position of the rendered cursor must be considered; if the combined symbol cannot be accurately recognized, cursor movement becomes abnormal. For example, the combined emoticon 11 is composed of 7 Unicode scalars and has a character length of 7, but if these 7 Unicode scalars are not recognized as a single symbol, the cursor will be abnormally placed behind the 7 characters, which seriously affects the usability of the web page.
Referring to fig. 2, fig. 2 shows a flowchart of a symbol processing method 100 provided by an embodiment of the present disclosure, where the method 100 includes:
Step S120: acquiring a character string to be recognized, wherein the character string to be recognized corresponds to at least one target coding unit.
In some embodiments, the character string to be recognized may be a character string input by a user. Illustratively, a user may enter a segment of text that may include chinese characters, symbols, letters, numbers, emoticons, and the like.
In some embodiments, the target coding unit may include a Unicode scalar or character, or a basic coding unit under other coding standards, which is not limited herein.
Step S140: determining the first target coding unit corresponding to the character string to be recognized and the number of target coding units corresponding to the character string to be recognized.
Step S160: determining a candidate symbol set from a preset dictionary, wherein the candidate symbol set comprises at least one candidate symbol; the first target coding unit corresponding to the candidate symbol is the same as the first target coding unit corresponding to the character string to be recognized, and the number of target coding units corresponding to the candidate symbol is not more than the number of target coding units corresponding to the character string to be recognized.
Step S180: matching the candidate symbols in the candidate symbol set against the character string to be recognized, and determining a target symbol from the character string to be recognized based on the matching result.
In some embodiments, the target symbols are combined symbols, one combined symbol comprising at least two target coding units.
In some embodiments, the combined symbol comprises a combined emoticon.
The following is an exemplary description using a combined emoticon. Assume that the character string to be recognized is the character string 30 shown in FIG. 3, which consists of the combined emoticon 12 shown in FIG. 1 followed by the text "abc" and corresponds to 6 Unicode scalars in total (of which the combined emoticon 12 accounts for 3). It is determined that the first Unicode scalar or character 121 corresponding to the character string 30 is the female emoticon (whose Unicode code point is 1F469), and that the number of Unicode scalars corresponding to the character string 30, i.e., its length, is 6. Then, if P symbols are found in the preset dictionary that have the female emoticon 121 (Unicode code point 1F469) as their first Unicode scalar and correspond to no more than 6 characters or Unicode scalars (i.e., have a length less than or equal to 6), these P symbols are taken as candidate symbols to form a candidate symbol set. FIG. 3 schematically shows a candidate symbol set 31 corresponding to the character string 30, which includes several candidate symbols. Finally, the candidate symbols in the candidate symbol set 31 are matched against the character string 30, and the target symbol, namely the combined emoticon 12, is determined from the character string to be recognized based on the matching result.
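The selection of candidate symbols in this example (steps S140 and S160) can be sketched as a filter over the dictionary entries. This is a minimal illustration, not the implementation of the disclosure: the preset dictionary is modelled here as a flat list of code point sequences, and every entry other than [1F469, 200D, 1F4BB] is an assumed example. The names `PRESET_DICTIONARY`, `to_scalars`, and `candidate_set` are hypothetical.

```python
# Assumed dictionary content; only the first entry comes from the description.
PRESET_DICTIONARY = [
    [0x1F469, 0x200D, 0x1F4BB],                                    # 3 scalars
    [0x1F469, 0x200D, 0x1F469, 0x200D, 0x1F467, 0x200D, 0x1F467],  # 7 scalars
    [0x1F469, 0x1F3FD],                                            # 2 scalars
    [0x1F468, 0x200D, 0x1F4BB],                                    # other first scalar
]


def to_scalars(text: str) -> list[int]:
    """Return the Unicode scalars of text as integers."""
    return [ord(ch) for ch in text]


def candidate_set(text: str, dictionary: list[list[int]]) -> list[list[int]]:
    """Keep symbols sharing the first scalar of text and no longer than text."""
    scalars = to_scalars(text)
    if not scalars:
        return []
    first, count = scalars[0], len(scalars)  # step S140
    return [s for s in dictionary if s[0] == first and len(s) <= count]  # step S160


string_30 = "\U0001F469\u200D\U0001F4BB" + "abc"  # 6 scalars in total
candidates = candidate_set(string_30, PRESET_DICTIONARY)
```

With these assumed entries, the 7-scalar sequence is excluded because it is longer than the 6-scalar input, and the last entry is excluded because its first scalar is 1F468 rather than 1F469.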
It should be noted that the candidate symbols may be represented in the form of combined symbols, characters, and/or Unicode code point sequences, or in other equivalent forms, such as descriptions of the symbols; this embodiment is not limited in this respect. For example, referring to FIG. 1, the combined emoticon 12 is composed, in order, of a female emoticon, a zero-width joiner, and a notebook computer emoticon; the corresponding Unicode code point sequence is [1F469, 200D, 1F4BB], and its description under Unicode is "woman technologist".
In this way, according to the symbol recognition method provided by one or more embodiments of the present disclosure, the first target coding unit corresponding to the character string to be recognized and the number of target coding units corresponding to the character string to be recognized are determined; candidate symbols whose first target coding unit is the same as that of the character string to be recognized and whose number of target coding units is not more than that of the character string to be recognized are determined from a preset dictionary; and each candidate symbol is matched against the character string to be recognized, with the target symbol determined from the character string to be recognized based on the matching result. The target symbol can thus be recognized from the character string efficiently and accurately, and problems such as abnormal web page layout and cursor position rendering errors caused by target symbol recognition errors can be avoided.
In some embodiments, the preset dictionary is dedicated to recording symbols of the same type as the target symbol. A combined symbol is taken below as an example of the target symbol. Each keyword of the preset dictionary is a target coding unit, and the value corresponding to a keyword is at least one combined symbol that begins with that keyword. In this embodiment, after the first target coding unit of the character string to be recognized is obtained, it is used as a keyword to query the preset dictionary, so that all combined symbols beginning with that target coding unit can be retrieved. For example, a keyword and its corresponding value may be stored in the dictionary as a key-value pair (Key-Value); if Key A is the target coding unit A, then the first target coding unit of every symbol included in Value A corresponding to Key A is the target coding unit A. If the first target coding unit of the character string to be recognized is not a key of the dictionary, it can be determined that the first symbol of the character string to be recognized is not a combined symbol. In this way, by recording combined symbols in a dedicated dictionary, whether the first symbol of the character string to be recognized may be a combined symbol can be judged in O(1) time. Those skilled in the art will appreciate that the reliability of this judgment depends on how comprehensively the dictionary covers the relevant keywords.
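The key-value organisation described above can be sketched with an ordinary hash map, which gives the constant-time membership test on average. This is an assumed illustration in which symbols are stored as strings; `build_index` and `may_start_with_combined_symbol` are hypothetical names, and the two sample symbols are chosen only for demonstration.

```python
from collections import defaultdict


def build_index(symbols: list[str]) -> dict[str, list[str]]:
    """Group combined symbols by their first target coding unit."""
    index: dict[str, list[str]] = defaultdict(list)
    for symbol in symbols:
        index[symbol[0]].append(symbol)  # key: first unit, value: symbols
    return dict(index)


def may_start_with_combined_symbol(text: str, index: dict[str, list[str]]) -> bool:
    """Average O(1) check: is the first unit of text a key of the dictionary?"""
    return bool(text) and text[0] in index


index = build_index(["\U0001F469\u200D\U0001F4BB", "1\uFE0F\u20E3"])
```

A `True` result only indicates a possibility: the text "1abc" shares its first unit with the keycap sequence but does not contain it, so the matching step is still required. A `False` result rules out a combined symbol at the start of the string, provided the dictionary is complete.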
In some embodiments, step S180 includes:
Step A1: splitting the target symbol from the character string to be recognized.
In this embodiment, by splitting the determined target symbol from the character string to be recognized, the subsequent processing related to the target symbol, such as web page layout, cursor position positioning, and the like, can be facilitated.
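Step A1 can be sketched as separating the recognized symbol from the remainder of the string. The function name `split_target_symbol` and the choice to raise an error when the string does not begin with the symbol are assumptions made for illustration.

```python
def split_target_symbol(text: str, target: str) -> tuple[str, str]:
    """Split text into the leading target symbol and the remaining text."""
    if not target or not text.startswith(target):
        raise ValueError("text does not begin with the target symbol")
    return target, text[len(target):]


symbol, rest = split_target_symbol(
    "\U0001F469\u200D\U0001F4BB" + "abc",  # combined emoticon 12 followed by "abc"
    "\U0001F469\u200D\U0001F4BB",
)
```

After the split, later processing such as width measurement or cursor placement can treat `symbol` as one unit even though it spans 3 Unicode scalars, and continue with `rest`.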
In some embodiments, before performing step S160, the method 100 further comprises the steps of: and determining that the target coding unit corresponding to the character string to be identified comprises at least one emoticon. In this embodiment, only when it is determined that the target coding unit corresponding to the character string to be recognized includes at least one emoticon, the target symbols are matched based on the preset dictionary, so that the preset dictionary can be effectively utilized, and the calculation resources are saved.
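The gating check can be sketched as a cheap scan performed before any dictionary lookup. The disclosure does not say how an emoticon is detected, so the predicate below is purely an assumption: it treats the code point range 1F300 to 1FAFF as "emoticon", which is a rough approximation and not the Unicode Emoji property. Both function names are hypothetical.

```python
def looks_like_emoticon(code_point: int) -> bool:
    """Rough, assumed test; a real system would use proper emoji data."""
    return 0x1F300 <= code_point <= 0x1FAFF


def should_consult_dictionary(text: str) -> bool:
    """Return True only if at least one coding unit looks like an emoticon."""
    return any(looks_like_emoticon(ord(ch)) for ch in text)
```

Plain text such as "abc" is then skipped entirely, while a string containing the female emoticon 1F469 proceeds to step S160.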
In some embodiments, step S180 includes:
Step B1: matching the candidate symbols in the candidate symbol set against the character string to be recognized one by one, in descending order of their numbers of corresponding target coding units.
The following is an exemplary description taking combined emoticons as an example. Different combined emoticons may share one or more identical target coding units. For example, assume that the combined emoticon A is composed, in order, of a male emoticon, a zero-width joiner, a female emoticon, a zero-width joiner, a girl emoticon, a zero-width joiner, and a girl emoticon, and that the combined emoticon B is composed, in order, of a male emoticon, a zero-width joiner, and a female emoticon; the first 3 target coding units of the combined emoticon A and the combined emoticon B are then identical. If it has only been determined that the combined emoticon B is identical to the first 3 target coding units of the character string to be recognized, it cannot yet be concluded that the first symbol of the character string to be recognized is the combined emoticon B; it must still be determined whether the combined emoticon A matches the first 7 target coding units of the character string to be recognized. Conversely, if it is determined that the longer combined emoticon A matches the first 7 target coding units of the character string to be recognized, there is no need to check whether the shorter combined emoticon B matches. Therefore, in this embodiment, the candidate symbols in the candidate symbol set are matched against the character string to be recognized one by one in descending order of their numbers of corresponding target coding units, which improves the recognition speed of the target symbol.
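The longest-first order of step B1 can be sketched as follows, using the combined emoticons A and B from the example. The code points chosen for the male, female, and girl emoticons are assumptions, and `match_longest_first` is a hypothetical name.

```python
from typing import Optional

ZWJ = "\u200D"
MALE, FEMALE, GIRL = "\U0001F468", "\U0001F469", "\U0001F467"  # assumed code points

EMOTICON_A = MALE + ZWJ + FEMALE + ZWJ + GIRL + ZWJ + GIRL  # 7 coding units
EMOTICON_B = MALE + ZWJ + FEMALE                            # 3 coding units


def match_longest_first(text: str, candidates: list[str]) -> Optional[str]:
    """Try candidates from longest to shortest; the first hit is the result."""
    for candidate in sorted(candidates, key=len, reverse=True):
        if text.startswith(candidate):
            return candidate  # shorter candidates need not be checked
    return None


result_long = match_longest_first(EMOTICON_A + "!", [EMOTICON_B, EMOTICON_A])
result_short = match_longest_first(EMOTICON_B + "!", [EMOTICON_B, EMOTICON_A])
```

Because A is tried before B, a string beginning with A is never misread as B followed by leftover joiners and emoticons, and the search stops at the first match.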
In some embodiments, step S180 includes the following matching steps:
Step C10: determining the value M of the Nth-largest number of target coding units among the candidate symbols in the candidate symbol set, and intercepting the first M target coding units from the character string to be recognized as a first character string to be recognized, which is matched against the candidate symbols in the candidate symbol set; wherein the initial value of N is 1;
Step C21: if a first candidate symbol identical to the first character string to be recognized exists in the candidate symbol set, taking the first candidate symbol as the matching result;
Step C22: if no first candidate symbol identical to the first character string to be recognized exists in the candidate symbol set, adding 1 to the value of N and repeating the matching step.
An exemplary description is provided below. Assume that the number of target coding units corresponding to the character string to be recognized is 7 (i.e., its length is 7), and that the candidate symbols have the following 3 different numbers of corresponding target coding units: 5, 3, and 2. When N is 1, the largest number of target coding units among the candidate symbols (i.e., that of the longest candidate symbols) is 5 (i.e., M is 5), so the first 5 target coding units are intercepted from the character string to be recognized as a first character string a to be recognized and matched against all candidate symbols whose number of target coding units is 5 (i.e., the longest candidate symbols). If a candidate symbol A identical to the first character string a exists, the candidate symbol A is taken as the matching result. If not, 1 is added to N and the matching step is repeated: the second-largest number of target coding units (i.e., the second-largest length) in the candidate symbol set is determined to be 3 (now N is 2 and M is 3), and the first 3 target coding units are intercepted from the character string to be recognized as a first character string b to be recognized and matched against all candidate symbols whose number of target coding units is 3 (i.e., whose length is 3). If a candidate symbol B identical to the first character string b exists, the candidate symbol B is taken as the matching result.
If not, 1 is again added to N and the matching step is repeated: the third-largest number of target coding units (i.e., the third-largest length) in the candidate symbol set is determined to be 2 (now N is 3 and M is 2), and the first 2 target coding units are intercepted from the character string to be recognized as a first character string c to be recognized and matched against all candidate symbols whose number of target coding units is 2 (i.e., whose length is 2). If a candidate symbol C identical to the first character string c exists, the candidate symbol C is taken as the matching result; if not, it is determined that the first symbol of the character string to be recognized is not a target symbol.
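Steps C10 to C22 and the worked example with lengths 5, 3, and 2 can be sketched as a loop over the distinct candidate lengths. This is a minimal illustration under assumptions: plain letters stand in for target coding units, the candidates are grouped by length in sets so that each comparison is a lookup, and `match_by_length_rank` is a hypothetical name.

```python
from typing import Optional


def match_by_length_rank(text: str, candidates: list[str]) -> Optional[str]:
    """Match prefixes of text against candidates, longest length first."""
    by_length: dict[int, set[str]] = {}
    for candidate in candidates:
        by_length.setdefault(len(candidate), set()).add(candidate)

    # The N-th pass of the loop uses the N-th largest length as M.
    for m in sorted(by_length, reverse=True):
        prefix = text[:m]              # step C10: intercept the first M units
        if prefix in by_length[m]:     # step C21: an identical candidate exists
            return prefix
        # step C22: no match at this length, move on to the next N
    return None                        # first symbol is not a target symbol


text = "abcdefg"  # 7 units, as in the example
match_a = match_by_length_rank(text, ["abcxx", "abz", "abc", "ab"])
match_b = match_by_length_rank(text, ["abcxx", "abz", "ab"])
match_c = match_by_length_rank(text, ["abcxx", "abz", "ax"])
```

In the first call nothing of length 5 equals "abcde", so the search falls through to length 3 and finds "abc"; in the second it falls through to length 2; in the third no length matches and `None` is returned.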
It should be noted that, in this embodiment, the first character string to be recognized being identical to a candidate symbol means that the number, code points, and relative order of the target coding units are all the same. For example, if the Unicode code point sequence corresponding to the first character string to be recognized is [1F469, 200D, 1F4BB], the Unicode code point sequence corresponding to an identical candidate symbol must also be [1F469, 200D, 1F4BB].
In some embodiments, a keyword of the preset dictionary is a target coding unit, and the value corresponding to the keyword is at least one symbol whose first target coding unit is the keyword. Thus, by looking up a keyword, the set of symbols that begin with that target coding unit can be obtained. In this embodiment, after the first target coding unit of the character string to be recognized is obtained, it is used as a keyword to query the preset dictionary, so that all symbols beginning with that target coding unit can be retrieved. For example, the keywords and their corresponding values may be stored in the dictionary as key-value pairs: if Key a is target coding unit a, then every symbol included in Value a corresponding to Key a has target coding unit a as its first target coding unit.
It should be noted that, in one or more embodiments of the present disclosure, a target coding unit used as a keyword in the dictionary may be represented as a character, as a Unicode code point, or in another equivalent form, such as a description of the target coding unit; the symbols stored in the dictionary as the values corresponding to the keywords may be represented as combined symbols, characters, and/or Unicode code point sequences, or in another equivalent form, such as descriptions of the symbols, and this embodiment is not limited in this respect. For example, referring to FIG. 1, the combined emoticon 12 is composed of, in order, a female emoticon, a zero-width joiner, and a notebook computer emoticon; the corresponding Unicode code point sequence is [1F469, 200D, 1F4BB], and its description under Unicode is "woman technologist".
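The key-value layout described above can be sketched in Python as follows. This is a minimal illustration under assumptions: symbols are represented as tuples of Unicode code points, and the names `build_first_unit_index`, `candidates_for`, and `SYMBOLS` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

CodeSeq = Tuple[int, ...]

# A small illustrative symbol list, not a complete table.
SYMBOLS: List[CodeSeq] = [
    (0x1F469, 0x200D, 0x1F4BB),  # woman technologist
    (0x1F469, 0x200D, 0x1F52C),  # woman scientist
    (0x1F468, 0x200D, 0x1F4BB),  # man technologist
]


def build_first_unit_index(symbols: Sequence[CodeSeq]) -> Dict[int, List[CodeSeq]]:
    """Map each first target coding unit to the symbols that begin with it."""
    index: Dict[int, List[CodeSeq]] = defaultdict(list)
    for seq in symbols:
        index[seq[0]].append(seq)
    return dict(index)


def candidates_for(index: Dict[int, List[CodeSeq]],
                   units: Sequence[int]) -> List[CodeSeq]:
    """Symbols sharing the first unit of `units` and no longer than `units`."""
    if not units:
        return []
    return [seq for seq in index.get(units[0], []) if len(seq) <= len(units)]


index = build_first_unit_index(SYMBOLS)
```

A single lookup by the first coding unit narrows the search to the symbols that could possibly match, and the length filter removes those that cannot fit in the string.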
In some embodiments, before performing step S160, the method 100 further comprises:
step S150: determining whether the first target coding unit is a first coding unit; if the first target coding unit is not the first coding unit, performing step S160; and if the first target coding unit is the first coding unit, determining that the starting symbol of the character string to be identified is not the target symbol.
In this embodiment, whether the first target coding unit is a first coding unit is determined in advance, and if so, it is determined that the starting symbol of the character string to be recognized is not the target symbol. In this way, the possibility that the first symbol of the character string to be recognized is the target symbol can be assessed before any dictionary query, and when the first target coding unit is a first coding unit (i.e., it cannot begin a target symbol), the current determination process can be ended early.
In some embodiments, the target symbol is a combined symbol, and the first coding unit is a target coding unit used only for single-character symbols, wherein a single-character symbol consists of one target coding unit; the step S150 includes: determining whether the first target coding unit is a first coding unit based on a preset first coding unit set; wherein the first coding unit set comprises at least one first coding unit.
In this embodiment, all target coding units that are used only for single-character symbols are placed in a set, and whether the first target coding unit of the character string to be recognized belongs to the set is determined. Whether the first symbol of the character string to be recognized is a single-character symbol can therefore be determined in O(1) time, which improves query efficiency.
The following is an exemplary description taking an emoticon as an example. Some emoji characters can be used independently as separate emoticons or can be used as part of other combined emoticons, for example, the female emoticon shown in fig. 1 can be used as a separate emoticon or as the first character of the combined emoticon 12. However, there are some emoticons that are used only as a separate emoticon (i.e., a single character symbol) and cannot be used for other combination characters. In this regard, according to one or more embodiments of the present disclosure, when it is determined that the first character of the character string to be recognized is only for an individual emoticon, it may be determined that the character is unlikely to be the starting character of the combined emoticon, and thus it is determined that the first character of the character string to be recognized is unlikely to be the combined emoticon, so that the determination process may be ended in advance.
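The early-exit check can be sketched with a hash-based set, for which membership tests take constant time on average. The helper names and the sample code points below are assumptions for illustration; in particular, the set is derived from a toy symbol list and is not a complete list of single-only emoji.

```python
from typing import FrozenSet, Iterable, Tuple

CodeSeq = Tuple[int, ...]


def build_single_only_set(single_symbols: Iterable[int],
                          combined_symbols: Iterable[CodeSeq]) -> FrozenSet[int]:
    """Coding units that occur as single-character symbols but never begin a
    combined symbol in the given (illustrative) data."""
    combined_starts = {seq[0] for seq in combined_symbols}
    return frozenset(u for u in single_symbols if u not in combined_starts)


def may_start_combined_symbol(first_unit: int,
                              single_only: FrozenSet[int]) -> bool:
    """Average O(1) membership test corresponding to step S150."""
    return first_unit not in single_only


# Toy data: U+1F469 also begins a combined symbol, U+1F600 does not (here).
single_only = build_single_only_set(
    single_symbols=[0x1F469, 0x1F600],
    combined_symbols=[(0x1F469, 0x200D, 0x1F4BB)],
)
```

When `may_start_combined_symbol` returns `False`, the dictionary query and the matching loop can be skipped entirely for that position.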
Referring to fig. 4, fig. 4 shows that the symbol recognition method 400 provided according to an embodiment of the present disclosure includes:
step S401: acquiring a character string to be identified, wherein the character string to be identified corresponds to at least one target coding unit;
step S402: determining the first target coding unit corresponding to the character string to be recognized and the number of target coding units corresponding to the character string to be recognized;
step S403: judging whether the first target coding unit may correspond to a target symbol; if it may correspond to a target symbol, executing step S404; if it cannot correspond to a target symbol, executing step S409;
step S404: determining a candidate symbol set from a preset dictionary, wherein the candidate symbol set comprises at least one candidate symbol; the first target coding unit corresponding to the candidate symbol is the same as the first target coding unit corresponding to the character string to be identified, and the number of the target coding units corresponding to the candidate symbol is not more than the number of the target coding units corresponding to the character string to be identified;
step S405: judging whether the candidate symbol set is empty or not; if not, executing step S406; if the judgment result is null, executing the step S409;
step S406: determining a numerical value M, where M is the Nth largest number of target coding units among the candidate symbols in the candidate symbol set; intercepting the first M target coding units from the character string to be recognized as a first character string to be recognized; and taking from the candidate symbol set all candidate symbols whose number of target coding units is M, to be matched with the first character string to be recognized; the initial value of N is 1;
step S407: judging whether any selected candidate symbol matches the first character string to be recognized; if yes, executing step S408; if not, returning to step S405 and adding one to the value of N the next time step S406 is executed, and so on, until step S407 finds a match or step S405 determines that the candidate symbol set is empty;
step S408: determining a target symbol from the character string to be recognized based on a matching result;
step S409: determining that the first symbol of the character string to be recognized is not the target symbol.
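Steps S401 to S409 can be strung together in a short Python sketch. It assumes that a target coding unit is a Unicode code point (so a Python `str` can be indexed directly), that the preset dictionary maps a first code point to a list of code point tuples, and that the function and parameter names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

CodeSeq = Tuple[int, ...]


def recognize_first_symbol(text: str,
                           dictionary: Dict[int, List[CodeSeq]],
                           single_only: FrozenSet[int]) -> Optional[str]:
    """Return the target symbol at the start of `text`, or None (S409)."""
    # S401 / S402: obtain the units, the first unit and the unit count.
    units = [ord(ch) for ch in text]
    if not units:
        return None
    first, count = units[0], len(units)

    # S403: a unit used only for single-character symbols cannot start a
    # combined symbol, so stop early.
    if first in single_only:
        return None

    # S404: same first unit, and no longer than the string itself.
    candidates = [c for c in dictionary.get(first, []) if len(c) <= count]

    # S405 to S407: try the distinct lengths from largest to smallest.
    # An empty candidate set yields no lengths and falls through to S409.
    for m in sorted({len(c) for c in candidates}, reverse=True):
        prefix = tuple(units[:m])
        if any(c == prefix for c in candidates if len(c) == m):
            return text[:m]  # S408: the matched target symbol
    return None  # S409


DICTIONARY = {0x1F469: [(0x1F469, 0x200D, 0x1F4BB)]}  # illustrative only
SINGLE_ONLY = frozenset({0x1F600})                     # illustrative only
```

For instance, calling `recognize_first_symbol` on a string that begins with the code points 1F469, 200D, 1F4BB returns that three-code-point prefix, while a string that begins with 1F469 followed by ordinary letters returns `None`.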
In some embodiments, the preset dictionary is generated by:
step D1: acquiring mapping relation information between symbols and target coding units;
step D2: generating, based on the mapping relation information, a dictionary corresponding to each type according to the type of each symbol.
In some embodiments, the type of a symbol may include, but is not limited to, a single-character symbol type, a combined symbol type, an emoji zero-width joiner (ZWJ) sequence type, an emoji modifier sequence type, and the like. Each type may correspond to its own dictionary, which improves the lookup efficiency of the dictionary.
Referring to fig. 5, fig. 5 shows that a method 500 for generating a preset dictionary according to an embodiment of the present disclosure includes:
step S501: reading an emoticon data file, wherein the emoticon data file comprises mapping relation information between the emoticons and a target coding unit;
step S502: judging whether unread mapping relation information exists in the data file or not; if yes, go to step S503; if not, go to step S505;
step S503: reading mapping relation information;
step S504: determining the type of the symbol, and storing the mapping relation information into a dictionary corresponding to the type based on the type of the symbol;
After step S504 is completed, steps S502 to S504 are repeated until all mapping relation information in the emoticon data file has been read.
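The read-classify-store loop of steps S501 to S504 might look like the following Python sketch. The line layout (`code points ; type ; description`, with `#` starting a comment) is an assumption modelled on the semicolon-separated style of the Unicode emoji sequence data files; the disclosure does not specify a file format. Code point ranges such as `231A..231B` are not handled, and the type labels are simply taken from the second field.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CodeSeq = Tuple[int, ...]

# Hypothetical excerpt standing in for the emoticon data file of step S501.
SAMPLE_DATA = """\
# code points ; type ; description
1F469 200D 1F4BB ; RGI_Emoji_ZWJ_Sequence ; woman technologist
1F44B 1F3FB ; RGI_Emoji_Modifier_Sequence ; waving hand: light skin tone
1F600 ; Basic_Emoji ; grinning face
"""


def build_dictionaries(lines: Iterable[str]) -> Dict[str, Dict[int, List[CodeSeq]]]:
    """Build one dictionary per symbol type, keyed by first code point."""
    dictionaries = defaultdict(lambda: defaultdict(list))
    for raw in lines:                          # S502 / S503: next entry
        line = raw.split("#", 1)[0].strip()    # drop comments and blanks
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue
        seq = tuple(int(cp, 16) for cp in fields[0].split())
        if not seq:
            continue
        symbol_type = fields[1]                # S504: determine the type
        dictionaries[symbol_type][seq[0]].append(seq)
    return {t: dict(d) for t, d in dictionaries.items()}


dictionaries = build_dictionaries(SAMPLE_DATA.splitlines())
```

Keeping a separate dictionary per type means that a lookup for combined symbols never has to scan entries for single-character symbols.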
Accordingly, as shown in fig. 6, there is provided a symbol processing apparatus 600 according to an embodiment of the present disclosure, including:
a to-be-recognized character obtaining unit 620, configured to obtain a to-be-recognized character string, where the to-be-recognized character string corresponds to at least one target encoding unit;
a to-be-recognized character determining unit 640, configured to determine the first target coding unit corresponding to the character string to be recognized and the number of target coding units corresponding to the character string to be recognized;
a candidate symbol determining unit 660, configured to determine a candidate symbol set from a preset dictionary, where the candidate symbol set includes at least one candidate symbol; the first target coding unit corresponding to the candidate symbol is the same as the first target coding unit corresponding to the character string to be identified, and the number of the target coding units corresponding to the candidate symbol is not more than the number of the target coding units corresponding to the character string to be identified;
a matching unit 680, configured to match the candidate symbols in the candidate symbol set with the character string to be recognized, and determine a target symbol from the character string to be recognized based on a matching result.
According to one or more embodiments of the disclosure, the first target coding unit corresponding to a character string to be recognized and the number of target coding units corresponding to the character string to be recognized are determined; candidate symbols whose first target coding unit is the same as that of the character string to be recognized, and whose number of target coding units is not more than that of the character string to be recognized, are determined from a preset dictionary; the candidate symbols are matched with the character string to be recognized; and the target symbol is determined from the character string to be recognized based on the matching result. The target symbol can thus be recognized from the character string efficiently and accurately, which in turn avoids problems such as abnormal web page layout and cursor position rendering errors caused by target symbol recognition errors.
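To illustrate why correct recognition matters for cursor positioning, the following Python sketch applies the matching repeatedly to split a whole string into symbols, so that a combined emoticon counts as one cursor stop. Scanning the entire string this way is an extrapolation for illustration; the function name `split_symbols` and the one-entry dictionary are assumptions.

```python
from typing import Dict, List, Tuple

CodeSeq = Tuple[int, ...]


def split_symbols(text: str, dictionary: Dict[int, List[CodeSeq]]) -> List[str]:
    """Split `text` into symbols, keeping each recognized combined symbol whole."""
    pieces: List[str] = []
    i = 0
    while i < len(text):
        remaining = len(text) - i
        # Candidates with the same first unit that fit, longest first.
        candidates = sorted(
            (c for c in dictionary.get(ord(text[i]), []) if len(c) <= remaining),
            key=len,
            reverse=True,
        )
        step = 1  # default: a single code point is its own symbol
        for cand in candidates:
            window = tuple(ord(ch) for ch in text[i:i + len(cand)])
            if window == cand:
                step = len(cand)
                break
        pieces.append(text[i:i + step])
        i += step
    return pieces


DICTIONARY = {0x1F469: [(0x1F469, 0x200D, 0x1F4BB)]}  # illustrative only
sample = "a\U0001F469\u200D\U0001F4BBb"
pieces = split_symbols(sample, DICTIONARY)
```

Here the sample string contains five code points but splits into three symbols, which is the count a cursor or layout routine would need.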
In some embodiments, the target symbols are combined symbols, one of the combined symbols comprising at least two target coding units.
In some embodiments, the predetermined dictionary is dedicated to recording symbols of the same type as the target symbol.
In some embodiments, the matching unit is further configured to split the target symbol from the string to be recognized.
In some embodiments, the combined symbol comprises a combined emoticon.
In some embodiments, the symbol processing apparatus further comprises:
a symbol determining unit, configured to determine that the target coding units corresponding to the character string to be recognized include at least one emoticon.
In some embodiments, the matching unit is configured to sequentially match the candidate symbols in the candidate symbol set with the character string to be recognized in an order from a greater number to a smaller number of the corresponding target coding units.
In some embodiments, the matching the candidate symbols in the candidate symbol set with the character string to be recognized includes the following matching steps:
determining a numerical value M, where M is the Nth largest number of target coding units among the candidate symbols in the candidate symbol set, and intercepting the first M target coding units from the character string to be recognized as a first character string to be recognized, to be matched with the candidate symbols in the candidate symbol set; wherein the initial value of N is 1;
if a first candidate symbol identical to the first character string to be recognized exists in the candidate symbol set, taking the first candidate symbol as the matching result;
if no first candidate symbol identical to the first character string to be recognized exists in the candidate symbol set, adding 1 to the value of N and repeating the matching step.
In some embodiments, a keyword of the preset dictionary is a target coding unit, and a value corresponding to the keyword is at least one symbol that takes the keyword as a first target coding unit.
In some embodiments, the symbol processing apparatus further comprises:
a first encoding determination unit for determining whether the first target coding unit is a first coding unit; the candidate symbol determining unit is configured to perform the step of determining a candidate symbol set from a preset dictionary if the first target encoding unit is not the first encoding unit;
a determining unit, configured to determine that the starting symbol of the character string to be recognized is not the target symbol if the first target coding unit is the first coding unit.
In some embodiments, the first coding unit is a target coding unit used only for single-character symbols, a single-character symbol consisting of one target coding unit; the determining whether the first target coding unit is the first coding unit comprises: determining whether the first target coding unit is a first coding unit based on a preset first coding unit set; wherein the first coding unit set comprises at least one of the first coding units.
In some embodiments, the target coding unit comprises a Unicode scalar or character.
In some embodiments, the symbol processing apparatus further includes a dictionary generating unit configured to obtain mapping relation information between symbols and target coding units, and to generate, based on the mapping relation information, a dictionary corresponding to each type according to the type of each symbol.
For the embodiments of the apparatus, since they correspond substantially to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described apparatus embodiments are merely illustrative, in that modules illustrated as separate modules may or may not be separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Accordingly, in accordance with one or more embodiments of the present disclosure, there is provided an electronic device including:
at least one memory and at least one processor;
the memory is used for storing program codes, and the processor is used for calling the program codes stored by the memory to enable the electronic equipment to execute the symbol processing method provided by one or more embodiments of the disclosure.
Accordingly, according to one or more embodiments of the present disclosure, there is provided a non-transitory computer storage medium storing program code executable by a computer device to cause the computer device to perform a symbol processing method provided according to one or more embodiments of the present disclosure.
Referring now to fig. 7, shown is a schematic block diagram of an electronic device (e.g., a terminal device or server) 800 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, an electronic device 800 may include a processing means (e.g., central processing unit, graphics processor, etc.) 801 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic apparatus 800 are also stored. The processing means 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, or the like; storage 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 800 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods of the present disclosure as described above.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Wherein the name of an element does not in some cases constitute a limitation on the element itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided a symbol processing method including: acquiring a character string to be recognized, wherein the character string to be recognized corresponds to at least one target coding unit; determining the first target coding unit corresponding to the character string to be recognized and the number of target coding units corresponding to the character string to be recognized; determining a candidate symbol set from a preset dictionary, wherein the candidate symbol set comprises at least one candidate symbol; the first target coding unit corresponding to the candidate symbol is the same as the first target coding unit corresponding to the character string to be recognized, and the number of target coding units corresponding to the candidate symbol is not more than the number of target coding units corresponding to the character string to be recognized; and matching the candidate symbols in the candidate symbol set with the character string to be recognized, and determining a target symbol from the character string to be recognized based on a matching result.
According to one or more embodiments of the present disclosure, the target symbols are combined symbols, one of the combined symbols including at least two target coding units.
According to one or more embodiments of the present disclosure, the predetermined dictionary is dedicated to recording symbols of the same type as the target symbol.
According to one or more embodiments of the present disclosure, determining a target symbol from the character string to be recognized includes: and splitting the target symbol from the character string to be recognized.
According to one or more embodiments of the present disclosure, the combination symbol includes a combination emoticon.
According to one or more embodiments of the present disclosure, before determining the candidate symbol set from the preset dictionary, the symbol processing method further includes: and determining that the target coding unit corresponding to the character string to be identified comprises at least one emoticon.
According to one or more embodiments of the present disclosure, the matching the candidate symbols in the candidate symbol set with the character string to be recognized includes: and matching the candidate symbols in the candidate symbol set with the character string to be recognized in sequence from more to less based on the number of the corresponding target coding units.
According to one or more embodiments of the present disclosure, the matching the candidate symbols in the candidate symbol set with the character string to be recognized includes the following matching step: determining a numerical value M, where M is the Nth largest number of target coding units among the candidate symbols in the candidate symbol set, and intercepting the first M target coding units from the character string to be recognized as a first character string to be recognized, to be matched with the candidate symbols in the candidate symbol set; wherein the initial value of N is 1; if a first candidate symbol identical to the first character string to be recognized exists in the candidate symbol set, taking the first candidate symbol as the matching result; and if no first candidate symbol identical to the first character string to be recognized exists in the candidate symbol set, adding 1 to the value of N and repeating the matching step.
According to one or more embodiments of the present disclosure, a keyword of the preset dictionary is a target coding unit, and a value corresponding to the keyword is at least one symbol which takes the keyword as a first target coding unit.
According to one or more embodiments of the present disclosure, the symbol processing method further includes: determining whether the first target coding unit is a first coding unit before the step of determining a candidate symbol set from a preset dictionary is performed; if the first target coding unit is not the first coding unit, executing the step of determining a candidate symbol set from a preset dictionary; and if the first target coding unit is the first coding unit, determining that the initial symbol of the character string to be recognized is not the target symbol.
According to one or more embodiments of the present disclosure, the first coding unit is a target coding unit only for a single character symbol, the single character symbol being composed of one target coding unit; the determining whether the first target coding unit is the first coding unit comprises: determining whether the first target coding unit is a first coding unit based on a preset first coding unit set; wherein the first set of coding units comprises at least one of the first coding units.
According to one or more embodiments of the present disclosure, the target coding unit includes a Unicode scalar or character.
According to one or more embodiments of the present disclosure, the preset dictionary is generated by: acquiring mapping relation information between symbols and target coding units; and generating a dictionary corresponding to the type according to the type of each symbol based on the mapping relation information.
According to one or more embodiments of the present disclosure, there is provided a symbol processing apparatus including: a to-be-recognized character obtaining unit, configured to obtain a character string to be recognized, wherein the character string to be recognized corresponds to at least one target coding unit; a to-be-recognized character determining unit, configured to determine the first target coding unit corresponding to the character string to be recognized and the number of target coding units corresponding to the character string to be recognized; a candidate symbol determining unit, configured to determine a candidate symbol set from a preset dictionary, wherein the candidate symbol set comprises at least one candidate symbol; the first target coding unit corresponding to the candidate symbol is the same as the first target coding unit corresponding to the character string to be recognized, and the number of target coding units corresponding to the candidate symbol is not more than the number of target coding units corresponding to the character string to be recognized; and a matching unit, configured to match the candidate symbols in the candidate symbol set with the character string to be recognized and determine a target symbol from the character string to be recognized based on the matching result.
According to one or more embodiments of the present disclosure, there is provided an electronic device including: at least one memory and at least one processor; wherein the memory is configured to store program code, and the processor is configured to call the program code stored in the memory to cause the electronic device to execute a symbol processing method provided according to one or more embodiments of the present disclosure.
According to one or more embodiments of the present disclosure, there is provided a non-transitory computer storage medium storing program code which, when executed by a computer device, causes the computer device to perform a symbol processing method provided according to one or more embodiments of the present disclosure.
The foregoing description is merely an illustration of preferred embodiments of the disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure. For example, a technical solution may be formed by replacing the above features with features having similar functions that are disclosed in (but not limited to) this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (16)

1. A method of symbol processing, comprising:
acquiring a character string to be recognized, wherein the character string to be recognized corresponds to at least one target coding unit;
determining the first target coding unit corresponding to the character string to be recognized and the number of target coding units corresponding to the character string to be recognized;
determining a candidate symbol set from a preset dictionary, wherein the candidate symbol set comprises at least one candidate symbol; the first target coding unit corresponding to the candidate symbol is the same as the first target coding unit corresponding to the character string to be recognized, and the number of target coding units corresponding to the candidate symbol is not more than the number of target coding units corresponding to the character string to be recognized;
and matching the candidate symbols in the candidate symbol set with the character string to be recognized, and determining a target symbol from the character string to be recognized based on a matching result.
2. The symbol processing method according to claim 1,
wherein the target symbol is a combined symbol, and one combined symbol comprises at least two target coding units.
3. The symbol processing method according to claim 1,
wherein the preset dictionary is dedicated to recording symbols of the same type as the target symbol.
4. The symbol processing method according to claim 1, wherein determining a target symbol from the string of characters to be recognized comprises:
splitting the target symbol from the character string to be recognized.
5. The symbol processing method according to claim 2,
wherein the combined symbol comprises a combined emoticon.
6. The symbol processing method according to claim 1, further comprising, before determining the set of candidate symbols from a preset dictionary:
determining that the target coding units corresponding to the character string to be recognized comprise at least one emoticon.
7. The symbol processing method according to claim 1, wherein said matching candidate symbols in the candidate symbol set with the character string to be recognized comprises:
matching the candidate symbols in the candidate symbol set with the character string to be recognized one by one, in descending order of the number of target coding units corresponding to each candidate symbol.
8. The symbol processing method according to claim 1, wherein said matching candidate symbols in the candidate symbol set with the character string to be recognized comprises the following matching steps:
determining a value M, being the number of target coding units of the candidate symbol that has the N-th largest number of target coding units in the candidate symbol set, and intercepting the first M target coding units from the character string to be recognized as a first character string to be recognized, to be matched with the candidate symbols in the candidate symbol set; wherein the initial value of N is 1;
if a first candidate symbol identical to the first character string to be recognized exists in the candidate symbol set, taking the first candidate symbol as the matching result;
and if the first candidate symbol which is the same as the first character string to be recognized does not exist in the candidate symbol set, adding 1 to the value of N and repeatedly executing the matching step.
9. The symbol processing method according to claim 1,
wherein each key of the preset dictionary is a target coding unit, and the value corresponding to the key is at least one symbol that has the key as its first target coding unit.
10. The symbol processing method according to claim 1, further comprising:
determining whether the first target coding unit is a first coding unit before the step of determining a candidate symbol set from a preset dictionary is performed;
if the first target coding unit is not the first coding unit, executing the step of determining a candidate symbol set from a preset dictionary;
and if the first target coding unit is the first coding unit, determining that the starting symbol of the character string to be recognized is not the target symbol.
11. The symbol processing method according to claim 10,
wherein the first coding unit is a target coding unit used only for single-character symbols, and a single-character symbol consists of one target coding unit;
the determining whether the first target coding unit is the first coding unit comprises: determining whether the first target coding unit is a first coding unit based on a preset first coding unit set; wherein the first set of coding units comprises at least one of the first coding units.
12. The symbol processing method according to claim 1,
wherein the target coding unit comprises a Unicode scalar or a character.
13. The symbol processing method according to claim 1, wherein the preset dictionary is generated by:
acquiring mapping relation information between symbols and target coding units;
and generating, based on the mapping relation information, a dictionary for each symbol type according to the type of each symbol.
14. A symbol processing apparatus, characterized by comprising:
a to-be-recognized character acquisition unit, configured to acquire a character string to be recognized, wherein the character string to be recognized corresponds to at least one target coding unit;
a to-be-recognized character determining unit, configured to determine the first target coding unit corresponding to the character string to be recognized and the number of target coding units corresponding to the character string to be recognized;
a candidate symbol determining unit, configured to determine a candidate symbol set from a preset dictionary, wherein the candidate symbol set comprises at least one candidate symbol; the first target coding unit corresponding to the candidate symbol is the same as the first target coding unit corresponding to the character string to be recognized, and the number of target coding units corresponding to the candidate symbol is not more than the number of target coding units corresponding to the character string to be recognized;
and a matching unit, configured to match the candidate symbols in the candidate symbol set with the character string to be recognized and to determine a target symbol from the character string to be recognized based on the matching result.
15. An electronic device, comprising:
at least one memory and at least one processor;
wherein the memory is configured to store program code, and the processor is configured to call the program code stored in the memory to cause the electronic device to perform the method of any of claims 1 to 13.
16. A non-transitory computer storage medium, characterized in that,
the non-transitory computer storage medium stores program code that, when executed by a computer device, causes the computer device to perform the method of any of claims 1 to 13.
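For illustration only, the iterative matching step of claim 8, the pre-check of claims 10 and 11, and the splitting of claim 4 might be combined as in the Python sketch below. Several points are assumptions that the claims do not fix: the names `split_symbols` and `single_only_units`, the use of distinct candidate lengths as the successive values of M, the fallback that treats an unmatched first coding unit as a symbol of its own, and the repetition of the procedure over the remainder of the string. A Python code point stands in for a target coding unit.

```python
def split_symbols(text, dictionary, single_only_units=frozenset()):
    """Split `text` into symbols, preferring the longest dictionary match."""
    result = []
    i = 0
    while i < len(text):
        rest = text[i:]
        first = rest[0]
        matched = None
        # Pre-check (claims 10-11): skip the dictionary lookup when the first
        # unit is only ever used for single-character symbols.
        if first not in single_only_units:
            candidates = {s for s in dictionary.get(first, []) if len(s) <= len(rest)}
            # Successive values of M: candidate lengths from largest to smallest.
            for m in sorted({len(s) for s in candidates}, reverse=True):
                prefix = rest[:m]  # intercept the first M coding units
                if prefix in candidates:
                    matched = prefix
                    break
        if matched is None:
            matched = first  # assumed fallback: one coding unit is one symbol
        result.append(matched)
        i += len(matched)
    return result


# Hypothetical dictionary and input, for demonstration.
TECHNOLOGIST = "\U0001F468\u200D\U0001F4BB"
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
THUMBS_UP_TONE = "\U0001F44D\U0001F3FD"
EMOJI_DICT = {
    "\U0001F468": [TECHNOLOGIST, FAMILY],
    "\U0001F44D": [THUMBS_UP_TONE],
}

parts = split_symbols("a" + FAMILY + THUMBS_UP_TONE + "b", EMOJI_DICT,
                      single_only_units=frozenset("ab"))
```

Intercepting a prefix of length M and testing set membership means each length is checked once, however many candidates share that length.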
CN202210293122.9A 2022-03-23 2022-03-23 Symbol processing method, device, electronic equipment and storage medium Pending CN114638218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210293122.9A CN114638218A (en) 2022-03-23 2022-03-23 Symbol processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210293122.9A CN114638218A (en) 2022-03-23 2022-03-23 Symbol processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114638218A true CN114638218A (en) 2022-06-17

Family

ID=81950126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210293122.9A Pending CN114638218A (en) 2022-03-23 2022-03-23 Symbol processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114638218A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116796705A (en) * 2023-08-09 2023-09-22 腾讯科技(深圳)有限公司 Method and device for detecting expression, electronic equipment and storage medium
WO2024046275A1 (en) * 2022-09-02 2024-03-07 华为技术有限公司 Display method and electronic device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024046275A1 (en) * 2022-09-02 2024-03-07 华为技术有限公司 Display method and electronic device
CN116796705A (en) * 2023-08-09 2023-09-22 腾讯科技(深圳)有限公司 Method and device for detecting expression, electronic equipment and storage medium
CN116796705B (en) * 2023-08-09 2024-03-12 腾讯科技(深圳)有限公司 Method and device for detecting expression, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111176996A (en) Test case generation method and device, computer equipment and storage medium
CN114638218A (en) Symbol processing method, device, electronic equipment and storage medium
CN111314388B (en) Method and apparatus for detecting SQL injection
CN113407814B (en) Text searching method and device, readable medium and electronic equipment
CN112287206A (en) Information processing method and device and electronic equipment
CN115640815A (en) Translation method, translation device, readable medium and electronic equipment
CN111597107A (en) Information output method and device and electronic equipment
JP2022017173A (en) Method and device for outputting information, electronic device, computer-readable storage medium, and computer program
CN111240962B (en) Test method, test device, computer equipment and computer storage medium
CN115618808A (en) Document typesetting method and device, electronic equipment and storage medium
CN113807056B (en) Document name sequence error correction method, device and equipment
US8577861B2 (en) Apparatus and method for searching information
CN112487765B (en) Method and device for generating notification text
CN111737571B (en) Searching method and device and electronic equipment
CN111782895B (en) Retrieval processing method and device, readable medium and electronic equipment
CN111339776B (en) Resume parsing method and device, electronic equipment and computer-readable storage medium
CN111221424B (en) Method, apparatus, electronic device, and computer-readable medium for generating information
CN114429629A (en) Image processing method and device, readable storage medium and electronic equipment
CN112509581A (en) Method and device for correcting text after speech recognition, readable medium and electronic equipment
CN108734149B (en) Text data scanning method and device
CN110780898A (en) Page data upgrading method and device and electronic equipment
CN115374320B (en) Text matching method and device, electronic equipment and computer medium
CN113032808B (en) Data processing method and device, readable medium and electronic equipment
CN117709364A (en) Text processing method, device, electronic equipment and storage medium
CN106844783B (en) Information processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination