CN113553832A - Word processing method and device, electronic equipment and computer readable storage medium - Google Patents

Word processing method and device, electronic equipment and computer readable storage medium Download PDF

Info

Publication number
CN113553832A
CN113553832A (application CN202010328148.3A; granted publication CN113553832B)
Authority
CN
China
Prior art keywords
character
candidate
text
characters
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010328148.3A
Other languages
Chinese (zh)
Other versions
CN113553832B (en)
Inventor
包祖贻
李辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010328148.3A priority Critical patent/CN113553832B/en
Publication of CN113553832A publication Critical patent/CN113553832A/en
Application granted granted Critical
Publication of CN113553832B publication Critical patent/CN113553832B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a word processing method and apparatus, an electronic device, and a computer-readable storage medium. The method comprises the following steps: performing character segmentation on a text to be processed to obtain a plurality of characters; obtaining a plurality of candidate word units from the plurality of characters, wherein each candidate word unit comprises at least one Chinese character; generating a plurality of candidate texts for one or more of the candidate word units, and calculating confidence degrees of the plurality of candidate texts; and selecting corrected text for the one or more units according to the confidence degrees. By calculating, for each Chinese character segmented from the input text and for the candidate word units expanded from it, a confidence with respect to the semantics of the input text, and selecting the candidate unit with the highest semantic confidence to correct the corresponding character in the input text, the accuracy of text correction can be improved.

Description

Word processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a word processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
As electronic products have entered every area of people's life and work, their text input functions have become increasingly important. Because Chinese contains many homophones, near-homophones, and visually similar characters, spelling errors are easily introduced when text is entered through various input tools. In many settings the tolerance for erroneous text is low, so text entered on electronic devices frequently needs to be proofread.
In the prior art, such proofreading is usually performed manually, which is inefficient and cannot keep up with the large volume of text produced on the internet.
Disclosure of Invention
The embodiments of the present application provide a word processing method and apparatus, an electronic device, and a computer-readable storage medium, so as to overcome the low efficiency of text proofreading in the prior art.
To achieve the above object, an embodiment of the present application provides a word processing method, including:
performing character segmentation on a text to be processed to obtain a plurality of characters;
obtaining a plurality of candidate word units from the plurality of characters, wherein each candidate word unit comprises at least one Chinese character;
generating a plurality of candidate texts for one or more units in the plurality of candidate word units, and calculating confidence degrees of the plurality of candidate texts;
selecting corrected text for the one or more units according to the confidence degrees.
An embodiment of the present application further provides a word processing device, including:
the segmentation module is used for performing character segmentation on the text to be processed to obtain a plurality of characters;
a first obtaining module, configured to obtain a plurality of candidate word units from the plurality of characters, where each candidate word unit includes at least one Chinese character;
a first generation module for generating a plurality of candidate texts for one or more units of the plurality of candidate word units;
the first calculation module is used for calculating the confidence degrees of a plurality of candidate texts;
a first selection module, configured to select corrected text for the one or more units according to the confidence degrees.
An embodiment of the present application further provides an electronic device, including:
a memory for storing a program;
and a processor, configured to run the program stored in the memory, where the program, when run, performs the word processing method described above.
The embodiment of the application also provides a computer readable storage medium, on which a computer program executable by a processor is stored, wherein the program realizes the word processing method as described above when being executed by the processor.
According to the word processing method and apparatus, the electronic device, and the computer-readable storage medium of the embodiments of the present application, a confidence with respect to the semantics of the input text is calculated for each character segmented from the input text and for the candidate word units expanded from that character, and the candidate unit with the highest semantic confidence is selected to correct the corresponding character in the input text. This combines the advantages of character-level checking and word-granularity semantic checking, and therefore improves the accuracy of text correction.
The foregoing description is only an overview of the technical solutions of the present application. In order that the technical means of the present application may be understood more clearly and implemented according to the content of the specification, and in order to make the above and other objects, features, and advantages of the present application more apparent, specific embodiments of the present application are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic view of an application scenario of a word processing method according to an embodiment of the present application;
FIG. 2 is a flow diagram of one embodiment of a word processing method provided herein;
FIG. 3 is a flow diagram of another embodiment of a word processing method provided by the present application;
FIG. 4 is a schematic structural diagram of an embodiment of a word processing device according to the present application;
fig. 5 is a schematic structural diagram of an embodiment of an electronic device provided in the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Example one
The scheme provided by the embodiment of the application can be applied to any word processing system with data processing capability. Fig. 1 is a schematic view of an application scenario of a word processing method provided in an embodiment of the present application, and the scenario shown in fig. 1 is only one example of a scenario in which the technical solution of the present application may be applied.
With the development of computer technology, more and more users use computers as tools for work, in particular for work that traditionally required manually written text. For example, as shown in fig. 1, an employee may need to write various reports at work, or a lawyer may need to draft legal documents for clients; such writing is now done with computers, and with the development of speech technology there are even schemes in which the user simply speaks into an audio capture device and the computer converts the speech input into text. In these cases, because of the complexity of Chinese characters, for example polyphonic characters, homophones, and visually similar characters, the user may enter wrong characters through the various input devices. Such wrong characters usually defeat the purpose of the text being composed. For example, if the work report submitted by the user in fig. 1 contains wrongly written characters, it may leave a bad impression on the managers or clients who review it, while legal work tolerates wrongly written characters even less: a single wrong character may cause the service to fail and may even lead to substantial compensation.
Therefore, in order to eliminate the adverse effects of erroneous text, a dedicated proofreader is usually arranged to proofread the text after the user has entered a first draft, so as to ensure its correctness. Such manual proofreading, however, is both costly and inefficient. Word-granularity checking schemes have therefore been developed in the prior art; they depend on the accuracy of word segmentation and achieve high accuracy for spell checking of Western languages such as English, but Chinese text is usually written without word boundaries, and erroneous Chinese characters often cause word segmentation errors, which in turn degrade the accuracy of word-granularity checking. For Chinese, character-granularity checking schemes have therefore appeared in the prior art, which check text character by character and thereby avoid the loss of accuracy caused by wrong word segmentation.
Thus, in embodiments of the present application, after the user enters text, the text is fed into a word processing system for processing. In this system, the unit of checking for each text is the word block, rather than the purely word-based or purely character-based units of the prior art; in other words, single-character words and multi-character words are both treated as word blocks, so that word-level semantic information can be fully used to detect and correct erroneous characters. For example, the input text may first be segmented at character granularity to obtain all of its characters. In this application, consecutive digits and consecutive English letters are each treated as a whole: "12" is recognized as the number 12 rather than the two digits 1 and 2, and "KFC" is recognized as one abbreviation rather than the three letters "K", "F" and "C". Segmentation in character units therefore yields the individual characters of the input text, including Chinese characters as well as numeric and English tokens. On the basis of the Chinese characters, all possible candidate words are then generated, where a candidate may be a single-character word or a multi-character word. For example, if the input text consists of the characters glossed as "I", "jing", "day", "11", and "o'clock" (the Chinese example characters are rendered here and below by English glosses or pinyin), association and expansion may be performed from each of the Chinese characters to form candidate words. In the embodiment of the present application, such expansion may be performed with a pinyin confusion set and a glyph confusion set, which supply characters with the same or similar pronunciation or with a similar shape. For these characters, expansion yields candidate words such as those glossed "grip", "hug", "sunny", "today", and "shop". In other words, the application uses homophones and near-homophones to make associations and guesses, producing a set of candidate words that may include the correct word "today" for the erroneous "jing day", i.e., the candidate word "today" corrects "jing". Character discrimination calculations may then be performed on these word blocks. For example, the candidate words in the set may be combined according to the positions of their corresponding characters in the input text, a semantic confidence may be calculated for each combination, and the correct text determined from the results. In some embodiments, to speed up the search, a confidence may also be calculated for each candidate word within the input text: the candidates may be substituted one by one in the character order of the input text, and the confidence of the text after substitution indicates whether it is semantically reasonable, so that unreasonable candidates, i.e. candidates that do not fit the semantics of the input text, are excluded, and the candidates that do fit the semantics can finally be selected as the corrected words.
In this way, by calculating, for each character segmented from the input text and for the candidate word units expanded from it, a confidence with respect to the semantics of the input text, and selecting the candidate block with the highest semantic confidence to correct the corresponding character in the input text, the advantages of character-level checking and word-granularity semantic checking are combined, so the accuracy of text correction can be improved.
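To make the pipeline above concrete, the following is a minimal Python sketch of the overall flow. The application contains no code; the confusion set, the score table, and the pinyin-style tokens ("jing1", "jin1", ...) used here are illustrative assumptions standing in for the pinyin/glyph confusion sets and the unspecified confidence model.

```python
from itertools import product

# Illustrative stand-ins only: a real system would load pinyin/glyph
# confusion sets and a trained semantic confidence model.
CONFUSION = {"jing1": ["jin1", "jing4"]}
PHRASE_SCORES = {("wo3", "jin1", "tian1"): 0.9, ("wo3", "jing1", "tian1"): 0.2}

def candidates(char):
    """A character's candidate word units: itself plus its confusion entries."""
    return [char] + CONFUSION.get(char, [])

def confidence(chars):
    """Semantic confidence of a candidate text (toy lookup table)."""
    return PHRASE_SCORES.get(tuple(chars), 0.1)

def correct(chars):
    """Expand every character, enumerate candidate texts, keep the best one."""
    return list(max(product(*(candidates(c) for c in chars)), key=confidence))

print(correct(["wo3", "jing1", "tian1"]))  # -> ['wo3', 'jin1', 'tian1']
```

A real system would replace the score table with a language model and the single dictionary with full confusion sets, but the selection logic, i.e. enumerate candidate texts, score them, and keep the best, is the same as described above.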
The above embodiments are illustrations of technical principles and exemplary application frameworks of the embodiments of the present application, and specific technical solutions of the embodiments of the present application are further described in detail below through a plurality of embodiments.
Example two
Fig. 2 is a flowchart of an embodiment of a word processing method provided in the present application, and an execution subject of the method may be various terminal or server devices with data processing capability, or may be a device or chip integrated on these devices. As shown in fig. 2, the word processing method includes the following steps:
s201, performing character segmentation on the text to be processed to obtain a plurality of characters.
In the embodiment of the present application, various text information may be acquired from various text data sources. For example, various text contents can be input by a user through an input tool such as a keyboard, or voice can be input through a voice acquisition device and the voice input by the user is converted into text contents through a voice recognition program. For another example, the text to be collated may also be acquired from the internet as the text to be processed in the present application.
When the text to be processed has been obtained, in step S201 it may be divided character by character to obtain each character in the text to be processed. In general, the text entered by the user may contain various kinds of characters, such as Chinese, English, digits, and special characters. The embodiments of the present application can handle text in various languages; as described above, Western languages such as English can be segmented by character and checked with existing techniques, and can of course also be processed with the scheme of the present application. The technical solution is described here using Chinese characters as an example. When dividing the text into characters, runs of consecutive digits (for example, a two-digit number) and runs of consecutive English or other Western-language letters are each kept as a whole, so that a number or an English word is treated as a single token.
For example, if the user inputs the text glossed as "I jing day 11 o'clock" as the text to be processed, it may be divided in step S201 into the characters "I", "jing", "day", "11", and "o'clock".
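As an illustration of step S201, a character segmentation that keeps runs of digits and Latin letters whole might be sketched as follows; the regular expression is an assumption, since the application only states the desired behavior.

```python
import re

# One token per individual character, but runs of digits and runs of
# Latin letters are kept whole, as described for step S201.
_TOKEN = re.compile(r"[0-9]+|[A-Za-z]+|.", re.DOTALL)

def segment(text: str) -> list[str]:
    """Split the text to be processed into characters / digit runs / letter runs."""
    return [t for t in _TOKEN.findall(text) if not t.isspace()]

print(segment("我今天11点KFC"))  # -> ['我', '今', '天', '11', '点', 'KFC']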
S202, a plurality of candidate character units are obtained from a plurality of characters.
After the individual characters are obtained in step S201, candidate word units may be obtained for the Chinese characters among them; each candidate word unit includes at least one Chinese character. Here, the candidate word units of each Chinese character in the text to be processed are candidate words obtained by association and expansion from that character, for example by pronunciation, by glyph, and so on. For example, when the user inputs the text glossed as "I jing day 11 o'clock", pronunciation-based association may be performed in step S202 on the character "I" obtained in step S201, yielding same- or near-pronunciation candidates such as those glossed "grip" and "wo", and on the character "jing", yielding candidates glossed as "tight", "today (jin)", and "fine". In addition, characters whose glyph resembles "jing", such as the character glossed "surprise", can be obtained as candidate words for "jing".
Besides obtaining single characters as candidate words for the Chinese characters in the text to be processed, the obtained characters may also be combined according to the positions of their corresponding characters in the text to be processed, giving multi-character candidate words. For example, the candidate "grip" corresponding to "I" and the candidates "tight", "today", "fine" corresponding to "jing" may be combined into combined candidate words such as those glossed "grip tightly" and "I today".
In addition, in the embodiment of the present application, the maximum number of single characters that may be combined into one candidate word can be set as needed, for example 2, so that at most two single characters are combined, or 3, so that at most three single characters are combined into a candidate word.
Therefore, in step S202, each Chinese character obtained by character segmentation of the text to be processed is expanded, so that a plurality of candidate words corresponding to each Chinese character are obtained as candidate word units, each of which comprises at least one Chinese character.
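A sketch of the expansion in step S202 is given below. The confusion-set entries are illustrative placeholders, since the application assumes a pinyin confusion set and a glyph confusion set without enumerating their contents, and the span-by-span expansion is one straightforward way to realize combinations of up to a predetermined number of adjacent characters.

```python
from itertools import chain

# Illustrative confusion sets only; a real system would load a pinyin
# confusion set and a glyph confusion set, as described above.
PINYIN_CONFUSION = {"京": ["今", "经"], "我": ["握", "窝"]}
GLYPH_CONFUSION = {"京": ["惊", "景"]}

def candidate_units(chars, max_len=2):
    """Candidate word units: every span of 1..max_len adjacent characters,
    plus variants obtained by swapping one character of the span for a
    pronunciation- or glyph-based alternative."""
    units = []  # (start_index, span_length, candidate_string)
    for i in range(len(chars)):
        for n in range(1, max_len + 1):
            if i + n > len(chars):
                break
            toks = chars[i:i + n]
            alts = {"".join(toks)}
            for j, c in enumerate(toks):
                for alt in chain(PINYIN_CONFUSION.get(c, []), GLYPH_CONFUSION.get(c, [])):
                    alts.add("".join(toks[:j] + [alt] + toks[j + 1:]))
            units.extend((i, n, a) for a in sorted(alts))
    return units

for unit in candidate_units(["我", "京", "天"]):
    print(unit)
```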
S203, generating a plurality of candidate texts for one or more units in the plurality of candidate word units, and calculating the confidence degrees of the plurality of candidate texts.
In the embodiment of the present application, after the candidate words, i.e. candidate word units, corresponding to each Chinese character in the text to be processed have been obtained in step S202, candidate texts may be generated from them in step S203. For example, the candidate words may be combined into candidate texts according to the positions, in the text to be processed, of the Chinese characters to which they correspond. Continuing the example in which the user inputs the text glossed as "I jing day 11 o'clock": candidates glossed "grip", "wo", and "I" have been obtained for the character "I"; "tight", "today", "fine", and "surprise" for the character "jing"; "paste" for the character "day"; together with combined candidates such as those glossed "grip tightly" and "I today". The candidate words obtained in step S202 may therefore be combined according to the positions of the three characters "I", "jing", and "day" in the text to be processed, giving candidate texts such as those glossed "grip the day", "I hug", "wo", and "I today". A confidence may then be calculated for each of these candidate texts, i.e. for each combination of candidate words. In the embodiment of the present application, the confidence of a candidate text may be its semantic confidence, i.e. how semantically reasonable it is. For example, among the candidate texts, "I hug" and "I today" have high confidence, in other words they are semantically reasonable, whereas combinations such as "Beijing paste", "grip the day", or "wo today" have low confidence, i.e. they are semantically unreasonable.
In step S203, all candidate words of all Chinese characters in the text to be processed may be combined in the order of their character positions, and the corresponding confidence may be calculated for each combination.
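Step S203 can be sketched as follows; the bigram table is a toy stand-in for whatever semantic confidence model is actually used (the application does not fix one), and the example characters are illustrative.

```python
import math

# Stand-in "semantic confidence": a tiny character-bigram language model.
# Any scorer of candidate texts (n-gram LM, neural LM, ...) could be used.
BIGRAM_LOGPROB = {
    ("我", "今"): -0.5, ("今", "天"): -0.3,
    ("我", "京"): -4.0, ("京", "天"): -4.5,
}

def confidence(text: str) -> float:
    """Log-probability of the candidate text under the toy bigram model."""
    return sum(BIGRAM_LOGPROB.get(p, math.log(1e-4)) for p in zip(text, text[1:]))

def candidate_texts(per_char_candidates):
    """All combinations of per-character candidates, in character order (S203)."""
    texts = [""]
    for cands in per_char_candidates:
        texts = [t + c for t in texts for c in cands]
    return texts

cands = [["我", "握"], ["京", "今"], ["天"]]
for t in sorted(candidate_texts(cands), key=confidence, reverse=True)[:3]:
    print(t, round(confidence(t), 2))
```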
S204, selecting the corrected text for one or more units according to the confidence.
In step S204, a selection may be made according to the confidence of each candidate text calculated in step S203, so as to determine the candidate text to be used as the corrected text of the text to be processed. For example, once all candidate texts have been obtained and their confidences calculated in step S203, the candidate texts may be ranked by confidence in step S204, and the candidate text with the highest confidence selected as the corrected text of the text to be processed. The corrected text may then be output, either to be presented to the user or to directly replace the erroneous text entered by the user.
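Step S204 then reduces to ranking and selection. A minimal sketch is shown below; keeping the original text when it already scores at least as well is a design choice of this sketch, not something the application prescribes.

```python
def select_corrected(original: str, candidate_scores: dict) -> str:
    """S204: rank the candidate texts by confidence and return the best one;
    fall back to the original text if it already scores at least as well."""
    ranked = sorted(candidate_scores.items(), key=lambda kv: kv[1], reverse=True)
    best_text, best_score = ranked[0]
    return original if candidate_scores.get(original, float("-inf")) >= best_score else best_text

scores = {"我京天": -8.5, "我今天": -0.8, "握今天": -9.5}
print(select_corrected("我京天", scores))  # -> 我今天
```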
According to the word processing method of the embodiment of the application, a confidence with respect to the semantics of the input text is calculated for each character segmented from the input text and for the candidate word units expanded from it, and the candidate unit with the highest semantic confidence is selected to correct the corresponding character in the input text; this combines the advantages of character-level checking and word-granularity semantic checking, and therefore improves the accuracy of text correction.
Example three
Fig. 3 is a flowchart of another embodiment of the word processing method provided in the present application, and the execution subject of the method may be various terminal or server devices with word processing capability, or may be a device or chip integrated on these devices. As shown in fig. 3, the word processing method includes the following steps:
s301, performing character segmentation on the text to be processed to obtain a plurality of characters.
In the embodiment of the present application, various text information may be acquired from various text data sources. For example, various text contents can be input by a user through an input tool such as a keyboard, or voice can be input through a voice acquisition device and the voice input by the user is converted into text contents through a voice recognition program. For another example, the text to be collated may also be acquired from the internet as the text to be processed in the present application.
When the text to be processed has been obtained, in step S301 it may be divided character by character to obtain each character in the text to be processed. In general, the text entered by the user may contain various kinds of characters, such as Chinese, English, digits, and special characters. The embodiments of the present application can handle text in various languages; as described above, Western languages such as English can be segmented by character and checked with existing techniques, and can of course also be processed with the scheme of the present application. The technical solution is described here using Chinese characters as an example. When dividing the text into characters, runs of consecutive digits (for example, a two-digit number) and runs of consecutive English or other Western-language letters are each kept as a whole, so that a number or an English word is treated as a single token.
For example, if the user inputs the text glossed as "I jing day 11 o'clock" as the text to be processed, it may be divided in step S301 into the characters "I", "jing", "day", "11", and "o'clock".
S302, according to the pronunciation of a character, obtaining a second character whose pronunciation is the same as or similar to that of the character as a candidate word unit; and/or, according to the pronunciation of a combination of a predetermined number of adjacent characters in the text to be processed, obtaining a second character combination whose pronunciation is the same as or similar to that of the character combination as a candidate word unit.
After the individual characters are obtained in step S301, candidate word units may be obtained for each Chinese character. Here, the candidate word units of each character in the text to be processed are candidate words obtained by association and expansion from that character; in this step the expansion is by pronunciation. For example, a pinyin confusion set or another pinyin database may be used: characters whose pronunciation is the same as or similar to that of each Chinese character of the text to be processed are taken from the pinyin confusion set as candidate word units. For example, when the user inputs the text glossed as "I jing day 11 o'clock", pronunciation-based association may be performed in step S302 on the character "I" obtained in step S301, yielding same-pronunciation candidates such as those glossed "grip" and "wo", and on the character "jing", yielding candidates glossed as "tight", "today (jin)", and "fine".
In addition, besides single candidate characters, Chinese character combinations with the same or similar pronunciation may be obtained from the pronunciation of a combination of a predetermined number of adjacent characters in the text to be processed; for example, for the two-character combination glossed "I jing", combined candidates glossed "grip tightly" and "I today" may be obtained.
The number of adjacent characters in such a combination may be set as needed, for example 2, so that same- or near-pronunciation candidates are obtained for combinations of at most two characters, or 3, so that they are obtained for combinations of at most three characters.
S303, according to the glyph of a character, obtaining a third character whose glyph is the same as or similar to that of the character as a candidate word unit; and/or, according to the glyphs of a combination of a predetermined number of adjacent characters in the text to be processed, obtaining a third character combination whose glyph is the same as or similar to that of the character combination as a candidate word unit.
Corresponding to step S302, candidate words may also be obtained according to the glyph of each Chinese character in the text to be processed. For example, a glyph confusion set or another glyph database may be used: characters whose glyph is similar to that of a character of the text to be processed are taken from the glyph confusion set as candidate word units. For example, when the user inputs the text glossed as "I jing day 11 o'clock", glyph-based association may be performed in step S303 on the character "jing" obtained in step S301, yielding visually similar candidates such as those glossed "surprise" and "compete".
In addition, besides single candidate characters, character combinations with similar glyphs may be obtained from the glyphs of combinations of adjacent characters in the text to be processed; for example, a visually similar combined candidate containing the character glossed "surprise" may be obtained for the combination "I jing".
The number of adjacent characters in such a combination may likewise be set as needed, for example 2 or 3, so that glyph-similar candidates are obtained for combinations of at most two or three characters.
S304, according to a preset sequence, splicing one or more units of the candidate word units to obtain a plurality of candidate texts.
In the embodiment of the present application, after the candidate words, i.e. candidate word units, corresponding to each Chinese character in the text to be processed have been obtained in steps S302 and/or S303, candidate texts may be generated from them in step S304. For example, a first candidate text may be assembled by splicing one or more of the candidate word units in a predetermined order. For example, when the user inputs the text glossed as "I jing day 11 o'clock", one, two, or even three characters may be selected starting from the leftmost character "I", the candidate word units corresponding to the selected characters are chosen accordingly, and these units are spliced into candidate texts. For example, when "I" and "jing" are selected from the left, the candidate units glossed "wo", "grip tightly", "I today", and so on can be obtained and spliced, generating a plurality of candidate texts.
S305, eliminating candidate texts whose confidence is lower than a preset threshold.
When candidate texts have been obtained in step S304, a confidence may be calculated for each of them in step S305, and some candidate texts may be excluded on that basis. For example, the semantic reasonableness of each candidate text may be calculated. In the example above, the candidates glossed "grip tightly" and "I today" have high confidence, in other words they are semantically reasonable, whereas the other candidate texts have low confidence, i.e. they are semantically unreasonable.
Steps S304 to S305 may then be performed iteratively over the remaining characters of the text to be processed. In each round, the candidate texts retained in the previous rounds are spliced with the candidate words of the predetermined number of Chinese characters selected in the current round, forming the candidate texts to be scored in this round. For example, if the previous round retained the two candidate texts glossed "grip tightly" and "I today", the current round splices them with the candidate words of the next character "day", for example those glossed "sweet" and "add", forming new candidate texts whose confidences are then calculated. This continues until the candidate words of the last character of the text to be processed have been incorporated and scored; candidate texts whose confidence is below the preset threshold are excluded in each round, so that candidate texts for the entire text are determined through these steps.
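The iterative splicing and pruning of steps S304 and S305 resembles a left-to-right beam search with a confidence threshold. A minimal sketch under that interpretation, with a toy bigram scorer and illustrative characters, might look like this:

```python
import math

# Toy bigram scorer standing in for the unspecified confidence model.
BIGRAM = {("我", "今"): -0.5, ("今", "天"): -0.3, ("我", "京"): -4.0, ("京", "天"): -4.5}

def score(text):
    return sum(BIGRAM.get(p, math.log(1e-4)) for p in zip(text, text[1:]))

def search(per_char_candidates, threshold=-6.0):
    """Left-to-right expansion as in S304/S305: splice the candidates of the
    next character onto the texts kept so far, then prune low-confidence ones."""
    kept = [""]
    for cands in per_char_candidates:
        expanded = [t + c for t in kept for c in cands]
        # Keep candidates above the threshold; never prune everything away.
        kept = [t for t in expanded if score(t) >= threshold] or expanded[:1]
    return max(kept, key=score)

cands = [["我", "握"], ["京", "今"], ["天", "添"]]
print(search(cands))  # -> 我今天
```

The fallback that keeps at least one candidate when all fall below the threshold is a safeguard of this sketch; the application only states that low-confidence candidates are excluded.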
In addition, after determining candidate texts corresponding to the entire text to be processed through the steps described above, in the embodiment of the present application, the word processing method may further include:
s306, comparing the text to be processed with the corrected text.
S307, according to the comparison result, determining a character in the text to be processed that differs from the character at the corresponding position in the corrected text as an erroneous character.
According to the embodiment of the application, after the corrected text of the text to be processed input by the user has been determined according to the confidence, the originally input text may be proofread against the corrected text, so that the errors in the text entered by the user are identified and can be presented to the user.
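Steps S306 and S307 amount to a position-wise comparison of the two texts. A minimal sketch, assuming the corrected text has the same length as the text to be processed (i.e. only character substitutions, which is what the confusion-set expansion above produces), is:

```python
def mark_errors(original: str, corrected: str) -> list[tuple[int, str, str]]:
    """S306/S307: compare the text to be processed with the corrected text
    and report the positions whose characters differ."""
    return [(i, o, c) for i, (o, c) in enumerate(zip(original, corrected)) if o != c]

print(mark_errors("我京天11点", "我今天11点"))  # -> [(1, '京', '今')]
```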
According to the word processing method of the embodiment of the application, a confidence with respect to the semantics of the input text is calculated for each character segmented from the input text and for the candidate word units expanded from it, and the candidate unit with the highest semantic confidence is selected to correct the corresponding character in the input text; this combines the advantages of character-level checking and word-granularity semantic checking, and therefore improves the accuracy of text correction.
Example four
Fig. 4 is a schematic structural diagram of an embodiment of a word processing apparatus according to the present application, which can be used to execute the method steps shown in fig. 2 and fig. 3. As shown in fig. 4, the word processing device may include: a segmentation module 41, a first acquisition module 42, a first generation module 43, a first calculation module 44 and a first selection module 45.
The segmentation module 41 may be configured to perform character segmentation on the text to be processed to obtain a plurality of characters. In embodiments of the present application, a word processing device may obtain various text information from various sources of textual data. For example, various text contents can be input by a user through an input tool such as a keyboard, or voice can be input through a voice acquisition device and the voice input by the user is converted into text contents through a voice recognition program. For another example, the text to be collated may also be acquired from the internet as the text to be processed in the present application.
In the embodiment of the present application, when the text to be processed has been obtained, the segmentation module 41 may divide it character by character to obtain each character in the text to be processed. In general, the text entered by the user may contain various kinds of characters, such as Chinese, English, digits, and special characters. The embodiments of the present application can handle text in various languages; as described above, Western languages such as English can be segmented by character and checked with existing techniques, and can of course also be processed with the scheme of the present application.
The first obtaining module 42 may be configured to obtain a plurality of candidate word units from the plurality of characters, where each candidate word unit may include at least one Chinese character. After the segmentation module 41 obtains the individual characters, the first obtaining module 42 may obtain the candidate word units of each Chinese character. Here, the candidate word units of each character in the text to be processed are candidate words obtained by association and expansion from that character, for example by pronunciation, by glyph, and so on.
Thus, for example, in the present embodiment, the first obtaining module 42 may include a first obtaining unit 421, configured to obtain, according to the pronunciation of one character, a second character whose pronunciation is the same as or similar to that of the character as a candidate word unit; and/or, according to the pronunciation of a combination of a predetermined number of adjacent characters in the text to be processed, obtain a second character combination whose pronunciation is the same as or similar to that of the character combination as a candidate word unit. For example, the first obtaining unit 421 may use a pinyin confusion set or another pinyin database: characters whose pronunciation is the same as or similar to that of each Chinese character of the text to be processed are taken from the pinyin confusion set as candidate word units. For example, when the user inputs the text glossed as "I jing day 11 o'clock", the first obtaining unit 421 may perform pronunciation-based association on the character "I" obtained by the segmentation module 41, yielding same-pronunciation candidates such as those glossed "grip" and "wo", and on the character "jing", yielding candidates glossed as "tight", "today (jin)", and "fine".
In addition, besides single candidate characters, the first obtaining unit 421 may obtain Chinese character combinations with the same or similar pronunciation from the pronunciation of a combination of a predetermined number of adjacent characters in the text to be processed; for example, for the combination glossed "I jing", combined candidates glossed "grip tightly" and "I today" may be obtained.
The first obtaining module 42 may further include a second obtaining unit 422, configured to obtain, according to the glyph of one character, a third character whose glyph is the same as or similar to that of the character as a candidate word unit; and/or, according to the glyphs of a combination of a predetermined number of adjacent characters in the text to be processed, obtain a third character combination whose glyph is the same as or similar to that of the character combination as a candidate word unit. For example, a glyph confusion set or another glyph database may be used: characters whose glyph is similar to that of a character of the text to be processed are taken from the glyph confusion set as candidate word units. For example, when the user inputs the text glossed as "I jing day 11 o'clock", the second obtaining unit 422 may perform glyph-based association on the character "jing" obtained by the segmentation module 41, yielding visually similar candidates such as those glossed "surprise" and "compete".
In addition, besides single candidate characters, the second obtaining unit 422 may obtain character combinations with similar glyphs from the glyphs of combinations of adjacent characters in the text to be processed; for example, a visually similar combined candidate containing the character glossed "surprise" may be obtained for the combination "I jing".
The first generation module 43 may be configured to generate a plurality of candidate texts for one or more units of the plurality of candidate word units. In the embodiment of the present application, after the first obtaining module 42 obtains candidate words, that is, candidate word units, corresponding to respective characters in the text to be processed, the first generating module 43 may generate candidate texts based on the candidate words. For example, the concatenation process may be performed on one or more units of the candidate word units in a predetermined order to obtain a plurality of candidate texts.
The first calculation module 44 may be configured to calculate the confidences of the plurality of candidate texts, i.e. of the combinations of candidate words generated by the first generation module 43. In the embodiment of the present application, the confidence of a candidate text may be its semantic confidence, i.e. how semantically reasonable it is. For example, among the candidate texts, those glossed "I hug" and "I today" have high confidence, in other words they are semantically reasonable, whereas those glossed "Beijing paste", "grip the day", or "wo today" have low confidence, i.e. they are semantically unreasonable.
Therefore, in this embodiment of the present application, the word processing apparatus may further include an excluding module 46, configured to exclude candidate texts whose confidence is lower than a preset threshold.
For example, the first generation module 43 may select a predetermined number of characters from the text to be processed in a predetermined order, according to the positions of the Chinese characters in the text to be processed, select the corresponding candidate word units for the selected characters, and splice them, together with the candidate texts already determined, in the order of the positions of the corresponding characters in the text to be processed, to form a plurality of candidate texts.
The first calculation module 44 may then calculate the semantic confidence of each candidate text, and the excluding module 46 may exclude candidate texts whose confidence is lower than the preset threshold. The first selection module 45 may thus rank the plurality of candidate texts by confidence and, according to the ranking result, determine at least one candidate text as the candidate text corresponding to the selected characters.
Thereafter, the first generation module 43, the first calculation module 44, and the excluding module 46 may perform the above processing iteratively over the characters of the text to be processed entered by the user. In each round, the first generation module 43 splices the candidate texts determined in the previous rounds with the candidate words of the predetermined number of characters selected in the current round, forming the candidate texts to be scored in this round. For example, if the previous round retained the two candidate texts glossed "grip tightly" and "I today", the current round splices them with the candidate words of the next character "day", for example those glossed "sweet" and "add", forming new candidate texts; the first calculation module 44 calculates their confidences and the excluding module 46 excludes those below the preset threshold. This continues until the candidate words of the last character of the text to be processed have been scored and candidate texts have been determined for the entire text.
In addition, after determining candidate texts corresponding to the entire text to be processed as described above, in the embodiment of the present application, the word processing apparatus may further include: an alignment module 47 and a determination module 48.
The comparison module 47 may be used to compare the text to be processed with the corrected text, and the determining module 48 may be configured to determine, according to the comparison result, a character in the text to be processed that differs from the character at the corresponding position in the corrected text as an erroneous character.
According to the embodiment of the present application, after the first selection module 45 determines the corrected text of the text to be processed according to the confidence, the originally input text may be proofread against the corrected text by the comparison module 47 and the determining module 48, so as to identify the errors in the text entered by the user.
The word processing device of the embodiment of the application calculates, for each character segmented from the input text and for the candidate word units expanded from it, a confidence with respect to the semantics of the input text, and selects the candidate unit with the highest semantic confidence to correct the corresponding character in the input text; this combines the advantages of character-level checking and word-granularity semantic checking, and therefore improves the accuracy of text correction.
Example five
The internal functions and structures of the text processing apparatus, which can be implemented as an electronic device, are described above. Fig. 5 is a schematic structural diagram of an embodiment of an electronic device provided in the present application. As shown in fig. 5, the electronic device includes a memory 51 and a processor 52.
The memory 51 stores programs. In addition to the above-described programs, the memory 51 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 51 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The processor 52 is not limited to a central processing unit (CPU); it may also be a processing chip such as a graphics processing unit (GPU), a field-programmable gate array (FPGA), an embedded neural-network processor (NPU), or an artificial intelligence (AI) chip. The processor 52 is coupled to the memory 51 and executes the program stored in the memory 51, and the program, when run, performs the word processing method of the second and third embodiments.
Further, as shown in fig. 5, the electronic device may further include: communication components 53, power components 54, audio components 55, display 56, and other components. Only some of the components are schematically shown in fig. 5, and it is not meant that the electronic device comprises only the components shown in fig. 5.
The communication component 53 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 3G, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component 53 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 53 further comprises a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
A power supply component 54 provides power to the various components of the electronic device. The power components 54 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for an electronic device.
The audio component 55 is configured to output and/or input audio signals. For example, the audio component 55 includes a microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may further be stored in the memory 51 or transmitted via the communication component 53. In some embodiments, the audio component 55 also includes a speaker for outputting audio signals.
The display 56 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (14)

1. A method of word processing, comprising:
performing character segmentation on a text to be processed to obtain a plurality of characters;
obtaining a plurality of candidate word units from the plurality of characters, wherein each candidate word unit comprises at least one Chinese character;
generating a plurality of candidate texts for one or more units in the plurality of candidate word units, and calculating confidence degrees of the plurality of candidate texts;
selecting corrected text for the one or more cells based on the confidence level.
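For illustration only, the four steps of claim 1 can be sketched in Python as follows; the candidate table, the scoring function, and every identifier below are assumptions of this sketch rather than elements of the claimed method.

def segment(text):
    # Step 1: character segmentation of the text to be processed.
    return list(text)

def candidate_units(chars, table):
    # Step 2: look up candidate character units for each segmented character;
    # each candidate unit contains at least one Chinese character.
    return {i: table.get(c, []) for i, c in enumerate(chars)}

def candidate_texts(chars, units):
    # Step 3: generate candidate texts by substituting a candidate unit at
    # one position at a time (a simplification of "one or more units").
    texts = []
    for i, cands in units.items():
        for cand in cands:
            texts.append("".join(chars[:i]) + cand + "".join(chars[i + 1:]))
    return texts

def correct(text, table, score):
    # Step 4: keep the original text as a fallback and select the candidate
    # text with the highest confidence according to the scorer.
    chars = segment(text)
    texts = candidate_texts(chars, candidate_units(chars, table))
    texts.append(text)
    return max(texts, key=score)

# Toy usage with a hand-made candidate table and a trivial scorer.
print(correct("我像你", {"像": ["想", "象"]}, lambda t: 1.0 if t == "我想你" else 0.5))  # 我想你

In practice the scorer could be, for example, a language model over the candidate texts, but any function that maps a candidate text to a confidence value fits this sketch.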
2. The method of claim 1, wherein the obtaining a plurality of candidate character units from the plurality of characters comprises:
according to the pronunciation of a character, obtaining a second character whose pronunciation is the same as or similar to that of the character as a candidate character unit;
and/or,
according to the pronunciation of a combination of a predetermined number of adjacent characters in the text to be processed, obtaining a second character combination whose pronunciation is the same as or similar to that of the character combination as a candidate character unit.
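A minimal sketch of the pronunciation-based lookup in claim 2 follows; the pinyin table and the rule that readings differing only in tone count as "similar" are assumptions made for illustration.

# Same- or similar-pronunciation candidates (illustrative data and rule).
PINYIN = {"账": "zhang4", "帐": "zhang4", "章": "zhang1", "户": "hu4", "护": "hu4"}

def similar_pronunciation(a, b):
    # Assumption: readings that differ only in tone are treated as similar.
    return a[:-1] == b[:-1]

def pronunciation_candidates(char):
    p = PINYIN.get(char)
    if p is None:
        return []
    return [c for c, q in PINYIN.items()
            if c != char and (q == p or similar_pronunciation(q, p))]

print(pronunciation_candidates("帐"))  # ['账', '章']

The same lookup can be applied to a combination of adjacent characters by indexing the table with multi-character keys.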
3. The method of claim 1, wherein the obtaining a plurality of candidate character units from the plurality of characters comprises:
according to the glyph of a character, obtaining a third character whose glyph is the same as or similar to that of the character as a candidate character unit;
and/or,
according to the glyphs of a combination of a predetermined number of adjacent characters in the text to be processed, obtaining a third character combination whose glyph is the same as or similar to that of the character combination as a candidate character unit.
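A matching sketch for the glyph-based lookup in claim 3; the confusion sets below are hand-made assumptions, and a real system might instead compare stroke or component representations.

# Same- or similar-glyph candidates (illustrative confusion sets).
GLYPH_CONFUSION = {
    "己": ["已", "巳"],
    "未": ["末"],
    "人": ["入"],
}

def glyph_candidates(char):
    # Characters whose shape is the same as or similar to the given character.
    return GLYPH_CONFUSION.get(char, [])

print(glyph_candidates("己"))  # ['已', '巳']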
4. The word processing method of any one of claims 1 to 3, wherein the generating a plurality of candidate texts for one or more units of the plurality of candidate character units comprises:
splicing one or more units of the plurality of candidate character units in a preset order to obtain the plurality of candidate texts.
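The splicing step of claim 4 amounts to choosing one candidate unit per position and joining the choices in a fixed order; the example below uses the original left-to-right order of the text as the preset order, which is an assumption of this sketch.

from itertools import product

def splice(units_per_position):
    # units_per_position: one list of candidate units per position,
    # already arranged in the preset order.
    return ["".join(choice) for choice in product(*units_per_position)]

print(splice([["我"], ["想", "相"], ["你"]]))  # ['我想你', '我相你']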
5. The word processing method of claim 1, further comprising:
eliminating candidate texts whose confidence degrees are lower than a preset threshold.
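Claim 5 is a simple filter over the candidate texts; the threshold value and the confidence mapping below are illustrative assumptions.

def prune(candidate_texts, confidence, threshold=0.1):
    # Drop candidate texts whose confidence is below the preset threshold.
    return [t for t in candidate_texts if confidence(t) >= threshold]

print(prune(["我想你", "我象你"], {"我想你": 0.9, "我象你": 0.02}.get))  # ['我想你']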
6. The word processing method of claim 1, further comprising:
comparing the text to be processed with the corrected text;
determining, according to the comparison result, characters in the text to be processed that differ from the characters at the corresponding positions in the corrected text as error characters.
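Claim 6 compares the two texts position by position; the sketch below assumes the corrected text has the same length as the text to be processed, which holds for character-for-character substitutions.

def error_positions(original, corrected):
    # Characters of the original that differ from the corrected text at the
    # same position are reported as error characters.
    return [(i, o, c) for i, (o, c) in enumerate(zip(original, corrected)) if o != c]

print(error_positions("我像你", "我想你"))  # [(1, '像', '想')]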
7. A word processing device, comprising:
a segmentation module, configured to perform character segmentation on a text to be processed to obtain a plurality of characters;
a first obtaining module, configured to obtain a plurality of candidate character units from the plurality of characters, wherein each candidate character unit comprises at least one Chinese character;
a first generation module, configured to generate a plurality of candidate texts for one or more units of the plurality of candidate character units;
a first calculation module, configured to calculate confidence degrees of the plurality of candidate texts; and
a first selection module, configured to select corrected text for the one or more units based on the confidence degrees.
8. The word processing device of claim 7, wherein the first obtaining module comprises:
a first acquisition unit, configured to obtain, according to the pronunciation of a character, a second character whose pronunciation is the same as or similar to that of the character as a candidate character unit;
and/or,
configured to obtain, according to the pronunciation of a combination of a predetermined number of adjacent characters in the text to be processed, a second character combination whose pronunciation is the same as or similar to that of the character combination as a candidate character unit.
9. The word processing device of claim 7, wherein the first obtaining module comprises:
a second acquisition unit, configured to obtain, according to the glyph of a character, a third character whose glyph is the same as or similar to that of the character as a candidate character unit;
and/or,
configured to obtain, according to the glyphs of a combination of a predetermined number of adjacent characters in the text to be processed, a third character combination whose glyph is the same as or similar to that of the character combination as a candidate character unit.
10. The word processing device of any one of claims 7 to 9, wherein the first generation module is further configured to:
splice one or more units of the plurality of candidate character units in a preset order to obtain the plurality of candidate texts.
11. The word processing device of claim 7, further comprising:
an eliminating module, configured to eliminate candidate texts whose confidence degrees are lower than a preset threshold.
12. The word processing device of claim 7, further comprising:
a comparison module, configured to compare the text to be processed with the corrected text; and
a determining module, configured to determine, according to the comparison result, characters in the text to be processed that differ from the characters at the corresponding positions in the corrected text as error characters.
13. An electronic device, comprising:
a memory, configured to store a program; and
a processor, configured to execute the program stored in the memory, wherein the program, when executed, performs the word processing method of any one of claims 1 to 6.
14. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the word processing method according to any one of claims 1 to 6.
CN202010328148.3A 2020-04-23 2020-04-23 Word processing method and device, electronic equipment and computer readable storage medium Active CN113553832B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010328148.3A CN113553832B (en) 2020-04-23 2020-04-23 Word processing method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010328148.3A CN113553832B (en) 2020-04-23 2020-04-23 Word processing method and device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN113553832A (en) 2021-10-26
CN113553832B (en) 2024-07-23

Family

ID=78129444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010328148.3A Active CN113553832B (en) 2020-04-23 2020-04-23 Word processing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113553832B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001229162A (en) * 2000-02-15 2001-08-24 Matsushita Electric Ind Co Ltd Method and device for automatically proofreading chinese document
CN1387650A (en) * 1999-11-05 2002-12-25 微软公司 Language input architecture for converting one text form to another text form with minimized typographical errors and conversion errors
US20080221890A1 (en) * 2007-03-06 2008-09-11 Gakuto Kurata Unsupervised lexicon acquisition from speech and text
CN106528532A (en) * 2016-11-07 2017-03-22 上海智臻智能网络科技股份有限公司 Text error correction method and device and terminal
US20180349346A1 (en) * 2017-06-02 2018-12-06 Apple Inc. Lattice-based techniques for providing spelling corrections
CN110457688A (en) * 2019-07-23 2019-11-15 广州视源电子科技股份有限公司 Error correction processing method and device, storage medium and processor
CN110807319A (en) * 2019-10-31 2020-02-18 北京奇艺世纪科技有限公司 Text content detection method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN113553832B (en) 2024-07-23

Similar Documents

Publication Publication Date Title
JP5462001B2 (en) Contextual input method
CN112016310A (en) Text error correction method, system, device and readable storage medium
US8543375B2 (en) Multi-mode input method editor
KR101435265B1 (en) Method for disambiguating multiple readings in language conversion
TWI443551B (en) Method and system for an input method editor and computer program product
US7917355B2 (en) Word detection
US10803241B2 (en) System and method for text normalization in noisy channels
CN110765763A (en) Error correction method and device for speech recognition text, computer equipment and storage medium
CN101815996A (en) Detect name entities and neologisms
KR20150128921A (en) Detection and reconstruction of east asian layout features in a fixed format document
CN101133411A (en) Fault-tolerant romanized input method for non-roman characters
WO2015139497A1 (en) Method and apparatus for determining similar characters in search engine
US8725497B2 (en) System and method for detecting and correcting mismatched Chinese character
TWI567569B (en) Natural language processing systems, natural language processing methods, and natural language processing programs
US8543382B2 (en) Method and system for diacritizing arabic language text
CN113282701B (en) Composition material generation method and device, electronic equipment and readable storage medium
CN104239289A (en) Syllabication method and syllabication device
CN113642316A (en) Chinese text error correction method and device, electronic equipment and storage medium
CN104182381A (en) character input method and system
US20200285324A1 (en) Character inputting device, and non-transitory computer readable recording medium storing character inputting program
JP2018163660A (en) Method and system for readability evaluation based on english syllable calculation method
JP2011008784A (en) System and method for automatically recommending japanese word by using roman alphabet conversion
CN111046627A (en) Chinese character display method and system
KR102468975B1 (en) Method and apparatus for improving accuracy of recognition of precedent based on artificial intelligence
CN113553832B (en) Word processing method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant