CN111209748B - Error word recognition method, related device and readable storage medium - Google Patents

Error word recognition method, related device and readable storage medium

Info

Publication number
CN111209748B
Authority
CN
China
Prior art keywords: word, new, text, original, original text
Prior art date
Legal status
Active
Application number
CN201911293645.8A
Other languages
Chinese (zh)
Other versions
CN111209748A (en)
Inventor
王培养
方昕
李永超
刘海波
方磊
Current Assignee
Hefei Ustc Iflytek Co ltd
Original Assignee
Hefei Ustc Iflytek Co ltd
Priority date
Filing date
Publication date
Application filed by Hefei Ustc Iflytek Co ltd
Priority to CN201911293645.8A
Publication of CN111209748A
Application granted
Publication of CN111209748B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses an error word recognition method, a related device and a readable storage medium. An original text to be recognized may contain runs of consecutive single words caused by one or more word-writing errors. In the method, original word substrings are generated by combining consecutive single words in the original text, newly added multi-words corresponding to the original text are determined according to the original word substrings, and the error words in the original text are identified based on the newly added multi-words. Therefore, based on the scheme of the application, the recognition of error words in text can be realized.

Description

Error word recognition method, related device and readable storage medium
Technical Field
The present application relates to the field of natural language processing, and more particularly, to an error word recognition method, a related device, and a readable storage medium.
Background
With the rapid development of information processing technology and the internet, traditional text work has been almost completely replaced by computers, and electronic texts such as electronic books, electronic newspapers, electronic mails, office documents, blogs and microblogs have become part of people's daily life. At the same time, wrongly written words in text are becoming more and more common, and finding them is an important aspect of text quality inspection.
Therefore, how to identify error words in text is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the foregoing, the present application provides a method, related apparatus, and readable storage medium for recognizing a misclassification word. The specific scheme is as follows:
a method of identifying a misclassification word, comprising:
acquiring an original word sub-string in an original text to be identified, wherein the original word sub-string is generated by continuous word combination in the original text;
determining a new multi-word corresponding to the original text according to the original word substring;
and identifying the misclassified words in the original text based on the newly added multi-word.
Optionally, the determining, according to the original word substring, a new added multi-word corresponding to the original text includes:
acquiring a new word sub-string corresponding to the original word sub-string, wherein the length and the tone-free pinyin of the new word sub-string and the original word sub-string are the same;
replacing the original word sub-string with the new word sub-string to obtain a new sentence corresponding to the original text;
acquiring the multi-word of the new sentence corresponding to the original text;
and determining the new multi-word from the multi-word words of the new sentence corresponding to the original text.
Optionally, the replacing the original word sub-string with the new word sub-string to obtain a new sentence corresponding to the original text includes:
replacing the original word substring with the new word substring to obtain a new sentence to be selected, which corresponds to the original text;
and selecting a new sentence corresponding to the original text from the new sentences to be selected.
Optionally, the determining the new multi-word from the multi-word words of the new sentence corresponding to the original text includes:
acquiring word frequency of each multi-word in the multi-word of the new sentence corresponding to the original text;
and arranging the multi-word words of the new sentence corresponding to the original text in a descending order according to word frequency, and selecting a preset number of multi-word words which are ranked at the front as the new multi-word words.
Optionally, the identifying the misclassified word in the original text based on the newly added multi-word includes:
selecting a target new sentence from the new sentences corresponding to the original text, wherein the target new sentence is a new sentence containing the new added multi-word;
acquiring a replacement word corresponding to the original text based on the target new sentence;
And determining the character substring corresponding to the replacement word in the original text as the misclassification word in the original text.
Optionally, the acquiring, based on the target new sentence, a replacement word corresponding to the original text includes:
replacing an original sentence corresponding to the target new sentence in the original text with the target new sentence to obtain a new text corresponding to the original text;
acquiring new multi-word in the new text corresponding to the original text;
and determining the replacement word corresponding to the original text according to the new multi-word in the new text corresponding to the original text.
Optionally, the replacing the original sentence corresponding to the target new sentence in the original text with the target new sentence to obtain a new text corresponding to the original text includes:
replacing an original sentence corresponding to the target new sentence in the original text by the target new sentence to generate a new text set to be selected;
calculating the similarity between each new text in the new text set to be selected and the original text;
and selecting the new text corresponding to the original text from the new text set to be selected according to the similarity between each new text and the original text.
Optionally, the calculating the similarity between each new text in the new text set to be selected and the original text includes:
acquiring a text information feature vector of each new text and a feature characterization vector of each new text;
acquiring a text information feature vector of the original text and a feature characterization vector of the original text;
and calculating the similarity between each new text and the original text according to the text information feature vector of each new text, the feature characterization vector of each new text, the text information feature vector of the original text and the feature characterization vector of the original text.
Optionally, obtaining the feature characterization vector includes:
acquiring text information feature vectors, new vocabulary feature vectors and replacement word feature vectors;
and generating the feature characterization vector according to the text information feature vector, the new vocabulary feature vector and the replacement word feature vector.
Optionally, the selecting the new text corresponding to the original text from the new text set to be selected according to the similarity between each new text and the original text includes:
and arranging the new texts to be selected in a descending order according to the similarity, and selecting a preset number of new texts with the front sequence and/or selecting a preset number of new texts with the rear sequence as the new texts corresponding to the original texts.
A misclassification word recognition device, comprising:
the device comprises an acquisition unit, a recognition unit and a recognition unit, wherein the acquisition unit is used for acquiring an original word sub-string in an original text to be recognized, and the original word sub-string is generated by continuous word combination in the original text;
the determining unit is used for determining a new multi-word corresponding to the original text according to the original word substring;
and the identification unit is used for identifying the misclassified words in the original text based on the newly added multi-word words.
Optionally, the determining unit includes:
a new word sub-string obtaining unit, configured to obtain a new word sub-string corresponding to the original word sub-string, where the length and the tone-free pinyin of the new word sub-string are the same as those of the original word sub-string;
a new sentence acquisition unit, configured to replace the original word sub-string with the new word sub-string, to obtain a new sentence corresponding to the original text;
a multi-word obtaining unit for obtaining the multi-word of the new sentence corresponding to the original text;
and the new multi-word determining unit is used for determining the new multi-word from the multi-word of the new sentence corresponding to the original text.
Optionally, the new sentence acquisition unit includes:
a new sentence to be selected obtaining unit, configured to replace the original word sub-string with the new word sub-string, to obtain a new sentence to be selected corresponding to the original text;
And a new sentence selecting unit, configured to select the new sentence corresponding to the original text from the new sentences to be selected.
Optionally, the new multi-word determining unit includes:
a word frequency acquisition unit, configured to acquire a word frequency of each multi-word in the multi-word of the new sentence corresponding to the original text;
the new multi-word selecting unit is used for arranging the multi-words of the new sentences corresponding to the original text in a descending order according to word frequency, and selecting a preset number of multi-words which are ranked at the front as the new multi-word.
Optionally, the identifying unit includes:
a target new sentence selecting unit, configured to select a target new sentence from the new sentences corresponding to the original text, where the target new sentence is a new sentence containing the new added multi-word;
a replacement word obtaining unit, configured to obtain a replacement word corresponding to the original text based on the target new sentence;
and the error word determining unit is used for determining that the character substring corresponding to the replacement word in the original text is the error word in the original text.
Optionally, the replacement word obtaining unit includes:
A new text obtaining unit, configured to replace an original sentence corresponding to the target new sentence in the original text with the target new sentence, and obtain a new text corresponding to the original text;
a new text newly increased word obtaining unit, configured to obtain a new increased word in the new text corresponding to the original text;
and the replacement word determining unit is used for determining the replacement word corresponding to the original text according to the new multi-word in the new text corresponding to the original text.
Optionally, the new text obtaining unit includes:
the new text set to be selected generating unit is used for replacing an original sentence corresponding to the target new sentence in the original text by the target new sentence to generate a new text set to be selected;
a similarity calculating unit, configured to calculate a similarity between each new text in the new text set to be selected and the original text;
and a new text selection unit, configured to select, according to the similarity between each new text and the original text, the new text corresponding to the original text from the new text set to be selected.
Optionally, the similarity calculating unit includes:
The obtaining unit is used for obtaining the text information feature vector of each new text and the feature characterization vector of each new text; acquiring a text information feature vector of the original text and a feature characterization vector of the original text;
the computing unit is used for computing the similarity between each new text and the original text according to the text information feature vector of each new text, the feature characterization vector of each new text, the text information feature vector of the original text and the feature characterization vector of the original text.
Optionally, the acquiring unit is specifically configured to:
acquiring text information feature vectors, new vocabulary feature vectors and replacement word feature vectors;
and generating the feature characterization vector according to the text information feature vector, the new vocabulary feature vector and the replacement word feature vector.
Optionally, the new text selection unit is specifically configured to:
and arranging the new texts to be selected in a descending order according to the similarity, and selecting a preset number of new texts with the front sequence and/or selecting a preset number of new texts with the rear sequence as the new texts corresponding to the original texts.
A misclassification word recognition system comprising a memory and a processor;
The memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the method for recognizing a misclassification word as described above.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a method for identifying a misclassification word as described above.
By means of the above technical scheme, the application discloses an error word recognition method, related equipment and a readable storage medium. The original text to be recognized may contain runs of consecutive single words caused by one or more word-writing errors. In the method, an original word substring is generated by combining consecutive single words in the original text, newly added multi-words corresponding to the original text are determined according to the original word substring, and the error words in the original text are identified based on the newly added multi-words. Therefore, based on the scheme of the application, the recognition of error words in text can be realized.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a schematic flow chart of a method for recognizing a misclassification word according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a method for obtaining a new word sub-string corresponding to an original word sub-string according to an embodiment of the present application;
FIG. 3 is a schematic diagram of obtaining at least one new sentence corresponding to each original sentence according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a device for recognizing a misclassification word according to an embodiment of the present application;
fig. 5 is a block diagram of a hardware structure of a system for recognizing a misclassification word according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
For reference and clarity, the terms used in the embodiments of the present application are explained as follows:
Word: a single character (single word);
Multi-word: a word formed of 2 or more characters;
Word substring: a string of consecutive words that do not form a multi-word; it may consist of 1, 2 or more characters;
Pinyin substring: the sequence formed by the toneless pinyin of each word in a word substring;
Error word: a miswritten word or sensitive word in a text, especially one formed of 2 or more consecutive words;
General vocabulary dictionary: a dictionary formed by general-purpose words of 2 or more characters;
Business vocabulary dictionary: a dictionary formed by domain-specific words of 2 or more characters;
Stop word dictionary: a general word resource of words of 2 or more characters;
Chinese character-pinyin resource: a resource for looking up the pinyin or toneless pinyin corresponding to a Chinese character;
Pinyin-Chinese character resource: a resource for looking up all Chinese characters corresponding to a given pinyin or toneless pinyin;
Toneless pinyin (non-modulated pinyin): pinyin without tone marks;
Multi-word dictionary: a dictionary formed by words of 2 or more characters, as shown in the following table:
snowman fight
Snowman fort
Ice hockey
Error word mapping resource: a resource that establishes the relationship between an error word and the word actually expressing the intended meaning, as shown in the following table:
Error word | Correct word
School bus station | Snowman fight
Bag epitaxy for school students | Snowman fort
Aronia melanocarpa (Roxb.) kuntze | Ice hockey
The error word recognition method disclosed in the present application will be described in detail by the following embodiments.
Referring to fig. 1, fig. 1 is a schematic flow chart of an error word recognition method disclosed in an embodiment of the present application, which includes the following steps:
s101: and acquiring an original word substring in the original text to be identified.
In the present application, the original word substring is generated from consecutive single words in the original text. There are various ways to obtain the original word substrings in the original text: for example, if the original word substrings of the original text to be recognized have already been generated in advance, they can be obtained directly; if they have not yet been generated, they can be obtained by generating them.
S102: and determining a new multi-word corresponding to the original text according to the original word substring.
In the present application, for each original word substring, the newly added multi-word determined based on that original word substring is a multi-word that does not exist in the original sentence at the corresponding position in the original text. It should be noted that, in the present application, all the multi-words corresponding to an original word substring may be determined according to the original word substring and compared with the multi-words of the original sentence in which the original word substring is located; the multi-words that appear among the former but not among the multi-words of the original sentence are taken as the newly added multi-words corresponding to the original text. Alternatively, multi-words satisfying a certain condition may be further selected, from among the multi-words that are not in the original sentence, as the newly added multi-words corresponding to the original text; the present application does not impose any limitation on this.
S103: and identifying the misclassified words in the original text based on the newly added multi-word.
In the present application, the word substring corresponding to a newly added multi-word in the original text can be determined to be an error word in the original text. Alternatively, word substrings meeting certain conditions can be further screened, from among the word substrings corresponding to the newly added multi-words in the original text, as the error words in the original text; the present application does not impose any limitation on this.
In the method, original word substrings are generated by combining consecutive single words in the original text, newly added multi-words corresponding to the original text are determined according to the original word substrings, and the consecutive words corresponding to the newly added multi-words in the original text can then be identified as error words. Therefore, based on the scheme of the application, the recognition of error words in text can be realized.
As an embodiment, in the present application, a method for generating an original word sub-string of an original text is disclosed, which may specifically include the following steps:
S201: and segmenting each original sentence in the original text to obtain a word sequence of each original sentence.
In the application, a word segmentation model trained in advance can be adopted to segment each original sentence in the original text, so as to obtain a word sequence of each original sentence.
For ease of understanding, assume that one of the original sentences of one original paragraph in one original text is: "little child and child are standing at school, influence the process of the small flower's Aronia melanocarpa and the construction of the bag of the child of the Arman. "for this original sentence, word segmentation processing is performed on the original sentence, so as to obtain the word sequence of the original sentence, which is specifically" Xiaoming\and\darling\positive\in\learning\man\standing\, the method comprises the steps of affecting the process of the microflower, the stem, the process of the catalpa, the learning of the Arman, the inclusion of the epitaxy and the construction. And \is carried out.
It should be noted that, based on the word sequences of all the original sentences, the word sequences of the original text can be obtained.
For ease of understanding, suppose a piece of original text is divided, according to its paragraph structure, into m original paragraphs P1, P2, ..., Pi, ..., Pm. An original paragraph Pi is split by sentence-delimiting symbols such as ";", "。", "!" and "?" into ni original sentences Si1, ..., Sij, ..., Sini. Word segmentation is performed on each original sentence Sij to obtain single words and multi-words, and these single words and multi-words form the word sequence of the original text. In the present application, i denotes the paragraph index and is an integer with 1 ≤ i ≤ m; j denotes the sentence index and is an integer with 1 ≤ j ≤ ni; t denotes the word index and is an integer with 1 ≤ t ≤ pij, pij being the number of words in the sentence Sij.
S202: and matching the word sequence of each original sentence with a preset multi-word dictionary, and determining the words in the word sequence of each original sentence.
In the present application, the preset multi-word dictionary may specifically be any one or more of the general vocabulary dictionary, the business vocabulary dictionary, the stop word dictionary and the like. By matching the word sequence of each original sentence against the preset multi-word dictionary, the successfully matched multi-words and the unmatched single words in the word sequence of each original sentence are determined. It should be noted that the multi-words of all the original sentences constitute the multi-words of the original text.
For easy understanding, the word sequences of the original sentences are named' Xiaoming\and\non-child\at\learning\man\station\, the method comprises the steps of affecting the process of the microflower, the stem, the process of the catalpa, the learning of the Arman, the inclusion of the epitaxy and the construction. After matching the \ "with the preset multi-word dictionary, the words in the word sequence of the original sentence can be determined as" sum "," positive "," beating "," learning "," person "," standing "," handle "," stick "," catalpa "," and "," learning "," person "," bag "," epitaxy ".
S203: and merging continuous words in the word sequence of each original sentence to obtain an original word substring in the original text.
In the application, the original word substring in each original sentence can be obtained by merging the continuous words in the word sequence of each original sentence, and for convenience of understanding, the continuous words of the word sequences of the original sentences are respectively combined in the ways of beating, learning, standing, handle, stick, mansari, epitaxy, so that the original word substrings in the original text can be obtained: "and", "standing in the middle of school", "mountain ash", "and", "school bag epitaxy", ". ".
It should be noted that, based on the original word substrings in all the original sentences, the original word substrings in the original text can be obtained. In the present application, SCijk can be used to represent an original word substring in the original text, where k denotes the index of the original word substring and is an integer with 1 ≤ k ≤ qij.
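To make steps S201 to S203 concrete, the following sketch shows one possible way of producing original word substrings. It is an illustrative assumption rather than the patented implementation: the segmenter, the dictionary contents and the helper names (segment, MULTI_WORD_DICT, extract_original_word_substrings) are invented for the example.

```python
from typing import List

# Assumed toy multi-word dictionary; the patent uses general/business/stop-word dictionaries.
MULTI_WORD_DICT = {"影响", "进程", "建造"}

def segment(sentence: str, max_len: int = 4) -> List[str]:
    """Naive forward maximum-match segmenter, used only to keep the sketch self-contained;
    the patent assumes a pre-trained word segmentation model instead."""
    tokens, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 1, -1):
            if sentence[i:i + length] in MULTI_WORD_DICT:
                tokens.append(sentence[i:i + length])
                i += length
                break
        else:
            tokens.append(sentence[i])  # single word that formed no multi-word
            i += 1
    return tokens

def extract_original_word_substrings(sentence: str) -> List[str]:
    """S201-S203: segment the sentence, keep dictionary multi-words apart, and merge
    runs of consecutive single words into original word substrings."""
    substrings, run = [], []
    for token in segment(sentence):
        if len(token) >= 2:          # S202: token matched the preset multi-word dictionary
            if run:
                substrings.append("".join(run))  # S203: close the current run of single words
                run = []
        else:
            run.append(token)        # consecutive single words accumulate
    if run:
        substrings.append("".join(run))
    return substrings
```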
As an implementation manner, in the present application, an implementation manner of determining a new added multi-word corresponding to an original text according to an original word sub-string is disclosed, and the method may include the following steps:
S301: and acquiring a new word sub-string corresponding to the original word sub-string.
In the application, the new character substring has the same length and the tone-free pinyin as the original character substring. As an implementation manner, the present application discloses an implementation manner for obtaining a new word sub-string corresponding to an original word sub-string, which may include the following steps:
s401: and obtaining the non-modulated pinyin sequence corresponding to the original character substring.
In the present application, the original word substring can be split character by character to obtain the character sequence of the original word substring; for each character in this sequence, the toneless (non-modulated) pinyin is obtained by looking up the Chinese character-pinyin resource, and the toneless pinyin sequence corresponding to the original word substring is thus obtained.
S402: and acquiring a word network corresponding to the non-modulated pinyin sequence.
In the application, for each pinyin in the non-modulated pinyin sequence, all words corresponding to each pinyin are obtained by searching pinyin-Chinese character resources, and all words corresponding to all the pinyin are formed into a word network by taking the pinyin sequence as a time sequence.
S403: and extracting a word from each moment of the word network according to the time sequence relation of the non-modulated pinyin sequence, and combining to obtain a new word substring corresponding to the original word substring.
According to the application, a word can be extracted from each moment of the word network according to the time sequence relation of the non-modulated pinyin sequence for combination to obtain a new word sub-string corresponding to the original word sub-string, and all the new word sub-strings corresponding to the original word sub-string can be obtained through multiple combinations.
For easy understanding, referring to fig. 2, fig. 2 is a schematic diagram of a method for obtaining a new word string corresponding to an original word string according to an embodiment of the present application, in fig. 2, the "stick-stick" of the original word string "is taken as an example to describe, and as can be seen in fig. 2, the" stick-stick "of the word string" can be split into a word sequence "; by searching the Chinese character-pinyin resource, the pinyin sequence 'de/di-bing-gun-qiu' of the character sequence is obtained; looking up pinyin-Chinese character resources to obtain all the characters of each pinyin and generate a character network. By connecting any word in each row of the word network with any word in the next moment, a new word substring of the word substring 'Aronia melanocarpa' can be generated, specifically, a 'hockey ball, cake roller fur, and jockey roller …' shown in fig. 2.
Based on the above, a new word sub-string corresponding to each original word sub-string can be obtained.
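A minimal sketch of steps S401 to S403 follows. The toy CHAR_TO_PINYIN and PINYIN_TO_CHARS tables stand in for the Chinese character-pinyin and pinyin-Chinese character resources, and excluding the original substring from the output is an added assumption.

```python
from itertools import product
from typing import Dict, List

# Assumed toy resources; a real system would load complete character-pinyin tables.
CHAR_TO_PINYIN: Dict[str, str] = {"的": "de", "冰": "bing", "棍": "gun", "球": "qiu"}
PINYIN_TO_CHARS: Dict[str, List[str]] = {
    "de": ["的", "得"],
    "bing": ["冰", "病", "饼"],
    "gun": ["棍", "滚"],
    "qiu": ["球", "求"],
}

def new_word_substrings(original_substring: str) -> List[str]:
    """S401-S403: candidates with the same length and the same toneless pinyin."""
    # S401: toneless pinyin sequence of the original word substring
    pinyin_seq = [CHAR_TO_PINYIN[ch] for ch in original_substring]
    # S402: word network, one candidate character list per pinyin position (time order)
    network = [PINYIN_TO_CHARS[py] for py in pinyin_seq]
    # S403: take one character per position and combine; all combinations are enumerated
    combos = ("".join(chars) for chars in product(*network))
    return [c for c in combos if c != original_substring]
```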
S302: and replacing the original word substring with the new word substring to obtain a new sentence corresponding to the original text.
In the application, at least one original word sub-string in each original sentence can be replaced by a new word sub-string corresponding to the at least one original word sub-string, so as to obtain at least one new sentence corresponding to each original sentence.
For ease of understanding, please refer to fig. 3, fig. 3 is a schematic diagram of obtaining at least one new sentence corresponding to each original sentence according to an embodiment of the present application, and in fig. 3, the original word strings "and", "standing on learning," and "," learning package "are used. As shown in fig. 3, the original word sub-strings are "and", "drink", "what" …, the original word sub-strings are "playing a snowman fight", "playing a learning man fight", "finishing a learning man fight" …, the new word sub-strings are "ice hockey", "sick roller haunch", "binker fur" …, the new word sub-strings are "and", "drink", "what" …, and the learning man packet of the original word sub-strings "is" snowman fort "," learning benevolence fort "," … ….
By the ' standing at the beginning of learning ' of the original character sub-string, the ' corresponding new character sub-string ' is in the battle of snowman, ' replacing ' the standing at the beginning of learning ', the ' replacing ' of the hockey stick of the corresponding new character sub-string ' of the original character sub-string ', and replacing the snowman fort of the new character string corresponding to the original character string by the snowman fort of the original character string to obtain a new sentence of ' little and child ' shown in figure 3, namely, the snowman fort which affects the progress of the small-flower hockey and the construction of the arom. By the same token, other new sentences, such as the new sentence 'little bright drinking child' in fig. 3, can be generated by replacing the corresponding original word substring with different new word substrings, and the disease stick process of floret and the learning fort construction of the aro are affected. "
In the present application, a new sentence corresponding to an original text may be obtained based on at least one new sentence corresponding to each original sentence, and as an implementation manner, a specific implementation manner of replacing the original word sub-string with the new word sub-string to obtain a new sentence corresponding to the original text is disclosed in the present application, that is, all new sentences corresponding to all original sentences are used as new sentences corresponding to the original text.
However, each original sentence may have a plurality of corresponding new sentences, and if all new sentences corresponding to all original sentences are used as new sentences corresponding to the original text, a large number of new sentences corresponding to the original text will be generated, and processing a large number of new sentences corresponding to the original text will affect the recognition efficiency of the misclassification words.
In order to improve the recognition efficiency of error words, as another implementation manner, the application discloses a specific implementation of replacing the original word substring with the new word substring to obtain the new sentence corresponding to the original text: all new sentences corresponding to all original sentences, obtained by replacing the original word substrings with the new word substrings, are taken as new sentences to be selected, and the new sentence corresponding to the original text is then selected from the new sentences to be selected.
It should be noted that, as can be seen from fig. 3, among the multiple new sentences corresponding to each original sentence, there are new sentences in which the new word substrings do not form any multi-word; such new sentences are of no help for identifying or correcting error words. There are also new sentences in which the new word substrings do form multi-words; these new sentences can be used for identifying error words and correcting them.
Therefore, as an implementation manner, the application discloses a realization method for selecting a new sentence corresponding to the original text from the new sentences to be selected, which specifically comprises the following steps: and selecting a new sentence containing the multi-word from the new sentences to be selected as a new sentence corresponding to the original text.
As still another embodiment, in the present application, there is further provided a method for selecting a new sentence corresponding to an original text from new sentences to be selected, which may specifically be as follows: and acquiring the confidence score of each original sentence and the confidence score of each new sentence corresponding to the original sentence, and selecting a new sentence with the confidence score larger than the confidence score of the original sentence from all new sentences corresponding to the original sentence as a new sentence corresponding to the original text. It should be noted that, in the present application, a pre-trained language model may be used to calculate a confidence score of each original sentence and a confidence score of each new sentence corresponding to the original sentence.
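The following sketch illustrates the confidence-based variant just described: new word substrings are substituted into the original sentence, and only candidates that a language model scores higher than the original are kept. The scorer is passed in as a parameter because the patent does not fix a particular model; all names are assumptions for illustration.

```python
from typing import Callable, Dict, List

def select_new_sentences(
    original_sentence: str,
    substitutions: Dict[str, List[str]],     # original word substring -> its new word substrings
    confidence: Callable[[str], float],      # assumed pre-trained language-model scorer
) -> List[str]:
    """S302: replace original word substrings with new word substrings, then keep the
    candidate new sentences whose confidence score exceeds that of the original sentence."""
    base_score = confidence(original_sentence)
    selected = []
    for old_substring, candidates in substitutions.items():
        for new_substring in candidates:
            # One substitution at a time; combining several substitutions is also possible.
            candidate_sentence = original_sentence.replace(old_substring, new_substring, 1)
            if confidence(candidate_sentence) > base_score:
                selected.append(candidate_sentence)
    return selected
```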
S303: and acquiring the multi-word of the new sentence corresponding to the original text.
In the present application, word segmentation may be performed on each new sentence corresponding to the original text to obtain the word sequence of each new sentence; the word sequence of each new sentence is matched against a preset multi-word dictionary to determine the multi-words in that new sentence, and the multi-words that are not among the multi-words of the corresponding original sentence are taken as the multi-words of that new sentence. All the multi-words of these new sentences together form the multi-words of the new sentences corresponding to the original text. The preset multi-word dictionary may be one or more of the general vocabulary dictionary, the business vocabulary dictionary, the stop word dictionary and the like.
S304: and determining the new multi-word from the multi-word words of the new sentence corresponding to the original text.
In the present application, a method for selecting the newly added multi-words from the multi-words of the new sentences corresponding to the original text is also provided, which may be as follows: acquiring the word frequency of each multi-word among the multi-words of the new sentences corresponding to the original text, arranging these multi-words in descending order of word frequency, and selecting a preset number of the top-ranked multi-words as the newly added multi-words.
in the present application, the preset number may be determined according to the specific situation, for example, K number, where K is an integer greater than or equal to 1 and less than or equal to the total number of the multiple words of the new sentence corresponding to the original text. Or the preset number may be determined according to a percentage of the total number of the multi-word words of the new sentence corresponding to the original text, for example, a value of 20% of the total number of the multi-word words of the new sentence corresponding to the original text is the preset number.
As an embodiment, in the present application, an implementation of identifying a misword in an original text based on a newly added multi-word is disclosed, which may include the steps of:
s501: and selecting a target new sentence from the new sentences corresponding to the original text.
In the application, the target new sentence is a new sentence containing the newly added multi-word in the new sentence corresponding to the original text.
S502: and acquiring a replacement word corresponding to the original text based on the target new sentence.
In the application, the replacement words corresponding to the original text can be all newly added multi-word words or part of newly added multi-word words. The specific implementation will be described in detail by the following examples, which will not be described in detail here.
S503: and determining the character substring corresponding to the replacement word in the original text as the misclassification word in the original text.
In the application, after the error word in the original text is determined, the error word in the original text can be further replaced by the corresponding replacement word so as to correct the error word, and in addition, the position information of the error word in the original text and the corresponding correction result can be further displayed so as to assist manual verification.
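A small sketch of steps S501 to S503 follows. It assumes that, when the new sentences were generated, the original word substring replaced by each replacement word was recorded; the mapping and the function name are illustrative assumptions.

```python
from typing import Dict, List, Tuple

def locate_error_words(
    original_text: str,
    replacement_to_substring: Dict[str, str],   # replacement word -> original word substring it replaced
) -> List[Tuple[str, int, str]]:
    """S503: the word substring that a replacement word replaced is reported as an error word,
    together with its position in the original text and the suggested correction."""
    results = []
    for replacement_word, error_substring in replacement_to_substring.items():
        position = original_text.find(error_substring)   # first occurrence, for illustration only
        if position != -1:
            results.append((error_substring, position, replacement_word))
    return results
```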
As an embodiment, in the present application, an implementation manner of acquiring a replacement word corresponding to an original text based on a target new sentence is disclosed, which may include the steps of:
s601: and replacing the original sentence corresponding to the target new sentence in the original text by the target new sentence, and obtaining a new text corresponding to the original text.
In the application, at least one original sentence in the original text can be replaced by a target new sentence corresponding to the at least one original sentence, so that all the new texts are obtained. As an embodiment, all the new texts mentioned above may be determined as new texts corresponding to the original text. However, if the subsequent processing is performed on all the new texts, the recognition efficiency of the misclassification words is low, and some of the new texts possibly contain new misclassification words, which can cause adverse effects on the misclassification word recognition of the original text, so that part of the new texts can be selected from all the new texts, and the misclassification words of the original text can be recognized based on the part of the new texts, thereby improving the misclassification word recognition efficiency and accuracy.
As an implementation manner, the application provides another implementation manner for obtaining the new text corresponding to the original text, which can improve the recognition efficiency and accuracy of the misclassification word, and specifically can be as follows: and replacing an original sentence corresponding to the target new sentence in the original text with the target new sentence, generating a new text set to be selected, calculating the similarity between each new text in the new text set to be selected and the original text, and selecting the new text corresponding to the original text from the new text set to be selected according to the similarity between each new text and the original text.
It should be noted that, the new paragraph can be obtained by replacing the corresponding original sentence with at least one target new sentence. And replacing the corresponding original paragraph with at least one new paragraph to obtain new text. And obtaining a new text set to be selected through multiple different substitutions.
In addition, in the application, the new texts to be selected can be arranged in a descending order according to the similarity, and a preset number of new texts with the front sequence are selected, and/or a preset number of new texts with the rear sequence are selected as the new texts corresponding to the original texts.
S602: and acquiring the new multi-word in the new text corresponding to the original text.
In the application, the new multi-word in the new text corresponding to the original text is the new multi-word contained in the target new sentence in the new text corresponding to the original text.
S603: and determining the replacement word corresponding to the original text according to the new multi-word in the new text corresponding to the original text.
In the application, all newly increased words in the new text corresponding to the original text can be determined as the replacement words corresponding to the original text. A preset number of new multi-word words can also be selected from all the new multi-word words in the new text corresponding to the original text to be used as the replacement words corresponding to the original text. The present application is not limited in any way.
In the present application, a realization method for selecting the replacement word corresponding to the original text from all newly added multiple words in the new text corresponding to the original text is also provided, and the method may be: acquiring word frequency of each new multi-word in the new text corresponding to the original text or the original text; each new multi-word in the new text corresponding to the original text is arranged in a descending order according to word frequency, and a preset number of new multi-words with the front ordering are selected as the replacement words corresponding to the original text;
In the present application, the preset number may be determined according to the specific situation, for example, K number, where K is an integer greater than or equal to 1 and less than or equal to the total number of the new multi-word words in the new text corresponding to the original text. Or the preset number may be determined according to a percentage of the total number of the new multi-word words in the new text corresponding to the original text, for example, a value of 20% of the total number of the new multi-word words in the new text corresponding to the original text is the preset number.
As yet another embodiment, in the present application, an implementation of calculating a similarity between each new text in a set of new texts to be selected and an original text is disclosed, which may include the following steps:
s701: and acquiring the text information feature vector of each new text and the feature characterization vector of each new text.
In the present application, the text information feature vector of each new text may be composed of a word vector of each new text, a paragraph vector of each new text, and a chapter vector of each new text. In addition, the text information feature vector of each new text, the new vocabulary feature vector of each new text and the replacement word feature vector of each new text can be obtained first, and then the feature vector of each new text is generated according to the text information feature vector of each new text, the new vocabulary feature vector of each new text and the replacement word feature vector of each new text.
S702: and acquiring a text information feature vector of the original text and a feature characterization vector of the original text.
In the present application, the text information feature vector of the original text may be composed of a word vector of the original text, a paragraph vector of the original text, and a chapter vector of the original text. In addition, the text information feature vector of the original text, the new vocabulary feature vector of the original text and the replacement word feature vector of the original text can be obtained first, and then the feature characterization vector of the original text is generated according to the text information feature vector of the original text, the new vocabulary feature vector of the original text and the replacement word feature vector of the original text.
S703: and calculating the similarity between each new text and the original text according to the text information feature vector of each new text, the feature characterization vector of each new text, the text information feature vector of the original text and the feature characterization vector of the original text.
As an implementation manner, a first similarity between the text information feature vector of each new text and the text information feature vector of the original text and a second similarity between the feature vector of each new text and the feature vector of the original text may be calculated, and then the similarity between each new text and the original text is calculated according to the first similarity and the second similarity.
As an implementation manner, the first similarity and the second similarity may be accumulated, and the similarity between each new text and the original text may be calculated.
For ease of understanding, in the present application, suppose each new text is represented by "sample" and the original text is represented by "check material"; let sim1 denote the first similarity between the text information feature vector of a new text and the text information feature vector of the original text, and let sim2 denote the second similarity between the feature characterization vector of the new text and the feature characterization vector of the original text. The similarity between each new text and the original text can then be calculated according to the formula sim(sample, check material) = sim1 + sim2.
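A minimal sketch of this accumulation, assuming cosine similarity as the underlying measure (the patent does not name one):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def text_similarity(
    new_info_vec: np.ndarray, new_char_vec: np.ndarray,
    orig_info_vec: np.ndarray, orig_char_vec: np.ndarray,
) -> float:
    """Accumulate the first similarity (text information feature vectors) and the
    second similarity (feature characterization vectors)."""
    sim1 = cosine(new_info_vec, orig_info_vec)    # first similarity
    sim2 = cosine(new_char_vec, orig_char_vec)    # second similarity
    return sim1 + sim2
```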
In the application, an original text and a new text can be collectively called as a text, based on which, the application discloses a generation mode of a characteristic characterization vector of the text, and the mode can comprise the following steps:
s801: and acquiring a text information feature vector of the text.
In the present application, the text information feature vector may be composed of a word vector of the text, a paragraph vector of the text, and a chapter vector of the text.
S802: a new vocabulary feature vector of the text is obtained.
In the present application, the dimension of the new vocabulary feature vector of a text is the same as the number of newly added multi-words. If a text contains a newly added multi-word, the value of the dimension corresponding to that multi-word in the new vocabulary feature vector of the text is 1; if it does not contain it, the value is 0. Alternatively, the reverse convention may be used: if a text contains a newly added multi-word, the value of the dimension corresponding to that multi-word in the new vocabulary feature vector of the text is 0, and if it does not contain it, the value is 1.
For the sake of understanding, it is assumed that the number of newly increased words is 2, namely "hockey ball and snowman fort", and it is assumed that the new words are expressed by adopting an inverse method, the dimension value of the corresponding "hockey ball" in the new vocabulary feature vector is 0 if the "hockey ball" is contained, the dimension value of the corresponding "hockey ball" in the new vocabulary feature vector is 1 if the "hockey ball" is not contained, the dimension value of the corresponding "snowman fort" in the new vocabulary feature vector is 0 if the "snowman fort" is contained, and the dimension value of the corresponding "snowman fort" in the new vocabulary feature vector is 1 if the "snowman fort" is not contained. Assuming that only "hockey" is included in one text, the new vocabulary feature vector of the new text is [0,1], only "snowman fort" is included in one text, the new vocabulary feature vector of the new text is [1,0], and the "hockey" and "snowman fort" are included in one text, and the new vocabulary feature vector of the new text is [0,0].
S803: and acquiring the character vector of the replacement word of the text.
The specific implementation of this step will be described in detail by the following examples, which will not be described in detail here.
S804: and splicing the new vocabulary feature vector of the text and the replacement word feature vector of the text to generate the generated feature vector of the text.
For ease of understanding, the following examples are given in the present disclosure:
assuming that the new vocabulary feature vector of a text is [0,1], the replacement vocabulary feature vector of the text is [2/5,7/10,24/25,1/10,2/5,0,0,3/10,2/5,1/20,1/7,3/20,3/7], the generated feature vector of the text is [1,0,2/5,7/10,24/25,1/10,2/5,0,0,3/10,2/5,1/20,1/7,3/20,3/7].
S805: integrating the text information feature vector of the text with the generated feature vector of the text to generate a feature characterization vector of the text.
Assume that the text information feature vector of the text is denoted Vec_text and the generated feature vector of the text is denoted Vec_k-hot + Vec_(3+2*p), where Vec_k-hot denotes the new vocabulary feature vector of the text, Vec_(3+2*p) denotes the replacement word feature vector of the text, and "+" denotes the splicing of the two vectors. The feature characterization vector of the text is then Vec_text^T × (Vec_k-hot + Vec_(3+2*p)).
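The sketch below follows steps S801 to S805 under two stated assumptions: "+" in the formula is read as splicing (concatenation) of the new vocabulary feature vector and the replacement word feature vector, and the product Vec_text^T × (...) is read as an outer product flattened into a vector; other readings of the formula are possible.

```python
import numpy as np

def feature_characterization_vector(
    text_info_vec: np.ndarray,     # Vec_text, e.g. 550-dimensional (word + paragraph + chapter vectors)
    new_vocab_vec: np.ndarray,     # Vec_k-hot, one dimension per newly added multi-word
    replacement_vec: np.ndarray,   # Vec_(3+2*p), chapter-level and paragraph-level proportions
) -> np.ndarray:
    # S804: splice the new vocabulary feature vector and the replacement word feature vector
    generated_vec = np.concatenate([new_vocab_vec, replacement_vec])
    # S805: integrate with the text information feature vector (outer product, flattened - an assumption)
    return np.outer(text_info_vec, generated_vec).ravel()
```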
As an embodiment, in the present application, a text information feature vector generation method of a text is disclosed, specifically as follows:
s901: acquiring word vectors of the text, paragraph vectors of the text and chapter vectors of the text;
in the application, the word vector of each word in the text can be extracted through a pre-trained word vector model, and the word vectors of each word in the text are combined (using averaging or maximum value, etc.), so as to obtain the word vector of the text. Paragraph vectors of each paragraph in the text can be extracted through a pre-trained paragraph vector model, and the paragraph vectors of each paragraph in the text are combined (using averaging or maximum value, etc.), so that the paragraph vector of the text is obtained. The text chapter vector can also be extracted through a pre-trained chapter vector model.
S902: and splicing the word vector of the text, the paragraph vector of the text and the chapter vector of the text to form the text information feature vector of the text.
In the present application, assuming that the dimension of the word vector of the text is 300, the paragraph vector of the text is 100, and the dimension of the chapter vector of the text is 150, the dimension of the text information feature vector of the text is 300+100+150=550.
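For steps S901 and S902, the following sketch builds the text information feature vector, assuming averaging as the combination and the 300/100/150 dimensions given above; the pre-trained word, paragraph and chapter vector models themselves are outside the sketch.

```python
import numpy as np

def text_information_feature_vector(
    word_vecs: np.ndarray,        # (num_words, 300) word vectors from a pre-trained word vector model
    paragraph_vecs: np.ndarray,   # (num_paragraphs, 100) paragraph vectors
    chapter_vec: np.ndarray,      # (150,) chapter vector of the whole text
) -> np.ndarray:
    """S901-S902: combine word and paragraph vectors (here by averaging) and splice them with
    the chapter vector, giving a 300 + 100 + 150 = 550 dimensional text information feature vector."""
    word_vec = word_vecs.mean(axis=0)
    paragraph_vec = paragraph_vecs.mean(axis=0)
    return np.concatenate([word_vec, paragraph_vec, chapter_vec])
```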
As an implementation manner, in the present application, a generation manner of a text replacement word feature vector is also disclosed, which specifically includes:
s1001: and determining replacement words in the text, wherein the replacement words are words in the newly added multi-word.
In the application, each vocabulary in the text can be matched with the newly added multi-word, and the successfully matched vocabulary is the replacement word.
S1002: and acquiring text information, wherein the text information comprises the number of paragraphs contained in the text, the number of sentences contained in each paragraph, and the number of vocabularies contained in each sentence.
S1003: and generating a chapter level replacement word feature vector of the text and a paragraph level replacement word feature vector of the text according to the replacement words in the text and the text information.
In this step, the number of paragraphs with the replacement words in the text and the text information can be counted, the number of sentences with the replacement words in the text can be counted, and the chapter level replacement word feature vector of the text can be obtained according to the number of paragraphs with the replacement words in the text, the number of sentences with the replacement words in the text and the number of the replacement words in the text.
As an implementation manner, the chapter level replacement word feature vector of the text may be composed of a paragraph proportion of no replacement word in the text, a sentence proportion of no replacement word in the text, and a vocabulary proportion of non-replacement word in the text, wherein;
paragraph ratio without replacement word in text = 1-number of paragraphs with replacement word in text/number of paragraphs contained in text;
the proportion of sentences without replacement words in the text = 1-the number of sentences with replacement words in the text/the total number of sentences contained in the text, the number of sentences with replacement words in the text being the sum of the number of sentences with replacement words in each paragraph in the text, the total number of sentences contained in the text being the sum of the number of sentences contained in each paragraph in the text.
The word ratio of non-replacement words in the text = 1-the number of replacement words in the text/the total number of words contained in the text, the number of words with replacement words in the text being the sum of the number of replacement words per paragraph in the text, the number of replacement words per paragraph being the sum of the number of replacement words per sentence in the paragraph, the total number of words contained in the text being the sum of the number of words contained in each paragraph in the text, the number of words contained in each paragraph being the sum of the number of words of each sentence in the paragraph.
In the application, the paragraph proportion without the replacement word in the text, the sentence proportion without the replacement word in the text and the vocabulary proportion without the replacement word in the text can be combined to generate the chapter level replacement word feature vector of the text.
In this step, the number of sentences with the replacement words in each paragraph and the number of the replacement words in each paragraph can be counted according to the replacement words in the text and the text information, and the paragraph level replacement word feature vector of the text can be obtained according to the number of sentences with the replacement words in each paragraph and the number of the replacement words in each paragraph.
As an alternative embodiment, the replacement word feature vector of a paragraph may consist of the proportion of sentences without replacement words in the paragraph and the proportion of non-replacement words in the paragraph, wherein:
proportion of sentences without replacement words in a paragraph = 1 - (number of sentences containing replacement words in the paragraph) / (total number of sentences in the paragraph);
proportion of non-replacement words in a paragraph = 1 - (number of replacement words in the paragraph) / (total number of words in the paragraph), where the total number of words in the paragraph is the sum of the number of words contained in each of its sentences.
In the present application, the proportion of sentences without replacement words in a paragraph and the proportion of non-replacement words in that paragraph can be combined to obtain the replacement word feature vector of the paragraph, and the replacement word feature vectors of all paragraphs are spliced to obtain the paragraph level replacement word feature vector of the text.
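For illustration only, the per-level computations described above can be sketched in Python as follows. The nested-list representation of a text (paragraphs made of sentences made of words) and the function names are assumptions introduced solely for this sketch, not part of the disclosed embodiment.

```python
from typing import List, Set

# Assumed representation for this sketch: a text is a list of paragraphs,
# a paragraph is a list of sentences, and a sentence is a list of words.
Text = List[List[List[str]]]

def chapter_level_vector(text: Text, replacement_words: Set[str]) -> List[float]:
    """Chapter level replacement word feature vector:
    [proportion of paragraphs without replacement words,
     proportion of sentences without replacement words,
     proportion of non-replacement words]."""
    num_paragraphs = len(text)
    num_sentences = sum(len(p) for p in text)
    num_words = sum(len(s) for p in text for s in p)
    paragraphs_with = sum(1 for p in text
                          if any(w in replacement_words for s in p for w in s))
    sentences_with = sum(1 for p in text for s in p
                         if any(w in replacement_words for w in s))
    replacement_count = sum(1 for p in text for s in p for w in s
                            if w in replacement_words)
    return [1 - paragraphs_with / num_paragraphs,
            1 - sentences_with / num_sentences,
            1 - replacement_count / num_words]

def paragraph_level_vector(text: Text, replacement_words: Set[str]) -> List[float]:
    """Paragraph level replacement word feature vector: the per-paragraph
    vectors [proportion of sentences without replacement words,
    proportion of non-replacement words] spliced together."""
    vector: List[float] = []
    for p in text:
        num_sentences = len(p)
        num_words = sum(len(s) for s in p)
        sentences_with = sum(1 for s in p
                             if any(w in replacement_words for w in s))
        replacement_count = sum(1 for s in p for w in s
                                if w in replacement_words)
        vector += [1 - sentences_with / num_sentences,
                   1 - replacement_count / num_words]
    return vector
```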
For ease of understanding, the following examples are given in the present disclosure:
Assume that a text has 5 paragraphs, each paragraph has 10 sentences, and the total number of words is 500; replacement words appear in 3 paragraphs and in 15 sentences, and there are 20 replacement words in total.
Then, the proportion of paragraphs without replacement words in the text = 1 - 3/5 = 2/5;
the proportion of sentences without replacement words in the text = 1 - 15/50 = 7/10;
the proportion of non-replacement words in the text = 1 - 20/500 = 24/25;
so the chapter level replacement word feature vector of the text is [2/5, 7/10, 24/25].
Assume that the replacement word feature vectors of the 5 paragraphs in the text are [1/10,2/5], [0,0], [3/10,2/5], [1/20,1/7], [3/20,3/7], respectively; splicing these 5 per-paragraph vectors gives the paragraph level replacement word feature vector of the text: [1/10, 2/5, 0, 0, 3/10, 2/5, 1/20, 1/7, 3/20, 3/7].
S1004: and splicing the text chapter level replacement word feature vector with the text paragraph level replacement word feature vector to obtain the text replacement word feature vector.
For ease of understanding, the following examples are given in the present disclosure:
Splicing the chapter level replacement word feature vector of the text obtained in step S1003 with the paragraph level replacement word feature vector of the text yields the replacement word feature vector of the text: [2/5, 7/10, 24/25, 1/10, 2/5, 0, 0, 3/10, 2/5, 1/20, 1/7, 3/20, 3/7].
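Under the same assumptions as the earlier sketch, the splicing in step S1004 can be checked against the worked example; the chapter level counts are taken directly from the example, and only the arithmetic is shown here.

```python
# Counts taken from the worked example above.
num_paragraphs, num_sentences, num_words = 5, 50, 500
paragraphs_with, sentences_with, replacement_count = 3, 15, 20

chapter_vector = [1 - paragraphs_with / num_paragraphs,   # 2/5
                  1 - sentences_with / num_sentences,     # 7/10
                  1 - replacement_count / num_words]      # 24/25

# Per-paragraph vectors as given in the example, already spliced.
paragraph_vector = [1/10, 2/5, 0, 0, 3/10, 2/5, 1/20, 1/7, 3/20, 3/7]

# Step S1004: splice the chapter level and paragraph level vectors.
replacement_word_feature_vector = chapter_vector + paragraph_vector
print(replacement_word_feature_vector)
# [0.4, 0.7, 0.96, 0.1, 0.4, 0, 0, 0.3, 0.4, 0.05, 0.142857..., 0.15, 0.428571...]
```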
The device for recognizing misclassification words disclosed in the embodiments of the present application is described below; the device described below and the method for recognizing misclassification words described above may be referred to correspondingly with each other.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a device for recognizing a misclassification word according to an embodiment of the present application. As shown in fig. 4, the misclassification word recognition device may include:
an obtaining unit 11, configured to obtain an original word sub-string in an original text to be identified, where the original word sub-string is generated by continuous word combinations in the original text;
a determining unit 12, configured to determine a new multi-word corresponding to the original text according to the original word substring;
and a recognition unit 13, configured to recognize a misclassification word in the original text based on the newly added multi-word.
Optionally, the determining unit includes:
a new word sub-string obtaining unit, configured to obtain a new word sub-string corresponding to the original word sub-string, where the length and the tone-free pinyin of the new word sub-string are the same as those of the original word sub-string;
a new sentence acquisition unit, configured to replace the original word sub-string with the new word sub-string, to obtain a new sentence corresponding to the original text;
a multi-word obtaining unit, configured to obtain a multi-word of the new sentence corresponding to the original text;
and a new multi-word determining unit, configured to determine the new multi-word from the multi-word of the new sentence corresponding to the original text.
Optionally, the new sentence acquisition unit includes:
a new sentence to be selected obtaining unit, configured to replace the original word sub-string with the new word sub-string, to obtain a new sentence to be selected corresponding to the original text;
and a new sentence selecting unit, configured to select the new sentence corresponding to the original text from the new sentences to be selected.
Optionally, the new multi-word determining unit includes:
a word frequency acquisition unit, configured to acquire a word frequency of each multi-word in the multi-word of the new sentence corresponding to the original text;
the new multi-word selecting unit is used for arranging the multi-words of the new sentences corresponding to the original text in a descending order according to word frequency, and selecting a preset number of multi-words which are ranked at the front as the new multi-word.
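For illustration only, the word-frequency ranking performed by the new multi-word selecting unit might look like the following minimal Python sketch; the function name and the flat list of multi-word occurrences as input are assumptions made for this sketch.

```python
from collections import Counter
from typing import Iterable, List

def select_new_multi_words(multi_word_occurrences: Iterable[str],
                           preset_number: int) -> List[str]:
    """Rank the candidate multi-words of the new sentences by word frequency
    in descending order and keep the top preset_number as newly added
    multi-words."""
    frequency = Counter(multi_word_occurrences)
    return [word for word, _ in frequency.most_common(preset_number)]
```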
Optionally, the identifying unit includes:
a target new sentence selecting unit, configured to select a target new sentence from the new sentences corresponding to the original text, where the target new sentence is a new sentence containing the new added multi-word;
a replacement word obtaining unit, configured to obtain a replacement word corresponding to the original text based on the target new sentence;
and the error word determining unit is used for determining that the character substring corresponding to the replacement word in the original text is the error word in the original text.
Optionally, the replacement word obtaining unit includes:
a new text obtaining unit, configured to replace an original sentence corresponding to the target new sentence in the original text with the target new sentence, and obtain a new text corresponding to the original text;
a new text newly added multi-word obtaining unit, configured to obtain the newly added multi-word in the new text corresponding to the original text;
and the replacement word determining unit is used for determining the replacement word corresponding to the original text according to the new multi-word in the new text corresponding to the original text.
Optionally, the new text obtaining unit includes:
the new text set to be selected generating unit is used for replacing an original sentence corresponding to the target new sentence in the original text by the target new sentence to generate a new text set to be selected;
a similarity calculating unit, configured to calculate a similarity between each new text in the new text set to be selected and the original text;
and a new text selection unit, configured to select, according to the similarity between each new text and the original text, the new text corresponding to the original text from the new text set to be selected.
Optionally, the similarity calculating unit includes:
the obtaining unit is used for obtaining the text information feature vector of each new text and the feature characterization vector of each new text; acquiring a text information feature vector of the original text and a feature characterization vector of the original text;
the computing unit is used for computing the similarity between each new text and the original text according to the text information feature vector of each new text, the feature characterization vector of each new text, the text information feature vector of the original text and the feature characterization vector of the original text.
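The application does not fix a particular similarity formula here. Purely as an illustration, one common choice would be a cosine similarity over the concatenated feature vectors, as in the following sketch; the function names and the concatenation strategy are assumptions, not the disclosed computation.

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def text_similarity(new_info_vec: List[float], new_char_vec: List[float],
                    orig_info_vec: List[float], orig_char_vec: List[float]) -> float:
    """One possible combination: concatenate the text information feature
    vector with the feature characterization vector of each text, then
    compare the two concatenations with cosine similarity."""
    return cosine_similarity(new_info_vec + new_char_vec,
                             orig_info_vec + orig_char_vec)
```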
Optionally, the acquiring unit is specifically configured to:
acquiring text information feature vectors, new vocabulary feature vectors and replacement word feature vectors;
and generating the feature characterization vector according to the text information feature vector, the new vocabulary feature vector and the replacement word feature vector.
Optionally, the new text selection unit is specifically configured to:
arranging the new texts to be selected in descending order according to the similarity, and selecting a preset number of new texts ranked at the front and/or a preset number of new texts ranked at the rear as the new text corresponding to the original text.
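As an illustration of the selection just described, the following minimal Python sketch sorts the candidate new texts by similarity and keeps the front- and/or rear-ranked ones; the function and parameter names (front_count, back_count) are assumptions, and the similarity values are taken as already computed.

```python
from typing import List, Sequence

def select_new_texts(candidates: Sequence[str],
                     similarities: Sequence[float],
                     front_count: int = 1,
                     back_count: int = 0) -> List[str]:
    """Sort the candidate new texts by similarity to the original text in
    descending order, then keep the front_count highest-ranked and/or the
    back_count lowest-ranked candidates."""
    ranked = sorted(zip(candidates, similarities),
                    key=lambda pair: pair[1], reverse=True)
    texts = [text for text, _ in ranked]
    selected = list(texts[:front_count])
    if back_count:
        selected += texts[-back_count:]
    return selected
```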
Fig. 5 is a block diagram of a hardware structure of a system for recognizing a misclassification word according to an embodiment of the present application, and referring to fig. 5, the hardware structure of the system for recognizing a misclassification word may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete the communication with each other through the communication bus 4;
processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application, etc.;
the memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
Acquiring an original word sub-string in an original text to be identified, wherein the original word sub-string is generated by continuous word combination in the original text;
determining a new multi-word corresponding to the original text according to the original word substring;
and identifying the misclassified words in the original text based on the newly added multi-word.
Optionally, for the refined functions and extended functions of the program, reference may be made to the description above.
The embodiment of the present application also provides a storage medium storing a program adapted to be executed by a processor, the program being configured to:
acquiring an original word sub-string in an original text to be identified, wherein the original word sub-string is generated by continuous word combination in the original text;
determining a new multi-word corresponding to the original text according to the original word substring;
and identifying the misclassified words in the original text based on the newly added multi-word.
Optionally, for the refined functions and extended functions of the program, reference may be made to the description above.
Finally, it is further noted that relational terms such as first and second, and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for identical or similar parts between the embodiments, reference may be made to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for recognizing a misclassification word, comprising:
acquiring an original word sub-string in an original text to be identified, wherein the original word sub-string is generated by continuous word combination in the original text;
determining a new multi-word corresponding to the original text according to the original word substring;
identifying a misclassification word in the original text based on the newly added multi-word;
Wherein the determining, according to the original word sub-string, a new multi-word corresponding to the original text includes:
acquiring a new word sub-string corresponding to the original word sub-string, wherein the length and the tone-free pinyin of the new word sub-string are the same as those of the original word sub-string;
replacing the original word sub-string with the new word sub-string to obtain a new sentence corresponding to the original text;
acquiring the multi-word of the new sentence corresponding to the original text;
and determining the new multi-word from the multi-words of the new sentence corresponding to the original text.
2. The method of claim 1, wherein replacing the original word sub-string with the new word sub-string results in a new sentence corresponding to the original text, comprising:
replacing the original word substring with the new word substring to obtain a new sentence to be selected, which corresponds to the original text;
and selecting a new sentence corresponding to the original text from the new sentences to be selected.
3. The method of claim 1, wherein the determining the new multi-word from the multi-word of the new sentence corresponding to the original text comprises:
acquiring word frequency of each multi-word in the multi-word of the new sentence corresponding to the original text;
arranging the multi-words of the new sentence corresponding to the original text in descending order according to word frequency, and selecting a preset number of multi-words ranked at the front as the new multi-words.
4. The method of claim 2, wherein the identifying the misclassification word in the original text based on the newly added multi-word comprises:
selecting a target new sentence from the new sentences corresponding to the original text, wherein the target new sentence is a new sentence containing the new added multi-word;
acquiring a replacement word corresponding to the original text based on the target new sentence;
and determining the character substring corresponding to the replacement word in the original text as the misclassification word in the original text.
5. The method of claim 4, wherein the obtaining, based on the target new sentence, a replacement term corresponding to the original text comprises:
replacing an original sentence corresponding to the target new sentence in the original text with the target new sentence to obtain a new text corresponding to the original text;
acquiring new multi-word in the new text corresponding to the original text;
and determining the replacement word corresponding to the original text according to the new multi-word in the new text corresponding to the original text.
6. The method of claim 5, wherein replacing the original sentence corresponding to the target new sentence in the original text with the target new sentence, obtaining the new text corresponding to the original text, comprises:
replacing an original sentence corresponding to the target new sentence in the original text by the target new sentence to generate a new text set to be selected;
calculating the similarity between each new text in the new text set to be selected and the original text;
and selecting the new text corresponding to the original text from the new text set to be selected according to the similarity between each new text and the original text.
7. The method of claim 6, wherein the calculating the similarity of each new text in the set of candidate new texts to the original text comprises:
acquiring a text information feature vector of each new text and a feature characterization vector of each new text;
acquiring a text information feature vector of the original text and a feature characterization vector of the original text;
and calculating the similarity between each new text and the original text according to the text information feature vector of each new text, the feature characterization vector of each new text, the text information feature vector of the original text and the feature characterization vector of the original text.
8. The method of claim 7, wherein obtaining a feature characterization vector comprises:
acquiring text information feature vectors, new vocabulary feature vectors and replacement word feature vectors;
and generating the feature characterization vector according to the text information feature vector, the new vocabulary feature vector and the replacement word feature vector.
9. The method of claim 6, wherein selecting the new text corresponding to the original text from the set of new texts to be selected according to the similarity between each new text and the original text, comprises:
and arranging the new texts to be selected in descending order according to the similarity, and selecting a preset number of new texts ranked at the front and/or a preset number of new texts ranked at the rear as the new text corresponding to the original text.
10. A device for recognizing a misclassification word, comprising:
the device comprises an acquisition unit, a recognition unit and a recognition unit, wherein the acquisition unit is used for acquiring an original word sub-string in an original text to be recognized, and the original word sub-string is generated by continuous word combination in the original text;
the determining unit is used for determining a new multi-word corresponding to the original text according to the original word substring;
the recognition unit is used for recognizing the misclassified words in the original text based on the newly added multi-words;
wherein the determining unit includes:
a new word sub-string obtaining unit, configured to obtain a new word sub-string corresponding to the original word sub-string, where the length and the tone-free pinyin of the new word sub-string are the same as those of the original word sub-string;
a new sentence acquisition unit, configured to replace the original word sub-string with the new word sub-string, to obtain a new sentence corresponding to the original text;
a multi-word obtaining unit, configured to obtain a multi-word of the new sentence corresponding to the original text;
and the new multi-word determining unit is used for determining the new multi-word from the multi-word of the new sentence corresponding to the original text.
11. A system for recognizing a misclassification word, comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the respective steps of the method for recognizing a misclassification word as claimed in any one of claims 1 to 9.
12. A readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method for recognizing a misclassification word as claimed in any one of claims 1 to 9.
CN201911293645.8A 2019-12-16 2019-12-16 Error word recognition method, related device and readable storage medium Active CN111209748B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911293645.8A CN111209748B (en) 2019-12-16 2019-12-16 Error word recognition method, related device and readable storage medium

Publications (2)

Publication Number Publication Date
CN111209748A CN111209748A (en) 2020-05-29
CN111209748B (en) 2023-10-24


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1340804A (en) * 2000-08-30 2002-03-20 国际商业机器公司 Automatic new term fetch method and system
CN104991889A (en) * 2015-06-26 2015-10-21 江苏科技大学 Fuzzy word segmentation based non-multi-character word error automatic proofreading method
CN107203511A (en) * 2017-05-27 2017-09-26 中国矿业大学 A kind of network text name entity recognition method based on neutral net probability disambiguation
WO2019214145A1 (en) * 2018-05-10 2019-11-14 平安科技(深圳)有限公司 Text sentiment analyzing method, apparatus and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Yangsen; Cao Yuanda; Yu Shiwen. Automatic error-checking model and algorithm for Chinese text based on the combination of rules and statistics. Journal of Chinese Information Processing, 2006, (04), full text. *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant