CN111209748B - Error word recognition method, related device and readable storage medium - Google Patents

Error word recognition method, related device and readable storage medium

Info

Publication number
CN111209748B
Authority
CN
China
Prior art keywords: word, new, text, original, original text
Prior art date
Legal status
Active
Application number
CN201911293645.8A
Other languages
Chinese (zh)
Other versions
CN111209748A (en)
Inventor
王培养
方昕
李永超
刘海波
方磊
Current Assignee
Hefei Ustc Iflytek Co ltd
Original Assignee
Hefei Ustc Iflytek Co ltd
Priority date
Filing date
Publication date
Application filed by Hefei Ustc Iflytek Co ltd
Priority to CN201911293645.8A
Publication of CN111209748A
Application granted
Publication of CN111209748B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses an error word recognition method, a related device and a readable storage medium. An original text to be recognized may contain runs of consecutive single words caused by one or more word-writing errors. In the method, original word substrings are generated by combining consecutive single words in the original text, newly added multi-words corresponding to the original text are determined according to the original word substrings, and the error words in the original text are identified based on the newly added multi-words. Therefore, based on the scheme of the application, the recognition of error words in text can be realized.

Description

Error word recognition method, related device and readable storage medium
Technical Field
The present application relates to the field of natural language processing, and more particularly, to an error word recognition method, a related device, and a readable storage medium.
Background
With the rapid development of information processing technology and the internet, traditional text work has been almost completely replaced by computers, and electronic texts such as electronic books, electronic newspapers, electronic mails, office documents, blogs and microblogs have become part of people's daily life. At the same time, wrongly written words in text are becoming more and more common, and finding them is an important aspect of text quality inspection.
Therefore, how to identify error words in text is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the foregoing, the present application provides a method, related apparatus, and readable storage medium for recognizing a misclassification word. The specific scheme is as follows:
a method of identifying a misclassification word, comprising:
acquiring an original word sub-string in an original text to be identified, wherein the original word sub-string is generated by continuous word combination in the original text;
determining a new multi-word corresponding to the original text according to the original word substring;
and identifying the misclassified words in the original text based on the newly added multi-word.
Optionally, the determining, according to the original word substring, a new added multi-word corresponding to the original text includes:
acquiring a new word sub-string corresponding to the original word sub-string, wherein the length and the tone-free pinyin of the new word sub-string and the original word sub-string are the same;
replacing the original word sub-string with the new word sub-string to obtain a new sentence corresponding to the original text;
acquiring the multi-word of the new sentence corresponding to the original text;
and determining the new multi-word from the multi-word words of the new sentence corresponding to the original text.
Optionally, the replacing the original word sub-string with the new word sub-string to obtain a new sentence corresponding to the original text includes:
replacing the original word substring with the new word substring to obtain a new sentence to be selected, which corresponds to the original text;
and selecting a new sentence corresponding to the original text from the new sentences to be selected.
Optionally, the determining the new multi-word from the multi-word words of the new sentence corresponding to the original text includes:
acquiring word frequency of each multi-word in the multi-word of the new sentence corresponding to the original text;
and arranging the multi-word words of the new sentence corresponding to the original text in a descending order according to word frequency, and selecting a preset number of multi-word words which are ranked at the front as the new multi-word words.
Optionally, the identifying the misclassified word in the original text based on the newly added multi-word includes:
selecting a target new sentence from the new sentences corresponding to the original text, wherein the target new sentence is a new sentence containing the new added multi-word;
acquiring a replacement word corresponding to the original text based on the target new sentence;
And determining the character substring corresponding to the replacement word in the original text as the misclassification word in the original text.
Optionally, the acquiring, based on the target new sentence, a replacement word corresponding to the original text includes:
replacing an original sentence corresponding to the target new sentence in the original text with the target new sentence to obtain a new text corresponding to the original text;
acquiring new multi-word in the new text corresponding to the original text;
and determining the replacement word corresponding to the original text according to the new multi-word in the new text corresponding to the original text.
Optionally, the replacing the original sentence corresponding to the target new sentence in the original text with the target new sentence to obtain a new text corresponding to the original text includes:
replacing an original sentence corresponding to the target new sentence in the original text by the target new sentence to generate a new text set to be selected;
calculating the similarity between each new text in the new text set to be selected and the original text;
and selecting the new text corresponding to the original text from the new text set to be selected according to the similarity between each new text and the original text.
Optionally, the calculating the similarity between each new text in the new text set to be selected and the original text includes:
acquiring a text information feature vector of each new text and a feature characterization vector of each new text;
acquiring a text information feature vector of the original text and a feature characterization vector of the original text;
and calculating the similarity between each new text and the original text according to the text information feature vector of each new text, the feature characterization vector of each new text, the text information feature vector of the original text and the feature characterization vector of the original text.
Optionally, obtaining the feature characterization vector includes:
acquiring text information feature vectors, new vocabulary feature vectors and replacement word feature vectors;
and generating the feature characterization vector according to the text information feature vector, the new vocabulary feature vector and the replacement word feature vector.
Optionally, the selecting the new text corresponding to the original text from the new text set to be selected according to the similarity between each new text and the original text includes:
and arranging the new texts to be selected in a descending order according to the similarity, and selecting a preset number of new texts with the front sequence and/or selecting a preset number of new texts with the rear sequence as the new texts corresponding to the original texts.
A misclassification word recognition device, comprising:
the device comprises an acquisition unit, a recognition unit and a recognition unit, wherein the acquisition unit is used for acquiring an original word sub-string in an original text to be recognized, and the original word sub-string is generated by continuous word combination in the original text;
the determining unit is used for determining a new multi-word corresponding to the original text according to the original word substring;
and the identification unit is used for identifying the misclassified words in the original text based on the newly added multi-word words.
Optionally, the determining unit includes:
a new word sub-string obtaining unit, configured to obtain a new word sub-string corresponding to the original word sub-string, where the length and the tone-free pinyin of the new word sub-string are the same as those of the original word sub-string;
a new sentence acquisition unit, configured to replace the original word sub-string with the new word sub-string, to obtain a new sentence corresponding to the original text;
a multi-word obtaining unit for obtaining the multi-word of the new sentence corresponding to the original text;
and the new multi-word determining unit is used for determining the new multi-word from the multi-word of the new sentence corresponding to the original text.
Optionally, the new sentence acquisition unit includes:
a new sentence to be selected obtaining unit, configured to replace the original word sub-string with the new word sub-string, to obtain a new sentence to be selected corresponding to the original text;
And a new sentence selecting unit, configured to select the new sentence corresponding to the original text from the new sentences to be selected.
Optionally, the new multi-word determining unit includes:
a word frequency acquisition unit, configured to acquire a word frequency of each multi-word in the multi-word of the new sentence corresponding to the original text;
the new multi-word selecting unit is used for arranging the multi-words of the new sentences corresponding to the original text in a descending order according to word frequency, and selecting a preset number of multi-words which are ranked at the front as the new multi-word.
Optionally, the identifying unit includes:
a target new sentence selecting unit, configured to select a target new sentence from the new sentences corresponding to the original text, where the target new sentence is a new sentence containing the new added multi-word;
a replacement word obtaining unit, configured to obtain a replacement word corresponding to the original text based on the target new sentence;
and the error word determining unit is used for determining that the character substring corresponding to the replacement word in the original text is the error word in the original text.
Optionally, the replacement word obtaining unit includes:
A new text obtaining unit, configured to replace an original sentence corresponding to the target new sentence in the original text with the target new sentence, and obtain a new text corresponding to the original text;
a new text newly increased word obtaining unit, configured to obtain a new increased word in the new text corresponding to the original text;
and the replacement word determining unit is used for determining the replacement word corresponding to the original text according to the new multi-word in the new text corresponding to the original text.
Optionally, the new text obtaining unit includes:
the new text set to be selected generating unit is used for replacing an original sentence corresponding to the target new sentence in the original text by the target new sentence to generate a new text set to be selected;
a similarity calculating unit, configured to calculate a similarity between each new text in the new text set to be selected and the original text;
and a new text selection unit, configured to select, according to the similarity between each new text and the original text, the new text corresponding to the original text from the new text set to be selected.
Optionally, the similarity calculating unit includes:
The obtaining unit is used for obtaining the text information feature vector of each new text and the feature characterization vector of each new text; acquiring a text information feature vector of the original text and a feature characterization vector of the original text;
the computing unit is used for computing the similarity between each new text and the original text according to the text information feature vector of each new text, the feature characterization vector of each new text, the text information feature vector of the original text and the feature characterization vector of the original text.
Optionally, the acquiring unit is specifically configured to:
acquiring text information feature vectors, new vocabulary feature vectors and replacement word feature vectors;
and generating the feature characterization vector according to the text information feature vector, the new vocabulary feature vector and the replacement word feature vector.
Optionally, the new text selection unit is specifically configured to:
and arranging the new texts to be selected in a descending order according to the similarity, and selecting a preset number of new texts with the front sequence and/or selecting a preset number of new texts with the rear sequence as the new texts corresponding to the original texts.
A misclassification word recognition system comprising a memory and a processor;
The memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the method for recognizing a misclassification word as described above.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a method for identifying a misclassification word as described above.
By means of the above technical scheme, the application discloses an error word recognition method, related equipment and a readable storage medium. The original text to be recognized may contain runs of consecutive single words caused by one or more word-writing errors. In the method, an original word substring is generated by combining consecutive single words in the original text, newly added multi-words corresponding to the original text are determined according to the original word substring, and the error words in the original text are identified based on the newly added multi-words. Therefore, based on the scheme of the application, the recognition of error words in text can be realized.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a schematic flow chart of a method for recognizing a misclassification word according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a method for obtaining a new word sub-string corresponding to an original word sub-string according to an embodiment of the present application;
FIG. 3 is a schematic diagram of obtaining at least one new sentence corresponding to each original sentence according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a device for recognizing a misclassification word according to an embodiment of the present application;
fig. 5 is a block diagram of a hardware structure of a system for recognizing a misclassification word according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
For reference and clarity, the terms used in the embodiments of the present application are explained as follows:
Word: a single character (single word);
Multi-word: a word formed of 2 or more characters;
Word substring: a string of consecutive words that do not form a multi-word; it may consist of 1, 2 or more characters;
Pinyin substring: the sequence formed by the toneless pinyin of each word in a word substring;
Error word: a miswritten word or sensitive word in a text, especially one formed of 2 or more consecutive words;
General vocabulary dictionary: a dictionary formed by general-purpose words of 2 or more characters;
Business vocabulary dictionary: a dictionary formed by domain-specific words of 2 or more characters;
Stop word dictionary: a general word resource of words of 2 or more characters;
Chinese character-pinyin resource: a resource for looking up the pinyin or toneless pinyin corresponding to a Chinese character;
Pinyin-Chinese character resource: a resource for looking up all Chinese characters corresponding to a given pinyin or toneless pinyin;
Toneless pinyin (non-modulated pinyin): pinyin without tone marks;
Multi-word dictionary: a dictionary formed by words of 2 or more characters, as shown in the following table:
snowman fight
Snowman fort
Ice hockey
Error word mapping resource: a resource that establishes the relationship between an error word and the word actually expressing the intended meaning, as shown in the following table:
Error word | Correct word
School bus station | Snowman fight
Bag epitaxy for school students | Snowman fort
Aronia melanocarpa (Roxb.) kuntze | Ice hockey
The error word recognition method disclosed in the present application will be described in detail by the following embodiments.
Referring to fig. 1, fig. 1 is a schematic flow chart of an error word recognition method disclosed in an embodiment of the present application, which includes the following steps:
s101: and acquiring an original word substring in the original text to be identified.
In the present application, the original word substring is generated from consecutive single words in the original text. There are various ways to obtain the original word substrings in the original text: for example, if the original word substrings of the original text to be recognized have already been generated in advance, they can be obtained directly; if they have not yet been generated, they can be obtained by generating them.
S102: and determining a new multi-word corresponding to the original text according to the original word substring.
In the present application, for each original word substring, the newly added multi-word determined based on that original word substring is a multi-word that does not exist in the original sentence at the corresponding position in the original text. It should be noted that, in the present application, all the multi-words corresponding to an original word substring may be determined according to the original word substring and compared with the multi-words of the original sentence in which the original word substring is located; the multi-words that appear among the former but not among the multi-words of the original sentence are taken as the newly added multi-words corresponding to the original text. Alternatively, multi-words satisfying a certain condition may be further selected, from among the multi-words that are not in the original sentence, as the newly added multi-words corresponding to the original text; the present application does not impose any limitation on this.
S103: and identifying the misclassified words in the original text based on the newly added multi-word.
In the present application, the word substring corresponding to a newly added multi-word in the original text can be determined to be an error word in the original text. Alternatively, word substrings meeting certain conditions can be further screened, from among the word substrings corresponding to the newly added multi-words in the original text, as the error words in the original text; the present application does not impose any limitation on this.
In the method, original word substrings are generated by combining consecutive single words in the original text, newly added multi-words corresponding to the original text are determined according to the original word substrings, and the consecutive words corresponding to the newly added multi-words in the original text can then be identified as error words. Therefore, based on the scheme of the application, the recognition of error words in text can be realized.
As an embodiment, in the present application, a method for generating an original word sub-string of an original text is disclosed, which may specifically include the following steps:
S201: and segmenting each original sentence in the original text to obtain a word sequence of each original sentence.
In the application, a word segmentation model trained in advance can be adopted to segment each original sentence in the original text, so as to obtain a word sequence of each original sentence.
For ease of understanding, assume that one of the original sentences of one original paragraph in one original text is: "little child and child are standing at school, influence the process of the small flower's Aronia melanocarpa and the construction of the bag of the child of the Arman. "for this original sentence, word segmentation processing is performed on the original sentence, so as to obtain the word sequence of the original sentence, which is specifically" Xiaoming\and\darling\positive\in\learning\man\standing\, the method comprises the steps of affecting the process of the microflower, the stem, the process of the catalpa, the learning of the Arman, the inclusion of the epitaxy and the construction. And \is carried out.
It should be noted that, based on the word sequences of all the original sentences, the word sequences of the original text can be obtained.
For ease of understanding, suppose a piece of original text is divided, according to its paragraph structure, into m original paragraphs P1, P2, ..., Pi, ..., Pm. An original paragraph Pi is split by sentence-delimiting symbols such as ";", "。", "!" and "?" into ni original sentences Si1, ..., Sij, ..., Sini. Word segmentation is performed on each original sentence Sij to obtain single words and multi-words, and these single words and multi-words form the word sequence of the original text. In the present application, i denotes the paragraph index and is an integer with 1 ≤ i ≤ m; j denotes the sentence index and is an integer with 1 ≤ j ≤ ni; t denotes the word index and is an integer with 1 ≤ t ≤ pij, pij being the number of words in the sentence Sij.
S202: and matching the word sequence of each original sentence with a preset multi-word dictionary, and determining the words in the word sequence of each original sentence.
In the present application, the preset multi-word dictionary may specifically be any one or more of the general vocabulary dictionary, the business vocabulary dictionary, the stop word dictionary and the like. By matching the word sequence of each original sentence against the preset multi-word dictionary, the successfully matched multi-words and the unmatched single words in the word sequence of each original sentence are determined. It should be noted that the multi-words of all the original sentences constitute the multi-words of the original text.
For easy understanding, the word sequences of the original sentences are named' Xiaoming\and\non-child\at\learning\man\station\, the method comprises the steps of affecting the process of the microflower, the stem, the process of the catalpa, the learning of the Arman, the inclusion of the epitaxy and the construction. After matching the \ "with the preset multi-word dictionary, the words in the word sequence of the original sentence can be determined as" sum "," positive "," beating "," learning "," person "," standing "," handle "," stick "," catalpa "," and "," learning "," person "," bag "," epitaxy ".
S203: and merging continuous words in the word sequence of each original sentence to obtain an original word substring in the original text.
In the application, the original word substring in each original sentence can be obtained by merging the continuous words in the word sequence of each original sentence, and for convenience of understanding, the continuous words of the word sequences of the original sentences are respectively combined in the ways of beating, learning, standing, handle, stick, mansari, epitaxy, so that the original word substrings in the original text can be obtained: "and", "standing in the middle of school", "mountain ash", "and", "school bag epitaxy", ". ".
It should be noted that, based on the original word substrings in all the original sentences, the original word substrings in the original text can be obtained. In the present application, SCijk can be used to represent an original word substring in the original text, where k denotes the index of the original word substring and is an integer with 1 ≤ k ≤ qij.
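To make steps S201 to S203 concrete, the following sketch shows one possible way of producing original word substrings. It is an illustrative assumption rather than the patented implementation: the segmenter, the dictionary contents and the helper names (segment, MULTI_WORD_DICT, extract_original_word_substrings) are invented for the example.

```python
from typing import List

# Assumed toy multi-word dictionary; the patent uses general/business/stop-word dictionaries.
MULTI_WORD_DICT = {"影响", "进程", "建造"}

def segment(sentence: str, max_len: int = 4) -> List[str]:
    """Naive forward maximum-match segmenter, used only to keep the sketch self-contained;
    the patent assumes a pre-trained word segmentation model instead."""
    tokens, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 1, -1):
            if sentence[i:i + length] in MULTI_WORD_DICT:
                tokens.append(sentence[i:i + length])
                i += length
                break
        else:
            tokens.append(sentence[i])  # single word that formed no multi-word
            i += 1
    return tokens

def extract_original_word_substrings(sentence: str) -> List[str]:
    """S201-S203: segment the sentence, keep dictionary multi-words apart, and merge
    runs of consecutive single words into original word substrings."""
    substrings, run = [], []
    for token in segment(sentence):
        if len(token) >= 2:          # S202: token matched the preset multi-word dictionary
            if run:
                substrings.append("".join(run))  # S203: close the current run of single words
                run = []
        else:
            run.append(token)        # consecutive single words accumulate
    if run:
        substrings.append("".join(run))
    return substrings
```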
As an implementation manner, in the present application, an implementation manner of determining a new added multi-word corresponding to an original text according to an original word sub-string is disclosed, and the method may include the following steps:
S301: and acquiring a new word sub-string corresponding to the original word sub-string.
In the application, the new character substring has the same length and the tone-free pinyin as the original character substring. As an implementation manner, the present application discloses an implementation manner for obtaining a new word sub-string corresponding to an original word sub-string, which may include the following steps:
s401: and obtaining the non-modulated pinyin sequence corresponding to the original character substring.
In the present application, the original word substring can be split character by character to obtain the character sequence of the original word substring; for each character in this sequence, the toneless (non-modulated) pinyin is obtained by looking up the Chinese character-pinyin resource, and the toneless pinyin sequence corresponding to the original word substring is thus obtained.
S402: and acquiring a word network corresponding to the non-modulated pinyin sequence.
In the application, for each pinyin in the non-modulated pinyin sequence, all words corresponding to each pinyin are obtained by searching pinyin-Chinese character resources, and all words corresponding to all the pinyin are formed into a word network by taking the pinyin sequence as a time sequence.
S403: and extracting a word from each moment of the word network according to the time sequence relation of the non-modulated pinyin sequence, and combining to obtain a new word substring corresponding to the original word substring.
According to the application, a word can be extracted from each moment of the word network according to the time sequence relation of the non-modulated pinyin sequence for combination to obtain a new word sub-string corresponding to the original word sub-string, and all the new word sub-strings corresponding to the original word sub-string can be obtained through multiple combinations.
For easy understanding, referring to fig. 2, fig. 2 is a schematic diagram of a method for obtaining a new word string corresponding to an original word string according to an embodiment of the present application, in fig. 2, the "stick-stick" of the original word string "is taken as an example to describe, and as can be seen in fig. 2, the" stick-stick "of the word string" can be split into a word sequence "; by searching the Chinese character-pinyin resource, the pinyin sequence 'de/di-bing-gun-qiu' of the character sequence is obtained; looking up pinyin-Chinese character resources to obtain all the characters of each pinyin and generate a character network. By connecting any word in each row of the word network with any word in the next moment, a new word substring of the word substring 'Aronia melanocarpa' can be generated, specifically, a 'hockey ball, cake roller fur, and jockey roller …' shown in fig. 2.
Based on the above, a new word sub-string corresponding to each original word sub-string can be obtained.
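A minimal sketch of steps S401 to S403 follows. The toy CHAR_TO_PINYIN and PINYIN_TO_CHARS tables stand in for the Chinese character-pinyin and pinyin-Chinese character resources, and excluding the original substring from the output is an added assumption.

```python
from itertools import product
from typing import Dict, List

# Assumed toy resources; a real system would load complete character-pinyin tables.
CHAR_TO_PINYIN: Dict[str, str] = {"的": "de", "冰": "bing", "棍": "gun", "球": "qiu"}
PINYIN_TO_CHARS: Dict[str, List[str]] = {
    "de": ["的", "得"],
    "bing": ["冰", "病", "饼"],
    "gun": ["棍", "滚"],
    "qiu": ["球", "求"],
}

def new_word_substrings(original_substring: str) -> List[str]:
    """S401-S403: candidates with the same length and the same toneless pinyin."""
    # S401: toneless pinyin sequence of the original word substring
    pinyin_seq = [CHAR_TO_PINYIN[ch] for ch in original_substring]
    # S402: word network, one candidate character list per pinyin position (time order)
    network = [PINYIN_TO_CHARS[py] for py in pinyin_seq]
    # S403: take one character per position and combine; all combinations are enumerated
    combos = ("".join(chars) for chars in product(*network))
    return [c for c in combos if c != original_substring]
```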
S302: and replacing the original word substring with the new word substring to obtain a new sentence corresponding to the original text.
In the application, at least one original word sub-string in each original sentence can be replaced by a new word sub-string corresponding to the at least one original word sub-string, so as to obtain at least one new sentence corresponding to each original sentence.
For ease of understanding, please refer to fig. 3, fig. 3 is a schematic diagram of obtaining at least one new sentence corresponding to each original sentence according to an embodiment of the present application, and in fig. 3, the original word strings "and", "standing on learning," and "," learning package "are used. As shown in fig. 3, the original word sub-strings are "and", "drink", "what" …, the original word sub-strings are "playing a snowman fight", "playing a learning man fight", "finishing a learning man fight" …, the new word sub-strings are "ice hockey", "sick roller haunch", "binker fur" …, the new word sub-strings are "and", "drink", "what" …, and the learning man packet of the original word sub-strings "is" snowman fort "," learning benevolence fort "," … ….
By the ' standing at the beginning of learning ' of the original character sub-string, the ' corresponding new character sub-string ' is in the battle of snowman, ' replacing ' the standing at the beginning of learning ', the ' replacing ' of the hockey stick of the corresponding new character sub-string ' of the original character sub-string ', and replacing the snowman fort of the new character string corresponding to the original character string by the snowman fort of the original character string to obtain a new sentence of ' little and child ' shown in figure 3, namely, the snowman fort which affects the progress of the small-flower hockey and the construction of the arom. By the same token, other new sentences, such as the new sentence 'little bright drinking child' in fig. 3, can be generated by replacing the corresponding original word substring with different new word substrings, and the disease stick process of floret and the learning fort construction of the aro are affected. "
In the present application, a new sentence corresponding to an original text may be obtained based on at least one new sentence corresponding to each original sentence, and as an implementation manner, a specific implementation manner of replacing the original word sub-string with the new word sub-string to obtain a new sentence corresponding to the original text is disclosed in the present application, that is, all new sentences corresponding to all original sentences are used as new sentences corresponding to the original text.
However, each original sentence may have a plurality of corresponding new sentences, and if all new sentences corresponding to all original sentences are used as new sentences corresponding to the original text, a large number of new sentences corresponding to the original text will be generated, and processing a large number of new sentences corresponding to the original text will affect the recognition efficiency of the misclassification words.
In order to improve the recognition efficiency of error words, as another implementation manner, the application discloses a specific implementation of replacing the original word substring with the new word substring to obtain the new sentence corresponding to the original text: all new sentences corresponding to all original sentences, obtained by replacing the original word substrings with the new word substrings, are taken as new sentences to be selected, and the new sentence corresponding to the original text is then selected from the new sentences to be selected.
It should be noted that, as can be seen from fig. 3, among the multiple new sentences corresponding to each original sentence, there are new sentences in which the new word substrings do not form any multi-word; such new sentences are of no help for identifying or correcting error words. There are also new sentences in which the new word substrings do form multi-words; these new sentences can be used for identifying error words and correcting them.
Therefore, as an implementation manner, the application discloses a realization method for selecting a new sentence corresponding to the original text from the new sentences to be selected, which specifically comprises the following steps: and selecting a new sentence containing the multi-word from the new sentences to be selected as a new sentence corresponding to the original text.
As still another embodiment, in the present application, there is further provided a method for selecting a new sentence corresponding to an original text from new sentences to be selected, which may specifically be as follows: and acquiring the confidence score of each original sentence and the confidence score of each new sentence corresponding to the original sentence, and selecting a new sentence with the confidence score larger than the confidence score of the original sentence from all new sentences corresponding to the original sentence as a new sentence corresponding to the original text. It should be noted that, in the present application, a pre-trained language model may be used to calculate a confidence score of each original sentence and a confidence score of each new sentence corresponding to the original sentence.
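The following sketch illustrates the confidence-based variant just described: new word substrings are substituted into the original sentence, and only candidates that a language model scores higher than the original are kept. The scorer is passed in as a parameter because the patent does not fix a particular model; all names are assumptions for illustration.

```python
from typing import Callable, Dict, List

def select_new_sentences(
    original_sentence: str,
    substitutions: Dict[str, List[str]],     # original word substring -> its new word substrings
    confidence: Callable[[str], float],      # assumed pre-trained language-model scorer
) -> List[str]:
    """S302: replace original word substrings with new word substrings, then keep the
    candidate new sentences whose confidence score exceeds that of the original sentence."""
    base_score = confidence(original_sentence)
    selected = []
    for old_substring, candidates in substitutions.items():
        for new_substring in candidates:
            # One substitution at a time; combining several substitutions is also possible.
            candidate_sentence = original_sentence.replace(old_substring, new_substring, 1)
            if confidence(candidate_sentence) > base_score:
                selected.append(candidate_sentence)
    return selected
```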
S303: and acquiring the multi-word of the new sentence corresponding to the original text.
In the present application, word segmentation may be performed on each new sentence corresponding to the original text to obtain the word sequence of each new sentence; the word sequence of each new sentence is matched against a preset multi-word dictionary to determine the multi-words in that new sentence, and the multi-words that are not among the multi-words of the corresponding original sentence are taken as the multi-words of that new sentence. All the multi-words of these new sentences together form the multi-words of the new sentences corresponding to the original text. The preset multi-word dictionary may be one or more of the general vocabulary dictionary, the business vocabulary dictionary, the stop word dictionary and the like.
S304: and determining the new multi-word from the multi-word words of the new sentence corresponding to the original text.
In the present application, a method for selecting the newly added multi-words from the multi-words of the new sentences corresponding to the original text is also provided, which may be as follows: acquiring the word frequency of each multi-word among the multi-words of the new sentences corresponding to the original text, arranging these multi-words in descending order of word frequency, and selecting a preset number of the top-ranked multi-words as the newly added multi-words.
in the present application, the preset number may be determined according to the specific situation, for example, K number, where K is an integer greater than or equal to 1 and less than or equal to the total number of the multiple words of the new sentence corresponding to the original text. Or the preset number may be determined according to a percentage of the total number of the multi-word words of the new sentence corresponding to the original text, for example, a value of 20% of the total number of the multi-word words of the new sentence corresponding to the original text is the preset number.
As an embodiment, in the present application, an implementation of identifying a misword in an original text based on a newly added multi-word is disclosed, which may include the steps of:
s501: and selecting a target new sentence from the new sentences corresponding to the original text.
In the application, the target new sentence is a new sentence containing the newly added multi-word in the new sentence corresponding to the original text.
S502: and acquiring a replacement word corresponding to the original text based on the target new sentence.
In the application, the replacement words corresponding to the original text can be all newly added multi-word words or part of newly added multi-word words. The specific implementation will be described in detail by the following examples, which will not be described in detail here.
S503: and determining the character substring corresponding to the replacement word in the original text as the misclassification word in the original text.
In the application, after the error word in the original text is determined, the error word in the original text can be further replaced by the corresponding replacement word so as to correct the error word, and in addition, the position information of the error word in the original text and the corresponding correction result can be further displayed so as to assist manual verification.
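A small sketch of steps S501 to S503 follows. It assumes that, when the new sentences were generated, the original word substring replaced by each replacement word was recorded; the mapping and the function name are illustrative assumptions.

```python
from typing import Dict, List, Tuple

def locate_error_words(
    original_text: str,
    replacement_to_substring: Dict[str, str],   # replacement word -> original word substring it replaced
) -> List[Tuple[str, int, str]]:
    """S503: the word substring that a replacement word replaced is reported as an error word,
    together with its position in the original text and the suggested correction."""
    results = []
    for replacement_word, error_substring in replacement_to_substring.items():
        position = original_text.find(error_substring)   # first occurrence, for illustration only
        if position != -1:
            results.append((error_substring, position, replacement_word))
    return results
```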
As an embodiment, in the present application, an implementation manner of acquiring a replacement word corresponding to an original text based on a target new sentence is disclosed, which may include the steps of:
s601: and replacing the original sentence corresponding to the target new sentence in the original text by the target new sentence, and obtaining a new text corresponding to the original text.
In the application, at least one original sentence in the original text can be replaced by a target new sentence corresponding to the at least one original sentence, so that all the new texts are obtained. As an embodiment, all the new texts mentioned above may be determined as new texts corresponding to the original text. However, if the subsequent processing is performed on all the new texts, the recognition efficiency of the misclassification words is low, and some of the new texts possibly contain new misclassification words, which can cause adverse effects on the misclassification word recognition of the original text, so that part of the new texts can be selected from all the new texts, and the misclassification words of the original text can be recognized based on the part of the new texts, thereby improving the misclassification word recognition efficiency and accuracy.
As an implementation manner, the application provides another implementation manner for obtaining the new text corresponding to the original text, which can improve the recognition efficiency and accuracy of the misclassification word, and specifically can be as follows: and replacing an original sentence corresponding to the target new sentence in the original text with the target new sentence, generating a new text set to be selected, calculating the similarity between each new text in the new text set to be selected and the original text, and selecting the new text corresponding to the original text from the new text set to be selected according to the similarity between each new text and the original text.
It should be noted that, the new paragraph can be obtained by replacing the corresponding original sentence with at least one target new sentence. And replacing the corresponding original paragraph with at least one new paragraph to obtain new text. And obtaining a new text set to be selected through multiple different substitutions.
In addition, in the application, the new texts to be selected can be arranged in a descending order according to the similarity, and a preset number of new texts with the front sequence are selected, and/or a preset number of new texts with the rear sequence are selected as the new texts corresponding to the original texts.
S602: and acquiring the new multi-word in the new text corresponding to the original text.
In the application, the new multi-word in the new text corresponding to the original text is the new multi-word contained in the target new sentence in the new text corresponding to the original text.
S603: and determining the replacement word corresponding to the original text according to the new multi-word in the new text corresponding to the original text.
In the application, all newly increased words in the new text corresponding to the original text can be determined as the replacement words corresponding to the original text. A preset number of new multi-word words can also be selected from all the new multi-word words in the new text corresponding to the original text to be used as the replacement words corresponding to the original text. The present application is not limited in any way.
In the present application, a realization method for selecting the replacement word corresponding to the original text from all newly added multiple words in the new text corresponding to the original text is also provided, and the method may be: acquiring word frequency of each new multi-word in the new text corresponding to the original text or the original text; each new multi-word in the new text corresponding to the original text is arranged in a descending order according to word frequency, and a preset number of new multi-words with the front ordering are selected as the replacement words corresponding to the original text;
In the present application, the preset number may be determined according to the specific situation, for example, K number, where K is an integer greater than or equal to 1 and less than or equal to the total number of the new multi-word words in the new text corresponding to the original text. Or the preset number may be determined according to a percentage of the total number of the new multi-word words in the new text corresponding to the original text, for example, a value of 20% of the total number of the new multi-word words in the new text corresponding to the original text is the preset number.
As yet another embodiment, in the present application, an implementation of calculating a similarity between each new text in a set of new texts to be selected and an original text is disclosed, which may include the following steps:
s701: and acquiring the text information feature vector of each new text and the feature characterization vector of each new text.
In the present application, the text information feature vector of each new text may be composed of a word vector of each new text, a paragraph vector of each new text, and a chapter vector of each new text. In addition, the text information feature vector of each new text, the new vocabulary feature vector of each new text and the replacement word feature vector of each new text can be obtained first, and then the feature vector of each new text is generated according to the text information feature vector of each new text, the new vocabulary feature vector of each new text and the replacement word feature vector of each new text.
S702: and acquiring a text information feature vector of the original text and a feature characterization vector of the original text.
In the present application, the text information feature vector of the original text may be composed of a word vector of the original text, a paragraph vector of the original text, and a chapter vector of the original text. In addition, the text information feature vector of the original text, the new vocabulary feature vector of the original text and the replacement word feature vector of the original text can be obtained first, and then the feature characterization vector of the original text is generated according to the text information feature vector of the original text, the new vocabulary feature vector of the original text and the replacement word feature vector of the original text.
S703: and calculating the similarity between each new text and the original text according to the text information feature vector of each new text, the feature characterization vector of each new text, the text information feature vector of the original text and the feature characterization vector of the original text.
As an implementation manner, a first similarity between the text information feature vector of each new text and the text information feature vector of the original text and a second similarity between the feature vector of each new text and the feature vector of the original text may be calculated, and then the similarity between each new text and the original text is calculated according to the first similarity and the second similarity.
As an implementation manner, the first similarity and the second similarity may be accumulated, and the similarity between each new text and the original text may be calculated.
For ease of understanding, in the present application, suppose each new text is represented by "sample" and the original text is represented by "check material"; let sim1 denote the first similarity between the text information feature vector of a new text and the text information feature vector of the original text, and let sim2 denote the second similarity between the feature characterization vector of the new text and the feature characterization vector of the original text. The similarity between each new text and the original text can then be calculated according to the formula sim(sample, check material) = sim1 + sim2.
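A minimal sketch of this accumulation, assuming cosine similarity as the underlying measure (the patent does not name one):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def text_similarity(
    new_info_vec: np.ndarray, new_char_vec: np.ndarray,
    orig_info_vec: np.ndarray, orig_char_vec: np.ndarray,
) -> float:
    """Accumulate the first similarity (text information feature vectors) and the
    second similarity (feature characterization vectors)."""
    sim1 = cosine(new_info_vec, orig_info_vec)    # first similarity
    sim2 = cosine(new_char_vec, orig_char_vec)    # second similarity
    return sim1 + sim2
```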
In the application, an original text and a new text can be collectively called as a text, based on which, the application discloses a generation mode of a characteristic characterization vector of the text, and the mode can comprise the following steps:
s801: and acquiring a text information feature vector of the text.
In the present application, the text information feature vector may be composed of a word vector of the text, a paragraph vector of the text, and a chapter vector of the text.
S802: a new vocabulary feature vector of the text is obtained.
In the present application, the dimension of the new vocabulary feature vector of a text is the same as the number of newly added multi-words. If a text contains a newly added multi-word, the value of the dimension corresponding to that multi-word in the new vocabulary feature vector of the text is 1; if it does not contain it, the value is 0. Alternatively, the reverse convention may be used: if a text contains a newly added multi-word, the value of the dimension corresponding to that multi-word in the new vocabulary feature vector of the text is 0, and if it does not contain it, the value is 1.
For the sake of understanding, it is assumed that the number of newly increased words is 2, namely "hockey ball and snowman fort", and it is assumed that the new words are expressed by adopting an inverse method, the dimension value of the corresponding "hockey ball" in the new vocabulary feature vector is 0 if the "hockey ball" is contained, the dimension value of the corresponding "hockey ball" in the new vocabulary feature vector is 1 if the "hockey ball" is not contained, the dimension value of the corresponding "snowman fort" in the new vocabulary feature vector is 0 if the "snowman fort" is contained, and the dimension value of the corresponding "snowman fort" in the new vocabulary feature vector is 1 if the "snowman fort" is not contained. Assuming that only "hockey" is included in one text, the new vocabulary feature vector of the new text is [0,1], only "snowman fort" is included in one text, the new vocabulary feature vector of the new text is [1,0], and the "hockey" and "snowman fort" are included in one text, and the new vocabulary feature vector of the new text is [0,0].
S803: and acquiring the character vector of the replacement word of the text.
The specific implementation of this step will be described in detail by the following examples, which will not be described in detail here.
S804: and splicing the new vocabulary feature vector of the text and the replacement word feature vector of the text to generate the generated feature vector of the text.
For ease of understanding, the following examples are given in the present disclosure:
assuming that the new vocabulary feature vector of a text is [0,1], the replacement vocabulary feature vector of the text is [2/5,7/10,24/25,1/10,2/5,0,0,3/10,2/5,1/20,1/7,3/20,3/7], the generated feature vector of the text is [1,0,2/5,7/10,24/25,1/10,2/5,0,0,3/10,2/5,1/20,1/7,3/20,3/7].
S805: integrating the text information feature vector of the text with the generated feature vector of the text to generate a feature characterization vector of the text.
Assume that the text information feature vector of the text is denoted Vec_text and the generated feature vector of the text is denoted Vec_k-hot + Vec_(3+2*p), where Vec_k-hot denotes the new vocabulary feature vector of the text, Vec_(3+2*p) denotes the replacement word feature vector of the text, and "+" denotes the splicing of the two vectors. The feature characterization vector of the text is then Vec_text^T × (Vec_k-hot + Vec_(3+2*p)).
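The sketch below follows steps S801 to S805 under two stated assumptions: "+" in the formula is read as splicing (concatenation) of the new vocabulary feature vector and the replacement word feature vector, and the product Vec_text^T × (...) is read as an outer product flattened into a vector; other readings of the formula are possible.

```python
import numpy as np

def feature_characterization_vector(
    text_info_vec: np.ndarray,     # Vec_text, e.g. 550-dimensional (word + paragraph + chapter vectors)
    new_vocab_vec: np.ndarray,     # Vec_k-hot, one dimension per newly added multi-word
    replacement_vec: np.ndarray,   # Vec_(3+2*p), chapter-level and paragraph-level proportions
) -> np.ndarray:
    # S804: splice the new vocabulary feature vector and the replacement word feature vector
    generated_vec = np.concatenate([new_vocab_vec, replacement_vec])
    # S805: integrate with the text information feature vector (outer product, flattened - an assumption)
    return np.outer(text_info_vec, generated_vec).ravel()
```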
As an embodiment, in the present application, a text information feature vector generation method of a text is disclosed, specifically as follows:
s901: acquiring word vectors of the text, paragraph vectors of the text and chapter vectors of the text;
in the application, the word vector of each word in the text can be extracted through a pre-trained word vector model, and the word vectors of each word in the text are combined (using averaging or maximum value, etc.), so as to obtain the word vector of the text. Paragraph vectors of each paragraph in the text can be extracted through a pre-trained paragraph vector model, and the paragraph vectors of each paragraph in the text are combined (using averaging or maximum value, etc.), so that the paragraph vector of the text is obtained. The text chapter vector can also be extracted through a pre-trained chapter vector model.
S902: and splicing the word vector of the text, the paragraph vector of the text and the chapter vector of the text to form the text information feature vector of the text.
In the present application, assuming that the dimension of the word vector of the text is 300, the paragraph vector of the text is 100, and the dimension of the chapter vector of the text is 150, the dimension of the text information feature vector of the text is 300+100+150=550.
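For steps S901 and S902, the following sketch builds the text information feature vector, assuming averaging as the combination and the 300/100/150 dimensions given above; the pre-trained word, paragraph and chapter vector models themselves are outside the sketch.

```python
import numpy as np

def text_information_feature_vector(
    word_vecs: np.ndarray,        # (num_words, 300) word vectors from a pre-trained word vector model
    paragraph_vecs: np.ndarray,   # (num_paragraphs, 100) paragraph vectors
    chapter_vec: np.ndarray,      # (150,) chapter vector of the whole text
) -> np.ndarray:
    """S901-S902: combine word and paragraph vectors (here by averaging) and splice them with
    the chapter vector, giving a 300 + 100 + 150 = 550 dimensional text information feature vector."""
    word_vec = word_vecs.mean(axis=0)
    paragraph_vec = paragraph_vecs.mean(axis=0)
    return np.concatenate([word_vec, paragraph_vec, chapter_vec])
```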
As an implementation manner, in the present application, a generation manner of a text replacement word feature vector is also disclosed, which specifically includes:
s1001: and determining replacement words in the text, wherein the replacement words are words in the newly added multi-word.
In the application, each vocabulary in the text can be matched with the newly added multi-word, and the successfully matched vocabulary is the replacement word.
S1002: and acquiring text information, wherein the text information comprises the number of paragraphs contained in the text, the number of sentences contained in each paragraph, and the number of vocabularies contained in each sentence.
S1003: and generating a chapter level replacement word feature vector of the text and a paragraph level replacement word feature vector of the text according to the replacement words in the text and the text information.
In this step, the number of paragraphs with the replacement words in the text and the text information can be counted, the number of sentences with the replacement words in the text can be counted, and the chapter level replacement word feature vector of the text can be obtained according to the number of paragraphs with the replacement words in the text, the number of sentences with the replacement words in the text and the number of the replacement words in the text.
As an implementation manner, the chapter level replacement word feature vector of the text may be composed of a paragraph proportion of no replacement word in the text, a sentence proportion of no replacement word in the text, and a vocabulary proportion of non-replacement word in the text, wherein;
paragraph ratio without replacement word in text = 1-number of paragraphs with replacement word in text/number of paragraphs contained in text;
the proportion of sentences without replacement words in the text = 1-the number of sentences with replacement words in the text/the total number of sentences contained in the text, the number of sentences with replacement words in the text being the sum of the number of sentences with replacement words in each paragraph in the text, the total number of sentences contained in the text being the sum of the number of sentences contained in each paragraph in the text.
The word ratio of non-replacement words in the text = 1-the number of replacement words in the text/the total number of words contained in the text, the number of words with replacement words in the text being the sum of the number of replacement words per paragraph in the text, the number of replacement words per paragraph being the sum of the number of replacement words per sentence in the paragraph, the total number of words contained in the text being the sum of the number of words contained in each paragraph in the text, the number of words contained in each paragraph being the sum of the number of words of each sentence in the paragraph.
In the application, the paragraph proportion without the replacement word in the text, the sentence proportion without the replacement word in the text and the vocabulary proportion without the replacement word in the text can be combined to generate the chapter level replacement word feature vector of the text.
In this step, the number of sentences with the replacement words in each paragraph and the number of the replacement words in each paragraph can be counted according to the replacement words in the text and the text information, and the paragraph level replacement word feature vector of the text can be obtained according to the number of sentences with the replacement words in each paragraph and the number of the replacement words in each paragraph.
As an alternative embodiment, the replacement word feature vector of a paragraph may consist of the proportion of sentences without replacement words in the paragraph and the proportion of non-replacement words in the paragraph, wherein:
proportion of sentences without replacement words in a paragraph = 1 - (number of sentences containing replacement words in the paragraph) / (total number of sentences in the paragraph);
proportion of non-replacement words in a paragraph = 1 - (number of replacement words in the paragraph) / (total number of words in the paragraph), where the total number of words in the paragraph is the sum of the number of words contained in each of its sentences.
In the present application, the proportion of sentences without replacement words in a paragraph and the proportion of non-replacement words in that paragraph can be combined to obtain the replacement word feature vector of the paragraph, and the replacement word feature vectors of all paragraphs are spliced to obtain the paragraph level replacement word feature vector of the text.
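For illustration only, the per-level computations described above can be sketched in Python as follows. The nested-list representation of a text (paragraphs made of sentences made of words) and the function names are assumptions introduced solely for this sketch, not part of the disclosed embodiment.

```python
from typing import List, Set

# Assumed representation for this sketch: a text is a list of paragraphs,
# a paragraph is a list of sentences, and a sentence is a list of words.
Text = List[List[List[str]]]

def chapter_level_vector(text: Text, replacement_words: Set[str]) -> List[float]:
    """Chapter level replacement word feature vector:
    [proportion of paragraphs without replacement words,
     proportion of sentences without replacement words,
     proportion of non-replacement words]."""
    num_paragraphs = len(text)
    num_sentences = sum(len(p) for p in text)
    num_words = sum(len(s) for p in text for s in p)
    paragraphs_with = sum(1 for p in text
                          if any(w in replacement_words for s in p for w in s))
    sentences_with = sum(1 for p in text for s in p
                         if any(w in replacement_words for w in s))
    replacement_count = sum(1 for p in text for s in p for w in s
                            if w in replacement_words)
    return [1 - paragraphs_with / num_paragraphs,
            1 - sentences_with / num_sentences,
            1 - replacement_count / num_words]

def paragraph_level_vector(text: Text, replacement_words: Set[str]) -> List[float]:
    """Paragraph level replacement word feature vector: the per-paragraph
    vectors [proportion of sentences without replacement words,
    proportion of non-replacement words] spliced together."""
    vector: List[float] = []
    for p in text:
        num_sentences = len(p)
        num_words = sum(len(s) for s in p)
        sentences_with = sum(1 for s in p
                             if any(w in replacement_words for w in s))
        replacement_count = sum(1 for s in p for w in s
                                if w in replacement_words)
        vector += [1 - sentences_with / num_sentences,
                   1 - replacement_count / num_words]
    return vector
```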
For ease of understanding, the following examples are given in the present disclosure:
Assume that a text has 5 paragraphs, each paragraph has 10 sentences, and the total number of words is 500; replacement words appear in 3 paragraphs and in 15 sentences, and there are 20 replacement words in total.
Then, the proportion of paragraphs without replacement words in the text = 1 - 3/5 = 2/5;
the proportion of sentences without replacement words in the text = 1 - 15/50 = 7/10;
the proportion of non-replacement words in the text = 1 - 20/500 = 24/25;
so the chapter level replacement word feature vector of the text is [2/5, 7/10, 24/25].
Assume that the replacement word feature vectors of the 5 paragraphs in the text are [1/10,2/5], [0,0], [3/10,2/5], [1/20,1/7], [3/20,3/7], respectively; splicing these 5 per-paragraph vectors gives the paragraph level replacement word feature vector of the text: [1/10, 2/5, 0, 0, 3/10, 2/5, 1/20, 1/7, 3/20, 3/7].
S1004: and splicing the text chapter level replacement word feature vector with the text paragraph level replacement word feature vector to obtain the text replacement word feature vector.
For ease of understanding, the following examples are given in the present disclosure:
Splicing the chapter level replacement word feature vector of the text obtained in step S1003 with the paragraph level replacement word feature vector of the text yields the replacement word feature vector of the text: [2/5, 7/10, 24/25, 1/10, 2/5, 0, 0, 3/10, 2/5, 1/20, 1/7, 3/20, 3/7].
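Under the same assumptions as the earlier sketch, the splicing in step S1004 can be checked against the worked example; the chapter level counts are taken directly from the example, and only the arithmetic is shown here.

```python
# Counts taken from the worked example above.
num_paragraphs, num_sentences, num_words = 5, 50, 500
paragraphs_with, sentences_with, replacement_count = 3, 15, 20

chapter_vector = [1 - paragraphs_with / num_paragraphs,   # 2/5
                  1 - sentences_with / num_sentences,     # 7/10
                  1 - replacement_count / num_words]      # 24/25

# Per-paragraph vectors as given in the example, already spliced.
paragraph_vector = [1/10, 2/5, 0, 0, 3/10, 2/5, 1/20, 1/7, 3/20, 3/7]

# Step S1004: splice the chapter level and paragraph level vectors.
replacement_word_feature_vector = chapter_vector + paragraph_vector
print(replacement_word_feature_vector)
# [0.4, 0.7, 0.96, 0.1, 0.4, 0, 0, 0.3, 0.4, 0.05, 0.142857..., 0.15, 0.428571...]
```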
The device for recognizing misclassification words disclosed in the embodiments of the present application is described below; the device described below and the method for recognizing misclassification words described above may be referred to correspondingly with each other.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a device for recognizing a misclassification word according to an embodiment of the present application. As shown in fig. 4, the misclassification word recognition device may include:
an obtaining unit 11, configured to obtain an original word sub-string in an original text to be identified, where the original word sub-string is generated by continuous word combinations in the original text;
a determining unit 12, configured to determine a new multi-word corresponding to the original text according to the original word substring;
and a recognition unit 13, configured to recognize a misclassification word in the original text based on the newly added multi-word.
Optionally, the determining unit includes:
a new word sub-string obtaining unit, configured to obtain a new word sub-string corresponding to the original word sub-string, where the length and the tone-free pinyin of the new word sub-string are the same as those of the original word sub-string;
a new sentence acquisition unit, configured to replace the original word sub-string with the new word sub-string, to obtain a new sentence corresponding to the original text;
a multi-word obtaining unit, configured to obtain a multi-word of the new sentence corresponding to the original text;
and a new multi-word determining unit, configured to determine the new multi-word from the multi-word of the new sentence corresponding to the original text.
Optionally, the new sentence acquisition unit includes:
a new sentence to be selected obtaining unit, configured to replace the original word sub-string with the new word sub-string, to obtain a new sentence to be selected corresponding to the original text;
and a new sentence selecting unit, configured to select the new sentence corresponding to the original text from the new sentences to be selected.
Optionally, the new multi-word determining unit includes:
a word frequency acquisition unit, configured to acquire a word frequency of each multi-word in the multi-word of the new sentence corresponding to the original text;
the new multi-word selecting unit is used for arranging the multi-words of the new sentences corresponding to the original text in a descending order according to word frequency, and selecting a preset number of multi-words which are ranked at the front as the new multi-word.
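For illustration only, the word-frequency ranking performed by the new multi-word selecting unit might look like the following minimal Python sketch; the function name and the flat list of multi-word occurrences as input are assumptions made for this sketch.

```python
from collections import Counter
from typing import Iterable, List

def select_new_multi_words(multi_word_occurrences: Iterable[str],
                           preset_number: int) -> List[str]:
    """Rank the candidate multi-words of the new sentences by word frequency
    in descending order and keep the top preset_number as newly added
    multi-words."""
    frequency = Counter(multi_word_occurrences)
    return [word for word, _ in frequency.most_common(preset_number)]
```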
Optionally, the identifying unit includes:
a target new sentence selecting unit, configured to select a target new sentence from the new sentences corresponding to the original text, where the target new sentence is a new sentence containing the new added multi-word;
a replacement word obtaining unit, configured to obtain a replacement word corresponding to the original text based on the target new sentence;
and the error word determining unit is used for determining that the character substring corresponding to the replacement word in the original text is the error word in the original text.
Optionally, the replacement word obtaining unit includes:
a new text obtaining unit, configured to replace an original sentence corresponding to the target new sentence in the original text with the target new sentence, and obtain a new text corresponding to the original text;
a new text newly added multi-word obtaining unit, configured to obtain the newly added multi-word in the new text corresponding to the original text;
and the replacement word determining unit is used for determining the replacement word corresponding to the original text according to the new multi-word in the new text corresponding to the original text.
Optionally, the new text obtaining unit includes:
the new text set to be selected generating unit is used for replacing an original sentence corresponding to the target new sentence in the original text by the target new sentence to generate a new text set to be selected;
a similarity calculating unit, configured to calculate a similarity between each new text in the new text set to be selected and the original text;
and a new text selection unit, configured to select, according to the similarity between each new text and the original text, the new text corresponding to the original text from the new text set to be selected.
Optionally, the similarity calculating unit includes:
the obtaining unit is used for obtaining the text information feature vector of each new text and the feature characterization vector of each new text; acquiring a text information feature vector of the original text and a feature characterization vector of the original text;
the computing unit is used for computing the similarity between each new text and the original text according to the text information feature vector of each new text, the feature characterization vector of each new text, the text information feature vector of the original text and the feature characterization vector of the original text.
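The application does not fix a particular similarity formula here. Purely as an illustration, one common choice would be a cosine similarity over the concatenated feature vectors, as in the following sketch; the function names and the concatenation strategy are assumptions, not the disclosed computation.

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def text_similarity(new_info_vec: List[float], new_char_vec: List[float],
                    orig_info_vec: List[float], orig_char_vec: List[float]) -> float:
    """One possible combination: concatenate the text information feature
    vector with the feature characterization vector of each text, then
    compare the two concatenations with cosine similarity."""
    return cosine_similarity(new_info_vec + new_char_vec,
                             orig_info_vec + orig_char_vec)
```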
Optionally, the acquiring unit is specifically configured to:
acquiring text information feature vectors, new vocabulary feature vectors and replacement word feature vectors;
and generating the feature characterization vector according to the text information feature vector, the new vocabulary feature vector and the replacement word feature vector.
Optionally, the new text selection unit is specifically configured to:
arranging the new texts to be selected in descending order according to the similarity, and selecting a preset number of new texts ranked at the front and/or a preset number of new texts ranked at the rear as the new text corresponding to the original text.
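As an illustration of the selection just described, the following minimal Python sketch sorts the candidate new texts by similarity and keeps the front- and/or rear-ranked ones; the function and parameter names (front_count, back_count) are assumptions, and the similarity values are taken as already computed.

```python
from typing import List, Sequence

def select_new_texts(candidates: Sequence[str],
                     similarities: Sequence[float],
                     front_count: int = 1,
                     back_count: int = 0) -> List[str]:
    """Sort the candidate new texts by similarity to the original text in
    descending order, then keep the front_count highest-ranked and/or the
    back_count lowest-ranked candidates."""
    ranked = sorted(zip(candidates, similarities),
                    key=lambda pair: pair[1], reverse=True)
    texts = [text for text, _ in ranked]
    selected = list(texts[:front_count])
    if back_count:
        selected += texts[-back_count:]
    return selected
```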
Fig. 5 is a block diagram of a hardware structure of a system for recognizing a misclassification word according to an embodiment of the present application, and referring to fig. 5, the hardware structure of the system for recognizing a misclassification word may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete the communication with each other through the communication bus 4;
processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application, etc.;
the memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
Acquiring an original word sub-string in an original text to be identified, wherein the original word sub-string is generated by continuous word combination in the original text;
determining a new multi-word corresponding to the original text according to the original word substring;
and identifying the misclassified words in the original text based on the newly added multi-word.
Optionally, for the refined functions and extended functions of the program, reference may be made to the description above.
The embodiment of the present application also provides a storage medium storing a program adapted to be executed by a processor, the program being configured to:
acquiring an original word sub-string in an original text to be identified, wherein the original word sub-string is generated by continuous word combination in the original text;
determining a new multi-word corresponding to the original text according to the original word substring;
and identifying the misclassified words in the original text based on the newly added multi-word.
Optionally, for the refined functions and extended functions of the program, reference may be made to the description above.
Finally, it is further noted that relational terms such as first and second, and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for identical or similar parts between the embodiments, reference may be made to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for recognizing a misclassification word, comprising:
acquiring an original word sub-string in an original text to be identified, wherein the original word sub-string is generated by continuous word combination in the original text;
determining a new multi-word corresponding to the original text according to the original word substring;
identifying a misclassification word in the original text based on the newly added multi-word;
Wherein the determining, according to the original word sub-string, a new multi-word corresponding to the original text includes:
acquiring a new word sub-string corresponding to the original word sub-string, wherein the length and the tone-free pinyin of the new word sub-string are the same as those of the original word sub-string;
replacing the original word sub-string with the new word sub-string to obtain a new sentence corresponding to the original text;
acquiring the multi-word of the new sentence corresponding to the original text;
and determining the new multi-word from the multi-words of the new sentence corresponding to the original text.
2. The method of claim 1, wherein replacing the original word sub-string with the new word sub-string results in a new sentence corresponding to the original text, comprising:
replacing the original word substring with the new word substring to obtain a new sentence to be selected, which corresponds to the original text;
and selecting a new sentence corresponding to the original text from the new sentences to be selected.
3. The method of claim 1, wherein the determining the new multi-word from the multi-word of the new sentence corresponding to the original text comprises:
acquiring word frequency of each multi-word in the multi-word of the new sentence corresponding to the original text;
arranging the multi-words of the new sentence corresponding to the original text in descending order according to word frequency, and selecting a preset number of multi-words ranked at the front as the new multi-words.
4. The method of claim 2, wherein the identifying the misclassification word in the original text based on the newly added multi-word comprises:
selecting a target new sentence from the new sentences corresponding to the original text, wherein the target new sentence is a new sentence containing the new added multi-word;
acquiring a replacement word corresponding to the original text based on the target new sentence;
and determining the character substring corresponding to the replacement word in the original text as the misclassification word in the original text.
5. The method of claim 4, wherein the obtaining, based on the target new sentence, a replacement term corresponding to the original text comprises:
replacing an original sentence corresponding to the target new sentence in the original text with the target new sentence to obtain a new text corresponding to the original text;
acquiring new multi-word in the new text corresponding to the original text;
and determining the replacement word corresponding to the original text according to the new multi-word in the new text corresponding to the original text.
6. The method of claim 5, wherein replacing the original sentence corresponding to the target new sentence in the original text with the target new sentence, obtaining the new text corresponding to the original text, comprises:
replacing an original sentence corresponding to the target new sentence in the original text by the target new sentence to generate a new text set to be selected;
calculating the similarity between each new text in the new text set to be selected and the original text;
and selecting the new text corresponding to the original text from the new text set to be selected according to the similarity between each new text and the original text.
7. The method of claim 6, wherein the calculating the similarity of each new text in the set of candidate new texts to the original text comprises:
acquiring a text information feature vector of each new text and a feature characterization vector of each new text;
acquiring a text information feature vector of the original text and a feature characterization vector of the original text;
and calculating the similarity between each new text and the original text according to the text information feature vector of each new text, the feature characterization vector of each new text, the text information feature vector of the original text and the feature characterization vector of the original text.
8. The method of claim 7, wherein obtaining a feature characterization vector comprises:
acquiring text information feature vectors, new vocabulary feature vectors and replacement word feature vectors;
and generating the feature characterization vector according to the text information feature vector, the new vocabulary feature vector and the replacement word feature vector.
9. The method of claim 6, wherein selecting the new text corresponding to the original text from the set of new texts to be selected according to the similarity between each new text and the original text, comprises:
and arranging the new texts to be selected in descending order according to the similarity, and selecting a preset number of new texts ranked at the front and/or a preset number of new texts ranked at the rear as the new text corresponding to the original text.
10. A device for recognizing a misclassification word, comprising:
the device comprises an acquisition unit, a recognition unit and a recognition unit, wherein the acquisition unit is used for acquiring an original word sub-string in an original text to be recognized, and the original word sub-string is generated by continuous word combination in the original text;
the determining unit is used for determining a new multi-word corresponding to the original text according to the original word substring;
the recognition unit is used for recognizing the misclassified words in the original text based on the newly added multi-words;
wherein the determining unit includes:
a new word sub-string obtaining unit, configured to obtain a new word sub-string corresponding to the original word sub-string, where the length and the tone-free pinyin of the new word sub-string are the same as those of the original word sub-string;
a new sentence acquisition unit, configured to replace the original word sub-string with the new word sub-string, to obtain a new sentence corresponding to the original text;
a multi-word obtaining unit, configured to obtain a multi-word of the new sentence corresponding to the original text;
and the new multi-word determining unit is used for determining the new multi-word from the multi-word of the new sentence corresponding to the original text.
11. A system for recognizing a misclassification word, comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the respective steps of the method for recognizing a misclassification word as claimed in any one of claims 1 to 9.
12. A readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method for recognizing a misclassification word as claimed in any one of claims 1 to 9.
CN201911293645.8A 2019-12-16 2019-12-16 Error word recognition method, related device and readable storage medium Active CN111209748B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911293645.8A CN111209748B (en) 2019-12-16 2019-12-16 Error word recognition method, related device and readable storage medium

Publications (2)

Publication Number Publication Date
CN111209748A CN111209748A (en) 2020-05-29
CN111209748B (en) 2023-10-24


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1340804A (en) * 2000-08-30 2002-03-20 国际商业机器公司 Automatic new term fetch method and system
CN104991889A (en) * 2015-06-26 2015-10-21 江苏科技大学 Fuzzy word segmentation based non-multi-character word error automatic proofreading method
CN107203511A (en) * 2017-05-27 2017-09-26 中国矿业大学 A kind of network text name entity recognition method based on neutral net probability disambiguation
WO2019214145A1 (en) * 2018-05-10 2019-11-14 平安科技(深圳)有限公司 Text sentiment analyzing method, apparatus and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Yangsen; Cao Yuanda; Yu Shiwen. Automatic error-checking model and algorithm for Chinese text based on the combination of rules and statistics. Journal of Chinese Information Processing, 2006, (04), full text. *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant