CN110610000A

CN110610000A - Key name context error detection method and system

Info

Publication number: CN110610000A
Application number: CN201910737596.6A
Authority: CN
Inventors: 张勇; 朱立松
Original assignee: CCTV INTERNATIONAL NETWORKS WUXI Co Ltd
Current assignee: CCTV INTERNATIONAL NETWORKS WUXI Co Ltd
Priority date: 2019-08-12
Filing date: 2019-08-12
Publication date: 2019-12-24

Abstract

The invention relates to a method and a system for detecting key name context errors. The method comprises the following steps: determining a set of key names to be audited; selecting N consecutive characters as a dark word, or selecting a key name, and then taking its context; segmenting the context of the dark word or key name with a word segmentation algorithm; vectorizing the word segmentation result; inputting the vectorized result into a machine learning classifier, which outputs a classification result; and, when the key name belongs to the key name set but differs from the output of the classifier, judging that the name appears in a wrong context and prompting an auditor to perform a focused audit. The system comprises an input module, a dark word selection module, a name calibration module, a context taking module, a word segmentation module, a word vectorization module, a classifier module and an alarm module. The advantages of the invention are that dark words that carry a hidden reference can be identified, and key names that appear in the wrong context can be identified.

Description

Key name context error detection method and system

Technical Field

The invention relates to a method and a system for detecting a key name context error, belonging to the technical field of text information processing.

Background

The Internet has become a network in which everyone participates. From matters as large as national political life to matters as small as everyday household necessities, nothing is unrelated to the Internet. The Internet is a virtual space in which people can participate and speak freely, so it naturally tends toward entertainment; but when the Internet is combined with serious topics, it has to be strictly regulated to preserve their seriousness.

However, this seemingly simple work presents a number of challenges in practice. Taking the name "Zhu Yuanzhang" as an example, the common errors are listed below:

1) Homophone errors: three characters with the same pronunciation as "Zhu", "Yuan" and "Zhang" are used to refer to "Zhu Yuanzhang", for example by replacing the surname with the homophonous character for "pig" or the final character with the homophonous character for "ledger". This type of error often arises when a netizen uses a Chinese pinyin input method.

2) Abbreviation errors: for example, "ZYZ" or "zhuyuanzhang" is used instead of "Zhu Yuanzhang".

3) Complete errors: for example, the correct original sentence should be "Zhu Yuanzhang defeated the army of the Yuan empire, defeated all other strong enemies, and established a new dynasty", while the wrong sentence is "Pig Eight defeated the army of the Yuan empire, defeated all other strong enemies, and established a new dynasty". In this case the three characters of "Zhu Yuanzhang" do not appear in the sentence, yet it is clear that "Pig Eight" in the wrong sentence actually refers to Zhu Yuanzhang. Moreover, this way of referring to him is insulting and is very likely to provoke a public opinion incident. Because "Pig Eight" and "Zhu Yuanzhang" differ in both pronunciation and written form, the difficulty of review increases, and an auditor can only infer the reference from the context.

4) Context errors: for example, the correct original sentence would be "The Overlord of Chu was one of the most powerful enemies of Liu Bang of Han, but he too was finally defeated", while the wrong sentence is "The Overlord of Chu was one of the most powerful enemies of Zhu Yuanzhang, but he too was finally defeated". In this case the three characters of "Zhu Yuanzhang" are spelled correctly, but Zhu Yuanzhang has been confused with Liu Bang of Han.

5) Context-irrelevant cases: for example, the sentence "On weekends Zhu Yuanzhang helps the children with their homework and cooks at home, living the life of a devoted wife and mother". It may be true that a person who happens to share the name "Zhu Yuanzhang" actually lives such a life, but since "Zhu Yuanzhang" is a well-known historical figure, this usage is inappropriate. A context that is irrelevant to the name is also treated as a context error.

To deal with the above situations, the prior art generally uses keyword scanning to assist manual review. A keyword scanning system scans the text and highlights the keywords specified by the auditors, to draw the auditors' attention to them. Because of the variety of possible errors, review remains difficult, and the assistance that a keyword scanning system gives to human auditors is limited. The prior-art approach is to scan the text with a computer and match the correct three-character name "Zhu Yuanzhang" together with the known possible errors, such as "Pig Eight", homophone variants, "ZYZ" and "zhuyuanzhang". The matched items are presented to a manual auditor in highlighted form for manual review.
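The keyword scanning baseline described above can be sketched as follows. This is a minimal illustration, not the prior-art system itself: the function name `keyword_scan` and the watch list contents are hypothetical, and English renderings stand in for the Chinese strings.

```python
def keyword_scan(text, keywords):
    """Return (start, end, keyword) for every literal occurrence of a keyword."""
    hits = []
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            hits.append((start, start + len(kw), kw))
            start = text.find(kw, start + 1)
    return sorted(hits)


# Hypothetical watch list: the correct name plus the misspellings known in advance.
WATCH_LIST = ["Zhu Yuanzhang", "ZYZ", "zhuyuanzhang"]

# A listed abbreviation is found and can be highlighted for the auditor.
flagged = keyword_scan("ZYZ founded the Ming dynasty.", WATCH_LIST)

# A variant that is not on the list produces no hit at all.
missed = keyword_scan("Pig Eight defeated the army of the Yuan empire.", WATCH_LIST)
```

The second call returns an empty list, which is the limitation the following paragraphs discuss: a scanner can only match patterns that someone has already enumerated.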

The prior art has the following defects:

1) Keyword scanning cannot enumerate all possible error patterns. In addition to the common errors listed above, there may be a variety of other specific errors, such as "Pig Weight Eight".

2) Keyword scanning produces a large number of scan results, so it cannot effectively help auditors shorten the audit time or improve audit efficiency.

3) The prior art can only mark the keywords that a manual auditor needs to focus on, whereas judging whether an error exists actually requires the context to be considered as a whole. In the complete-error case listed above, for example, the prior art is ineffective because no keyword is matched in the wrong sentence.

4) For context errors and context-irrelevant cases, the existing keyword marking technology is equally powerless. The auditor is required to have some knowledge of political history: only someone who knows that Zhang Fei and Zhu Yuanzhang did not live in the same era can find the error in such a sentence.

Disclosure of Invention

The invention provides a key name context error detection method and system that aim to overcome the above defects of the prior art. By comprehensively analyzing the context associated with a name, the invention predicts whether a key name has been replaced by a completely wrong word, and whether a key name appears in a wrong context or in a context irrelevant to it.

The technical solution of the invention is as follows: a key name context error detection method comprises the following steps:

step 1: first, determining a set of key names to be audited, the set including an "other" element;

step 2: selecting N consecutive characters in an article, a sentence or a paragraph as a dark word, and then taking the context of the dark word; or selecting a key name in an article, a sentence or a paragraph and taking the context of the key name;

step 3: segmenting the preceding and following context of the dark word or key name separately with a word segmentation algorithm;

step 4: vectorizing the word segmentation result;

step 5: inputting the vectorized word segmentation result into a machine learning classifier, which outputs an indication of whether the context belongs to one of the key names in the set determined in step 1 and, if so, which one;

step 6: when the dark word does not belong to the names in the key name set and the output of the classifier is not "other", judging that the dark word is a wrong word and prompting an auditor to perform a focused audit;

and when the key name belongs to the names in the key name set and differs from the output of the classifier, judging that the name appears in a wrong context and prompting an auditor to perform a focused audit.

Preferably, in step 1, the key name set NAMES is determined as NAMES = { "α", "β", "γ", …, "NONE" }, where "α", "β", "γ" and so on are names that need focused attention, and "NONE" denotes "other".

Preferably, in step 2, the context of the dark word or key name is taken as follows: the M characters to the left of the dark word or key name are taken as its preceding context, and the M characters to the right are taken as its following context; when the dark word or key name appears at the beginning of a sentence, there is no preceding context or it contains fewer than M characters; when the dark word or key name appears at the end of a sentence, there is no following context or it contains fewer than M characters.

Preferably, in step 4, the word segmentation result is vectorized as follows: each word or character is converted into a D-dimensional vector; if the preceding context contains K words and the following context also contains K words, a data matrix of D rows and 2K columns is obtained from the context, and this matrix is denoted C_{D×2K}; if there are fewer than K words, the missing columns are filled with zero vectors.

Preferably, in step 5, C_{D×2K} is input into a machine learning classifier, and the case in which the classifier output indicates that the context does not belong to any key name in the set determined in step 1 is denoted "NONE".

Preferably, in step 6, when the dark word does not belong to the set NAMES - { "NONE" } and the output of the classifier is not "NONE", the dark word is judged to be a wrong word and the auditor is prompted to perform a focused audit, where NAMES - { "NONE" } is the difference between the set NAMES and the set { "NONE" };

and when the key name belongs to the set NAMES - { "NONE" } and differs from the output of the classifier, the name is judged to appear in a wrong context and the auditor is prompted to perform a focused audit.

A key name context error detection system comprises:

an input module, used for inputting a given text to be audited and the set of key names to be audited;

a dark word selection module, used for treating any N consecutive characters in the text to be audited as a candidate dark word;

a name calibration module, used for directly marking the names from the key name set that appear in the text to be audited;

a context taking module, used for selecting the context of a dark word or name, according to the dark word selected by the dark word selection module or the name marked by the name calibration module;

a word segmentation module, used for segmenting the context selected by the context taking module with a word segmentation algorithm;

a word vectorization module, used for converting the words obtained by the word segmentation module into D-dimensional vectors, wherein the preceding and following contexts are each segmented into K words, giving a matrix of D rows and 2K columns that represents the context of the dark word or name, and zero vectors are used as padding when there are fewer than K words;

a classifier module, which is a machine learning classifier, used for taking as input the matrix of D rows and 2K columns output by the word vectorization module and outputting an element of the key name set;

an alarm module, used for judging that a dark word refers to a key name when the classifier module predicts that the context of the dark word belongs to that key name, raising an alarm and causing the output module to highlight the dark word so as to prompt an auditor to perform a focused audit; or, when the context of a key name A is input into the classifier module and the output of the classifier module is not A, judging that the name appears in a wrong context, raising an alarm and causing the output module to highlight it so as to prompt an auditor to perform a focused audit;

and an output module, used for outputting the highlight signal transmitted by the alarm module.

Preferably, the text to be audited is an article, a sentence or a paragraph;

the set of key names to be audited is NAMES, NAMES = { "α", "β", "γ", …, "NONE" }, where "α", "β", "γ" and so on are names that need focused attention, and "NONE" denotes "other";

when the context taking module selects the context of a dark word or name, the M characters to the left of the dark word or name are taken as its preceding context and the M characters to the right are taken as its following context; when the dark word or name appears at the beginning of a sentence, there is no preceding context or it contains fewer than M characters, and when the dark word or name appears at the end of a sentence, there is no following context or it contains fewer than M characters;

the matrix of D rows and 2K columns is denoted C_{D×2K};

the classifier module predicting the context of a dark word as belonging to a key name means that the dark word does not belong to the set NAMES - { "NONE" } and the output of the classifier is not "NONE"; the context of a key name A being input into the classifier with an output that is not A means that the key name belongs to the set NAMES - { "NONE" } and differs from the output of the classifier.

The advantages of the invention are as follows: by comprehensively analyzing the context associated with a name, the invention can predict whether a key name has been replaced by a completely wrong word, and whether a key name appears in a wrong or irrelevant context; dark words that carry a hidden reference can be identified, and key names that appear in the wrong context can be identified; auditors are thereby effectively assisted, the time needed to audit Internet content is shortened, and audit efficiency and accuracy are improved.

Drawings

FIG. 1 is a schematic diagram of the architecture of a key name context error detection system according to the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and specific embodiments.

Research in linguistics and in computer natural language processing indicates that the meaning of a word is determined not by the word itself but by its context. A word can carry a particular meaning on its own only when a large body of language usage has given it a relatively stable meaning.

Example 1: in the wrong sentence "Pig Eight defeated the army of the Yuan empire, defeated all other strong enemies, and established a new dynasty", a human reader can easily infer that "Pig Eight" means Zhu Yuanzhang; that is, the reader infers it from the context and from prior background knowledge. Many people defeated the army of the Yuan empire, but only one of them established a new dynasty, namely Zhu Yuanzhang. Based on this context and background, the human reader therefore infers that "Pig Eight" refers to Zhu Yuanzhang. This example shows that although "Pig Eight" differs from "Zhu Yuanzhang" in pronunciation and written form, and the wording is completely wrong, the reader can still infer the reference because the context plays a decisive role.

Example 2: in the wrong sentence "On weekends Zhu Yuanzhang helps the children with their homework and cooks at home, living the life of a devoted wife and mother", the three characters of "Zhu Yuanzhang" are completely correct, but the sentence as a whole is wrong because the name is paired with the wrong context. Since Zhu Yuanzhang is a well-known historical figure, a large body of language usage has given the name a very stable meaning: it denotes the founding emperor of the Ming dynasty, not a housewife. The name "Zhu Yuanzhang" is therefore wrongly matched with the context of "assisting one's husband and teaching one's children".

Examples

A key name context error detection method comprises the following steps:

Step 1: first, a set NAMES of key names to be audited is determined, for example NAMES = { "Zhu Yuanzhang", "elbow king", "Liu Bang", "NONE" }, where "Zhu Yuanzhang" and the like are names that need focused attention and "NONE" denotes "other".

Step 2: for any article, sentence or paragraph, any N consecutive characters are selected as a word; for convenience of description this word is called a "dark word", because it may refer to a politically sensitive person. The context of this word is then taken as follows: the M characters to the left of the word are taken as its preceding context, and the M characters to the right as its following context. When the dark word appears at the beginning of a sentence, there is no preceding context or it contains fewer than M characters; when the dark word appears at the end of a sentence, there is no following context or it contains fewer than M characters.

"Pig Eight defeated the army of the Yuan empire, defeated all other strong enemies, and established a new dynasty."

For the above sentence, two examples are given (N and M count characters of the original Chinese sentence):

Example 1: assuming N = 2 and M = 10, the two characters for "defeated" (in their second occurrence) are selected as the dark word; its preceding context is [Eight defeated the army of the Yuan empire] and its following context is [all other strong enemies, established].

Example 2: assuming N = 3 and M = 10, the three characters rendered as "other" are selected as the dark word; its preceding context is [army of the Yuan empire, defeated] and its following context is [all strong enemies, established a new].

Alternatively, for any article, sentence or paragraph, a key name is selected and its context is taken. The context of a key name is selected in the same way as the context of a dark word.
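The context selection of step 2 can be sketched with ordinary string slicing. This is a minimal illustration under stated assumptions: the function name `take_context` is hypothetical, and the boundaries of the input string are treated as the sentence boundaries, so a window near either end simply comes out shorter than M characters.

```python
def take_context(text, start, n, m):
    """Return (preceding, word, following) for the N-character word at `start`.

    `preceding` holds up to M characters to the left of the word and
    `following` up to M characters to the right; either may be shorter
    than M, or empty, when the word sits near the start or end of `text`.
    """
    end = start + n
    preceding = text[max(0, start - m):start]
    word = text[start:end]
    following = text[end:end + m]
    return preceding, word, following


sample = "abcdefghij"
at_start = take_context(sample, 0, 2, 3)   # no preceding context
in_middle = take_context(sample, 4, 2, 3)  # full windows on both sides
at_end = take_context(sample, 8, 2, 3)     # no following context
```

The same function serves both branches of step 2: for a key name, `start` and `n` are the position and length of the matched name instead of an arbitrary window.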

Step 3: the preceding and following contexts are segmented separately. For example, the preceding context [Eight defeated the army of the Yuan empire] is divided into the word sequence [ "Eight", "defeated", (aspect particle), "Yuan empire", "of", "army" ], and the following context [all other strong enemies, established] is divided into the word sequence [ (aspect particle), "other", "all", "strong enemies", ",", "established" ]. Chinese word segmentation is a common prior-art algorithm in Chinese natural language processing and is not described here. Instead of word segmentation, the context may simply be split into individual characters and punctuation marks.
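The simpler alternative mentioned at the end of step 3, splitting the context into individual characters and punctuation marks, needs no segmentation library. The sketch below is illustrative only: the name `segment_chars` is hypothetical, dropping whitespace is an assumption, and the sample string is an illustrative fragment rather than text quoted from the patent.

```python
def segment_chars(context):
    """Split a context string into single characters, keeping punctuation.

    Whitespace is dropped (an assumption); every other character,
    including punctuation, becomes one token.
    """
    return [ch for ch in context if not ch.isspace()]


# Illustrative Chinese fragment: "strong enemy", a full-width comma, "establish".
tokens = segment_chars("强敌，建立")
```

A dictionary-based or statistical word segmenter would instead group characters into multi-character words, as in the word sequences shown above.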

Step 4: the word segmentation result is vectorized, that is, each word (or character) is converted into a D-dimensional vector. Assuming the preceding context contains K words and the following context also contains K words, any shortfall below K words is filled with zero vectors. A data matrix of D rows and 2K columns is thus obtained from the context, and this matrix is denoted C_{D×2K}.
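The construction of the D-row, 2K-column matrix in step 4 can be sketched in plain Python. Several details here are assumptions, since the text only specifies zero padding: the embedding table `embed` is a hypothetical dictionary from word to vector, unknown words map to the zero vector, padding columns are appended after the real ones, and a context longer than K words keeps the K words nearest the dark word.

```python
def vectorize_context(upper_words, lower_words, embed, d, k):
    """Build the D-row, 2K-column context matrix as a list of D rows.

    Columns 0..K-1 come from the preceding context and columns K..2K-1
    from the following context; missing columns are zero vectors.
    """
    zero = [0.0] * d

    def to_columns(words):
        cols = [list(embed.get(w, zero)) for w in words]
        # Pad with zero vectors until there are exactly K columns.
        return cols + [list(zero) for _ in range(k - len(cols))]

    upper = upper_words[-k:] if k > 0 else []  # K words nearest the dark word
    lower = lower_words[:k]
    columns = to_columns(upper) + to_columns(lower)
    # Transpose the 2K column vectors into D rows.
    return [[col[row] for col in columns] for row in range(d)]


# Toy 2-dimensional embeddings (hypothetical values).
embed = {"a": [1.0, 2.0], "b": [3.0, 4.0]}
C = vectorize_context(["a"], ["b", "a"], embed, d=2, k=2)
```

Here the preceding context has only one word, so its second column is zero padding, and `C` has 2 rows and 4 columns.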

Step 5: C_{D×2K} is used as the input of a machine learning classifier, and the output of the classifier indicates which name in the key name set given in step 1 (possibly "NONE") the context belongs to.
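The text does not say which kind of machine learning classifier is used in step 5. Purely to make the input and output concrete, the sketch below swaps in a nearest-centroid classifier over the flattened matrix; the class name, the training data and the choice of technique are all assumptions, and any classifier that maps a context matrix to an element of the name set would fill the same role.

```python
import math


def flatten(matrix):
    """Flatten a D x 2K matrix (list of rows) into one feature vector."""
    return [x for row in matrix for x in row]


class NearestCentroidClassifier:
    """Toy stand-in: predicts the label whose mean training vector is closest."""

    def fit(self, matrices, labels):
        sums, counts = {}, {}
        for matrix, label in zip(matrices, labels):
            vec = flatten(matrix)
            if label not in sums:
                sums[label] = [0.0] * len(vec)
                counts[label] = 0
            sums[label] = [s + x for s, x in zip(sums[label], vec)]
            counts[label] += 1
        self.centroids = {
            label: [s / counts[label] for s in sums[label]] for label in sums
        }
        return self

    def predict(self, matrix):
        vec = flatten(matrix)
        return min(self.centroids, key=lambda lb: math.dist(vec, self.centroids[lb]))


# Hypothetical training contexts (1 x 2 matrices) labelled with elements of NAMES.
clf = NearestCentroidClassifier().fit(
    [[[1.0, 0.0]], [[0.0, 1.0]]],
    ["Zhu Yuanzhang", "NONE"],
)
near_name = clf.predict([[0.9, 0.1]])
near_none = clf.predict([[0.1, 0.8]])
```

The important property is the label space: the classifier is trained on contexts labelled with key names, and contexts of ordinary words are labelled "NONE".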

Step 6: when the dark word does not belong to the set NAMES - { "NONE" } (the difference between the set NAMES and the set { "NONE" }) and the output of the classifier is not "NONE", the dark word is judged to be a wrong word and an auditor is prompted to perform a focused audit.
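The decision rule of step 6, together with the key name branch stated in the technical solution and in the alarm module description, reduces to two set-membership tests. In the sketch below the function name `audit_flag` and the returned labels are hypothetical; the conditions themselves follow the text.

```python
NONE = "NONE"


def audit_flag(word, predicted, names):
    """Decide whether a word should be highlighted for a focused audit.

    `word` is the dark word or key name whose context was classified,
    `predicted` is the classifier output, and `names` is the set NAMES.
    """
    key_names = names - {NONE}  # the difference set NAMES - {"NONE"}
    if word not in key_names and predicted != NONE:
        return "wrong_word"     # a dark word standing in for a key name
    if word in key_names and predicted != word:
        return "wrong_context"  # a key name placed in a context that is not its own
    return None                 # nothing to report


names = {"Zhu Yuanzhang", "Liu Bang", NONE}
dark = audit_flag("Pig Eight", "Zhu Yuanzhang", names)
mixed_up = audit_flag("Zhu Yuanzhang", "Liu Bang", names)
irrelevant = audit_flag("Zhu Yuanzhang", NONE, names)
```

Note that a key name whose context is classified as "NONE" is also flagged, which corresponds to the context-irrelevant case from the background section.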

As shown in FIG. 1, a key name context error detection system comprises

The input module is used for inputting the given text to be audited and the specified name set. For example, the text to be audited is an article on a political subject or a news report. The name set is the set of key names of concern to the audit, as in the example of step 1 above.

The dark word selection module is used for treating any N consecutive characters in the text to be audited as a candidate dark word. When the text to be audited is long, there are therefore many possible dark words.
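The dark word selection module amounts to a sliding window over the text. As a rough sketch (the generator name `candidate_dark_words` is hypothetical), a text of length L yields L - N + 1 candidates when L is at least N, which is why long texts produce many candidates.

```python
def candidate_dark_words(text, n):
    """Yield (start, word) for every window of N consecutive characters."""
    for start in range(len(text) - n + 1):
        yield start, text[start:start + n]


candidates = list(candidate_dark_words("abcde", 2))
```

Each candidate's start position can then be used to select its context and classify it.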

The name calibration module is used for directly marking the names from the name set that appear in the text to be audited. This step requires only simple string matching.

The context taking module is used for selecting the context of a dark word or a marked name, according to the result of dark word selection or of name calibration.

The word segmentation module is used for segmenting the context with a word segmentation algorithm. This is needed because, unlike English, Chinese is written as a continuous sequence of characters, and words composed of Chinese characters are not separated by spaces.

The word vectorization module is used for converting words into D-dimensional vectors; if the preceding and following contexts are each divided into K words, a matrix of D rows and 2K columns is obtained to represent the context of a given dark word or name. This representation facilitates computer processing.

The classifier module takes as input the matrix of D rows and 2K columns output by the word vectorization module and outputs an element of the name set. The classifier can be trained with a machine learning method.

The alarm module is used for judging that a dark word refers to a key name when the classifier predicts that the context of the dark word belongs to that key name (that is, the dark word does not belong to the set NAMES - { "NONE" } and the output of the classifier is not "NONE"), raising an alarm and highlighting the dark word to prompt an auditor to perform a focused audit; or, when the context of a key name A is input into the classifier and the output of the classifier is not A (that is, the key name belongs to the set NAMES - { "NONE" } and differs from the output of the classifier), judging that the name appears in a wrong context, raising an alarm and highlighting it to prompt an auditor to perform a focused audit.

All the above components are prior art, and those skilled in the art can use any model and existing design that can implement their corresponding functions.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various changes and modifications can be made without departing from the inventive concept of the present invention, and these changes and modifications are all within the scope of the present invention.

Claims

1. A key name context error detection method is characterized by comprising the following steps:

step 4: vectorizing the word segmentation result;

2. The key name context error detection method according to claim 1, wherein in step 1 the key name set NAMES is determined as NAMES = { "α", "β", "γ", …, "NONE" }, where "α", "β", "γ" and so on are names that need focused attention, and "NONE" denotes "other".

3. The key name context error detection method according to claim 2, wherein in step 2 the context of the dark word or key name is taken as follows: the M characters to the left of the dark word or key name are taken as its preceding context, and the M characters to the right are taken as its following context; when the dark word or key name appears at the beginning of a sentence, there is no preceding context or it contains fewer than M characters; when the dark word or key name appears at the end of a sentence, there is no following context or it contains fewer than M characters.

4. The key name context error detection method according to claim 3, wherein in step 4 the word segmentation result is vectorized as follows: each word or character is converted into a D-dimensional vector; if the preceding context contains K words and the following context also contains K words, a data matrix of D rows and 2K columns is obtained from the context, and this matrix is denoted C_{D×2K}; if there are fewer than K words, the missing columns are filled with zero vectors.

5. The key name context error detection method according to claim 4, wherein in step 5 C_{D×2K} is input into a machine learning classifier, and the case in which the classifier output indicates that the context does not belong to any key name in the set determined in step 1 is denoted "NONE".

6. The key name context error detection method according to claim 5, wherein in step 6, when the dark word does not belong to the set NAMES - { "NONE" } and the output of the classifier is not "NONE", the dark word is judged to be a wrong word and the auditor is prompted to perform a focused audit, where NAMES - { "NONE" } is the difference between the set NAMES and the set { "NONE" },

7. A key name context error detection system is characterized by comprising

8. The system according to claim 7, wherein the text to be reviewed is an article, a sentence, or a paragraph;

the matrix of D rows and 2K columns is denoted C_{D×2K};