CN110941720B

CN110941720B - Knowledge base-based specific personnel information error correction method

Info

Publication number: CN110941720B
Application number: CN201910865592.6A
Authority: CN
Inventors: 黄瑞章
Original assignee: Guizhou Cloud Pioneer Tech Co ltd; Guizhou University
Current assignee: Guizhou Cloud Pioneer Tech Co ltd; Guizhou University
Priority date: 2019-09-12
Filing date: 2019-09-12
Publication date: 2023-06-09
Anticipated expiration: 2039-09-12
Also published as: CN110941720A

Abstract

The invention discloses a knowledge base-based method for correcting errors in information about specific persons, and relates to the technical field of computer character recognition. The method uses a Double-LSTM boundary model to identify and extract the names and other information of specific persons from the text to be checked, calculates the similarity between the extracted information and the specific-person information in a specific-person knowledge base, and judges whether the names and related information in the current text are correct. It builds a set of correct names and screens out suspected wrong names. For each suspected wrong name, similarity is calculated first against the correct names found in the text and then against the specific-person knowledge base, whose auxiliary information is also matched, and the suspected wrong information is corrected accordingly. The method addresses the difficulty of recognizing names when wrongly written characters change the semantics of a sentence, improves the recognition of names and name-related information, and corrects specific names and their related information in text end to end.

Description

Knowledge base-based specific personnel information error correction method

Technical Field

The invention relates to the technical field of computer character recognition, in particular to a specific personnel information error correction method based on a knowledge base.

Background

Most current error correction techniques are limited to calculating the edit distance between a target field and common words, and correcting the field with the candidate word that has the smallest edit distance among those below an edit distance threshold. In real text, however, edit distance alone cannot accurately determine whether the target field is wrong. Contextual information can often help to find and correct errors, yet the prior art rarely extracts information from the context and uses it for error correction. Likewise, the candidate libraries that the prior art matches against the target field usually contain only the candidate words themselves, without related auxiliary information, which greatly reduces the accuracy of error detection and correction.

Existing named entity extraction methods mostly use sequence labeling models. In recent years many neural network techniques in particular have been applied to sequence labeling models, with good results in some application scenarios. In sentences that contain erroneous information, however, the effectiveness of sequence labeling for entity extraction, and especially for person name extraction, drops sharply, because when a sequence labeling model encounters a wrongly written word it often cannot determine whether the word is a new word or a miswriting of some other word.

Disclosure of Invention

The invention aims to provide a specific personnel information error correction method based on a knowledge base, so as to solve the problems in the prior art.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

A specific personnel information error correction method based on a knowledge base comprises the following steps:

S1, preprocessing a target text, and establishing a common error dictionary;

S2, matching the target text against the common error dictionary and correcting the errors found;

S3, recognizing the preprocessed text to obtain person names and/or title information;

S4, comparing the recognized person names with the person names in the knowledge base, and calculating the similarity.

Preferably, the preprocessing in step S1 includes sentence splitting: the text is split into sentences at the sentence-delimiter characters it contains.

Preferably, step S2 specifically includes: taking sentences as the unit of calculation, checking by string matching whether each text sequence contains an error listed in the common error dictionary; if so, storing the error field as a recognition result and correcting it; if no common error is found, proceeding to step S3.

Preferably, the text recognition in step S3 is performed in the following manner:

S3.1, using the HanLP tool to assist a Double-LSTM boundary recognition model in recognizing information such as person names and titles in sentences;

S3.2, extracting the pinyin features and wubi (five-stroke) features of the name strings.

Preferably, step S3.1 specifically includes:

1) Traversing each word in the sentence to be identified, and dividing the sentence into a left clause and a right clause by taking the current word as the center;

2) Inputting the left clause and the right clause into two different LSTMs respectively for coding;

3) Concatenating the encoded vectors and inputting them into a fully connected layer for classification, to judge whether the current word is an entity start boundary;

4) Taking the 2-grams and 3-grams that begin at a boundary as candidate names, segmenting the sentences with the HanLP tool, and recognizing person names by the part-of-speech tag nr;

5) Identifying titles by the part-of-speech tag nnt after word segmentation, and taking the person name closest to each title in the title's context as the name to which the title belongs.
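Steps 1) and 4) can be illustrated with a short sketch. It assumes that the boundary classifier of steps 2) and 3) has already produced a list of predicted start positions; the function names `split_contexts` and `candidate_names` are hypothetical and do not come from the method itself.

```python
def split_contexts(sentence: str):
    """For each character, return (left clause, current character, right clause)."""
    return [(sentence[:i], ch, sentence[i + 1:]) for i, ch in enumerate(sentence)]


def candidate_names(sentence: str, boundary_indices):
    """Collect the 2-grams and 3-grams that start at each predicted boundary."""
    candidates = []
    for i in boundary_indices:
        for n in (2, 3):
            # Skip n-grams that would run past the end of the sentence.
            if i + n <= len(sentence):
                candidates.append(sentence[i:i + n])
    return candidates
```

Each triple returned by `split_contexts` would feed the two encoders, with the current character itself excluded from both clauses.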

Preferably, step S3.2 specifically includes:

extracting the pinyin features of the name strings, including the pinyin of each character, with flat-tongue and retroflex initials unified as flat-tongue initials and lateral and nasal initials unified as lateral initials; and extracting the wubi (five-stroke) features of the name strings, including the wubi code of each character.
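The pinyin unification can be sketched as follows. This is one reading of the text, assuming that retroflex initials (zh, ch, sh) are mapped to flat-tongue initials (z, c, s) and that the nasal initial n is mapped to the lateral l. The input is assumed to be toneless pinyin syllables that have already been looked up for each character; the character-to-pinyin and character-to-wubi lookups need external tables and are not shown.

```python
# Assumed mapping of initials; longer prefixes are listed before "n" so that
# each syllable is rewritten at most once.
_INITIAL_MAP = (("zh", "z"), ("ch", "c"), ("sh", "s"), ("n", "l"))


def normalize_syllable(syllable: str) -> str:
    """Normalize the initial of one toneless pinyin syllable."""
    for src, dst in _INITIAL_MAP:
        if syllable.startswith(src):
            return dst + syllable[len(src):]
    return syllable


def pinyin_features(syllables):
    """Return the normalized pinyin of each character of a name."""
    return [normalize_syllable(s.lower()) for s in syllables]
```

With this normalization, two names that differ only in a commonly confused initial have a pinyin edit distance of zero.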

Preferably, the step S4 specifically includes:

S4.1, judging whether the recognized person name is the same as a person name in the knowledge base: if the recognized person name is a specific person name in the knowledge base, storing it in the "text specific person name set"; otherwise, storing it in the "suspected wrong person name set";

S4.2, calculating the similarity between each suspected wrong person name and the specific person names in the text; when the similarity is greater than the threshold, correcting the suspected wrong name with the specific person name; otherwise, proceeding to step S4.3;

S4.3, calculating the similarity between the suspected wrong person name and the person names in the knowledge base, and judging whether the similarity is greater than a threshold; if so, correcting the name with the knowledge base name; otherwise, judging that the name does not need correction.
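The three sub-steps form a cascade, sketched below. The similarity function is passed in as a parameter, and a single threshold is used for both stages; both choices, like the function name `correct_names`, are assumptions made for illustration.

```python
def correct_names(recognized, kb_names, similarity, threshold):
    """Return a mapping from each suspected wrong name to its correction."""
    kb = set(kb_names)
    # S4.1: split recognized names into confirmed names and suspects.
    text_names = [n for n in recognized if n in kb]
    suspects = [n for n in recognized if n not in kb]

    corrections = {}
    for suspect in suspects:
        # S4.2 tries the names confirmed in the text; S4.3 falls back to the
        # whole knowledge base.
        for pool in (text_names, list(kb_names)):
            best = max(pool, key=lambda c: similarity(suspect, c), default=None)
            if best is not None and similarity(suspect, best) > threshold:
                corrections[suspect] = best
                break
    return corrections
```

A suspect that clears neither stage is absent from the result, which corresponds to the final branch of S4.3.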

Preferably, step S4.2 specifically includes:

Name similarity calculation: name similarity = name spelling similarity + title similarity, where the two terms are calculated as follows.

Name spelling similarity calculation: the edit distances between the name of the specific person and the suspected wrong name are calculated separately for the character strings, the pinyin, and the wubi codes, where the pinyin/wubi edit distance is the average of the edit distances between the pinyin/wubi codes of each character; the weighted average of the three edit distances is then taken as the composite distance. The composite distance is compared with a given threshold: if it is smaller than the threshold, name spelling similarity = threshold - composite distance; otherwise, name spelling similarity = 0. The threshold can be set manually according to how loose or strict the spelling similarity requirement is in the specific application, and its value usually lies in the range 0-1.

Title similarity calculation: title similarity = (number of elements in the intersection of the title set of the current person name and the title set of the specific person name in the knowledge base) / (number of elements in the title set of the current person name); if the title set of the current person name is not empty but the intersection is empty, the title similarity is negative.
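The spelling part of this calculation can be sketched as follows. The weights, the position-by-position alignment of characters, and the use of raw (unnormalized) edit distances are assumptions, since the text does not fix them; the per-character pinyin and wubi codes are taken as given inputs.

```python
from itertools import zip_longest


def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences (rolling one-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def mean_code_distance(codes_a, codes_b) -> float:
    """Average per-character edit distance between two lists of codes.

    Assumption: characters are aligned by position, and a missing character
    is compared against an empty code.
    """
    pairs = list(zip_longest(codes_a, codes_b, fillvalue=""))
    if not pairs:
        return 0.0
    return sum(edit_distance(x, y) for x, y in pairs) / len(pairs)


def composite_distance(name_a, name_b, pinyin_a, pinyin_b, wubi_a, wubi_b,
                       weights=(0.5, 0.25, 0.25)) -> float:
    """Weighted average of the character, pinyin, and wubi edit distances."""
    w_char, w_pinyin, w_wubi = weights
    total = (w_char * edit_distance(name_a, name_b)
             + w_pinyin * mean_code_distance(pinyin_a, pinyin_b)
             + w_wubi * mean_code_distance(wubi_a, wubi_b))
    return total / (w_char + w_pinyin + w_wubi)


def spelling_similarity(distance: float, threshold: float) -> float:
    """threshold - distance when the distance is below the threshold, else 0."""
    return threshold - distance if distance < threshold else 0.0
```

Because homophones have a pinyin distance of zero and visually similar characters tend to share wubi keystrokes, such miswritings yield a smaller composite distance than the character-level distance alone would suggest.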

Preferably, step S4.3 specifically includes:

Name similarity II calculation: name similarity II = name spelling similarity II + title similarity II, where the two terms are calculated as follows.

Name spelling similarity II calculation: the edit distances between the specific person name in the knowledge base and the suspected wrong name are calculated separately for the character strings, the pinyin, and the wubi codes, and the weighted average of the three edit distances is taken as the composite distance. The composite distance is compared with a given threshold: if it is smaller than the threshold, name spelling similarity II = threshold - composite distance; otherwise, name spelling similarity II = 0.

Title similarity II calculation: title similarity II = (number of elements in the intersection of the title set of the suspected wrong person name and the title set of the specific person name in the knowledge base) / (number of elements in the title set of the suspected wrong person name); if the title set of the suspected wrong person name is not empty but the intersection is empty, title similarity II is negative.

The beneficial effects of the invention are as follows:

The invention provides and implements a complete knowledge base-based method for correcting errors in specific personnel information. First, rather than only calculating the edit distance between the person name to be recognized in the input text and the person names in the knowledge base, the method extracts the person name information and, at the same time, extracts information such as the titles of the specific persons in the sentence as auxiliary information for the judgment, and compares it with the information in the knowledge base. Contextual semantic information in the sentence is thereby used to make the error correction judgment more reasonable and accurate, and information other than person names can also be recognized and corrected. Second, the invention uses a Double-LSTM model to recognize names and other information, which avoids the difficulty that names cannot be recognized when a sentence contains wrongly written characters: for each word of the sentence, the information to its left and right (excluding the current word) is extracted, which effectively addresses the influence of wrongly written characters on the semantics of the whole sentence and improves the recognition of names and name-related information.

Drawings

FIG. 1 is a flowchart of an implementation of the knowledge base-based specific personnel information error correction method in embodiment 1;

FIG. 2 is an example of the common error dictionary in embodiment 1;

FIG. 3 is a schematic diagram of the Double-LSTM model employed in embodiment 1.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is presented by way of example only and is not intended to limit the invention.

Example 1

The embodiment provides a specific personnel information error correction method based on a knowledge base, which comprises the following steps:

1) Splitting the input text into sentences. The model described below operates on sentences as its sequence unit. The input text is split at the sentence-delimiter characters (, |;; \n\r).
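A minimal sketch of the sentence-splitting step is given below. The delimiter list in the text is only partly legible, so the set used here (Chinese and ASCII sentence-ending punctuation plus line breaks) is an assumption for illustration, not the set used in the embodiment.

```python
import re

# Assumed delimiter set: sentence-ending punctuation and line breaks.
SENTENCE_DELIMITERS = r"[。！？；!?;\r\n]+"


def split_sentences(text: str):
    """Split text into sentences, dropping empty fragments."""
    parts = re.split(SENTENCE_DELIMITERS, text)
    return [p.strip() for p in parts if p.strip()]
```

Each returned sentence is then processed as one input sequence by the later steps.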

2) Matching against the common error dictionary: a common error dictionary is maintained, and each input sequence is checked by string matching for errors listed in the dictionary. If a common error is found, the error field is stored as a recognition result and corrected, and fields already recognized through the common error dictionary are not used in subsequent calculation. The common error dictionary is shown in FIG. 2.
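The dictionary lookup can be sketched as follows, assuming the dictionary maps each known wrong string to its correct form. The actual layout of the dictionary in FIG. 2 is not reproduced here, and the entry used in the usage note is made up.

```python
def correct_common_errors(sentence: str, errors: dict):
    """Return the corrected sentence and the list of (wrong, correct) hits."""
    hits = []
    # Check longer entries first so a short entry cannot pre-empt a longer one.
    for wrong in sorted(errors, key=len, reverse=True):
        if wrong in sentence:
            hits.append((wrong, errors[wrong]))
            sentence = sentence.replace(wrong, errors[wrong])
    return sentence, hits
```

For example, with the hypothetical entry `{"张山": "张三"}`, the sentence `"会议由张山主持"` is returned as `"会议由张三主持"` together with the hit `("张山", "张三")`.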

3) Recognizing person names in a sentence: person names are recognized by combining the HanLP word segmentation tool with a Double-LSTM boundary recognition model. The boundary recognition model identifies entity start boundaries; its role is to improve name recognition and to avoid the failure of word segmentation to recognize a name correctly when the name is miswritten.

First, a Double-LSTM model is used to recognize name boundaries in each input sequence: each word in the sequence is traversed, the sentence is divided into a left clause and a right clause centered on the current word, the two clauses are encoded by two different LSTM models, and the encoded vectors are input into a fully connected layer that performs binary classification, judging whether the current word is an entity start boundary. The Double-LSTM model is shown in FIG. 3.
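In symbols, the boundary classifier described above could be written as follows. The notation is an assumption introduced for illustration: x_i is the embedding of the i-th word of an n-word sentence, h^L_i and h^R_i are the final states of the two LSTMs, and W and b are the parameters of the fully connected layer. The use of a softmax and the reading direction of the right-hand LSTM are likewise assumed; the text specifies only two LSTMs, a fully connected layer, and a binary decision.

```latex
\begin{aligned}
h^{L}_{i} &= \mathrm{LSTM}_{L}(x_{1}, \dots, x_{i-1}) \\
h^{R}_{i} &= \mathrm{LSTM}_{R}(x_{i+1}, \dots, x_{n}) \\
p_{i} &= \mathrm{softmax}\bigl(W\,[h^{L}_{i};\, h^{R}_{i}] + b\bigr) \in \mathbb{R}^{2}
\end{aligned}
```

Because x_i itself enters neither encoder, a miswritten current word does not affect the evidence used to decide whether position i starts an entity.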

Then, according to the boundary recognition result, the 2-grams and 3-grams beginning at each boundary are taken as candidate names and compared with the specific person names in the knowledge base; candidates whose edit distance is 1 or 2 are kept as suspected specific person names. The suspected specific person names are added to the word segmentation dictionary, the sentences are segmented, and person names are recognized by the part-of-speech tag nr.
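The candidate filter can be sketched with a plain Levenshtein distance. The text does not say which edit distance variant is used, so the standard insert/delete/substitute form is assumed, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suspected_names(candidates, kb_names):
    """Keep candidates within edit distance 1 or 2 of some knowledge-base name."""
    return [c for c in candidates
            if any(1 <= edit_distance(c, name) <= 2 for name in kb_names)]
```

The kept candidates would then be added to the segmenter's user dictionary before the sentence is segmented again; that step depends on the segmentation tool and is not shown.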

4) Recognizing titles in the sentence and assigning each title to a person name by distance, to obtain the title features. Titles are identified by the part-of-speech tag nnt after word segmentation, and the person name closest to the title (except the number of words and the number of words) in the title's context is taken as the name to which the title belongs.
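The nearest-name assignment can be sketched as follows, assuming the segmenter returns (word, tag) pairs in which nr marks person names and nnt marks titles. Distance is measured here in tokens, and ties go to the earlier name; both are assumptions, as is the function name.

```python
def assign_titles(tokens):
    """Map each person name (tag 'nr') to the set of titles (tag 'nnt') nearest to it."""
    name_positions = [i for i, (_, tag) in enumerate(tokens) if tag == "nr"]
    titles_by_name = {}
    for i, (word, tag) in enumerate(tokens):
        if tag != "nnt" or not name_positions:
            continue
        # min() returns the first minimum, so ties go to the earlier name.
        nearest = min(name_positions, key=lambda p: abs(p - i))
        titles_by_name.setdefault(tokens[nearest][0], set()).add(word)
    return titles_by_name
```

The resulting title sets are the auxiliary information that the later similarity calculation compares with the knowledge base.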

5) Extracting the pinyin features and wubi (five-stroke) features of the name strings: for each name string, the pinyin features are extracted (the pinyin of each character, with flat-tongue and retroflex initials unified as flat-tongue initials and lateral and nasal initials unified as lateral initials), together with the wubi features (the wubi code of each character).

6) Judging whether the person name is the same as a person name in the knowledge base: if the recognized person name is a specific person name in the knowledge base, it is stored in the "text specific person name set"; otherwise, it is stored in the "suspected wrong person name set".

7) Calculating person name similarity I between the suspected wrong person names and the specific person names in the text: the similarity between each name in the suspected wrong person name set and each name in the text specific person name set is calculated. Person name similarity I consists of two parts: name spelling similarity I and title similarity I.

Name spelling similarity I: the edit distances between the two names are calculated separately for the character strings, the pinyin, and the wubi codes, where the pinyin/wubi edit distance is the average of the edit distances between the pinyin/wubi codes of each character. The weighted average of the three edit distances is taken as the composite distance and compared with a given threshold: if the composite distance is smaller than the threshold, name spelling similarity I = threshold - composite distance; otherwise, name spelling similarity I = 0. The threshold can be set according to the specific application.

Title similarity I = (number of elements in the intersection of the title set of the current person name and the title set of the specific person name) / (number of elements in the title set of the current person name). If the title set of the current person name is not empty but the intersection is empty, title similarity I is negative; in this embodiment its value is -0.2.
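The title term and its combination with the spelling term can be sketched as follows. The -0.2 penalty is the value given in this embodiment; returning 0 when no titles were found for the current name is an assumption made to avoid division by zero, since the text does not cover that case.

```python
def title_similarity(current_titles: set, reference_titles: set,
                     penalty: float = -0.2) -> float:
    """Share of the current name's titles that the reference name also has."""
    if not current_titles:
        return 0.0  # assumption: no titles found contributes nothing
    overlap = current_titles & reference_titles
    if not overlap:
        return penalty  # titles were found but none of them match
    return len(overlap) / len(current_titles)


def name_similarity(spelling_sim: float, title_sim: float) -> float:
    """Person name similarity as the sum of the spelling and title terms."""
    return spelling_sim + title_sim
```

The negative value lets a clear title mismatch pull down the total for a name whose spelling is otherwise close.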

8) Calculating person name similarity II of suspected wrong person names and knowledge base person names:

Title similarity II = (number of elements in the intersection of the title set of the suspected wrong person name and the title set of the specific person name in the knowledge base) / (number of elements in the title set of the suspected wrong person name). If the title set of the suspected wrong person name is not empty but the intersection is empty, title similarity II is negative; in this embodiment its value is -0.2.

Example 2

In this embodiment, a specific passage is taken as an example and the method of embodiment 1 is used to correct its information, with the following steps:

1) Information such as the names and titles of specific persons is extracted from various web pages and assembled into a specific personnel information knowledge base.

2) Words that are commonly miswritten in information about specific persons on the web are collected to form the common error dictionary.

3) A text to be checked is input, and the method recognizes and corrects its errors to produce a result. An example of the input and output is as follows:

a) Input sample

{ "docId": "9", "title": ": university friend new spring communication meeting holding", "text": "educational foundation message: day 14 of 1 month college friend new spring communication is held. The recent 30 schools are the stock school for a long time and care education, and the alumni representative with the development and the contribution of each business of the school gathers together with the leader of the school and the responsible person of the relevant functional department of the school, and together with the new spring of congratulation, the alumni representative contributes to the future development and the contribution of the school. The alumni such as the fuhua international group president Zhao Yong, the blue cursor spreading group director and chief executive officer Zhao Wenquan, the eastern cambridge educational group president Yu Yue and the like are commonly attended with university auxiliary school, educational foundation auxiliary school, auxiliary office Wang Bo, alumni office, alumni conference auxiliary meeting and secretary Li Wensheng, the nostalgic scientific city school area raising office main principal Li Hang, industry party work principal auxiliary book, asset management limited company president Wei Junmin, party work office auxiliary principal, educational foundation auxiliary secret, geng Shu, zhao Lin and the like. The communication will be hosted by the educational foundation secretary Li Ningyu. "}

b) Recognition result:

{ "sense": [ "Fuhua International group President Zhao Yong, blue cursor propagation group board length and head executive officer Zhao Wenquan" ], "correct": "Zhao Wenquan", "wrong": "Zhao Wenquan" }

{ "sense": [ "communication will be hosted by the educational foundation secretary Li Ningyu" ], "correct": "Li Yuning", "wrong": "Li Ningyu" }

By adopting the technical scheme disclosed by the invention, the following beneficial effects are obtained:

The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to be covered by the present invention.

Claims

1. The specific personnel information error correction method based on the knowledge base is characterized by comprising the following steps:

S3, recognizing the target text to obtain person names and/or title information;

S4, comparing the recognized person names with the person names in the person name information base and in the knowledge base, and calculating the person name similarity;

S5, judging whether the name-related information in the target text is correct, correcting erroneous items, and adding the correct information to the text name information base;

the text recognition in step S3 is performed in the following manner:

S3.1, using the HanLP tool to assist a Double-LSTM boundary recognition model in recognizing person names and title information in sentences;

S3.2, extracting the pinyin features and wubi (five-stroke) features of the name strings;

step S3.1 specifically includes:

1) Traversing each word in the target text, and dividing the sentence into a left clause and a right clause by taking the current word as the center;

5) Identifying titles by the part-of-speech tag nnt after word segmentation, and taking the person name closest to each title in the title's context as the name to which the title belongs;

the step S4 specifically includes:

S4.2, calculating person name similarity I between the suspected wrong person name and the specific person names in the text; when person name similarity I is greater than the threshold, correcting the suspected wrong name with the specific person name; otherwise, proceeding to step S4.3;

S4.3, calculating person name similarity II between the suspected wrong person name and the knowledge base names, and judging whether person name similarity II is greater than a threshold; if so, correcting the name with the knowledge base name; otherwise, judging that the name does not need correction;

the step S4.2 specifically comprises the following steps:

person name similarity I calculation: person name similarity I = name spelling similarity I + title similarity I; name spelling similarity I and title similarity I are calculated as follows;

name spelling similarity I calculation: calculating the edit distances between the specific person name and the suspected wrong person name separately for the character strings, the pinyin, and the wubi codes, and taking the weighted average of the three edit distances as the composite distance; comparing the composite distance with a given threshold: if it is smaller than the threshold, name spelling similarity I = threshold - composite distance, otherwise name spelling similarity I = 0; the threshold can be set according to the specific application;

title similarity I calculation: title similarity I = (number of elements in the intersection of the title set of the current person name and the title set of the specific person name in the knowledge base) / (number of elements in the title set of the current person name); if the title set of the current person name is not empty but the intersection is empty, title similarity I is negative;

the step S4.3 specifically comprises:

name spelling similarity II calculation: calculating the edit distances between the specific person name in the knowledge base and the suspected wrong person name separately for the character strings, the pinyin, and the wubi codes, and taking the weighted average of the three edit distances as the composite distance; comparing the composite distance with a given threshold: if it is smaller than the threshold, name spelling similarity II = threshold - composite distance, otherwise name spelling similarity II = 0;

2. The knowledge base-based specific personnel information error correction method according to claim 1, wherein the preprocessing in step S1 comprises sentence splitting, in which the text is split into sentences at the sentence-delimiter characters it contains; the method takes sentences as its sequence unit of calculation.

3. The knowledge base-based specific personnel information error correction method according to claim 1, wherein step S2 specifically comprises: taking sentences as the unit of calculation, checking by string matching whether each text sequence contains an error listed in the common error dictionary; if so, storing the error field as a recognition result and correcting it; if no common error is found, proceeding to step S3.

4. The knowledge base based personnel specific information error correction method according to claim 1, wherein step S3.2 specifically comprises: