CN115309902A

CN115309902A - Automatic credibility labeling method for knowledge corpora of a government hotline

Info

Publication number: CN115309902A
Application number: CN202211002539.1A
Authority: CN
Inventors: 高永超; 张单
Original assignee: Shandong Computer Science Center National Super Computing Center in Jinan
Current assignee: Shandong Computer Science Center National Super Computing Center in Jinan
Priority date: 2022-08-22
Filing date: 2022-08-22
Publication date: 2022-11-08

Abstract

The invention relates to the technical field of data processing and discloses a method for automatically labeling the credibility of knowledge corpora for a government hotline, comprising the following steps: S1, preparing policy documents; S2, structuring the policy documents; S3, collecting knowledge information; S4, processing the information data; S5, processing user scores; S6, performing sentiment analysis on user comments; S7, manually labeling a small sample; S8, training a credibility classifier model; S9, optimizing the model; and S10, generating automatic labels. Because only a small number of manual judgments are needed as the training data set, the method can label the credibility of massive amounts of knowledge automatically and efficiently; it is more efficient and less error-prone, it removes the tedious work of manual labeling, and it improves the accuracy of credibility evaluation of knowledge corpora; knowledge that has no user evaluation can also be evaluated, so there are fewer restrictions on what can be evaluated; and the method can be applied to any government hotline system and therefore has a wide range of application.

Description

Automatic credibility labeling method for knowledge corpora of a government hotline

Technical Field

The invention relates to the technical field of data processing, and in particular to a method for automatically labeling the credibility of knowledge corpora for a government hotline.

Background

Updating the knowledge graph and knowledge base of a government hotline requires knowing how credible the existing knowledge is: in an incremental update, newly added knowledge needs to be associated with highly credible knowledge, and in a full update, knowledge of low credibility needs to be deleted. Keeping the knowledge highly credible allows the government hotline to provide better service. The knowledge in the hotline's knowledge graph system and knowledge base therefore needs credibility labels, so that it can serve as the existing corpus on which subsequent updates rely.

At present there are two common ways of labeling the credibility of knowledge. One is fully manual: every item is judged and labeled by hand, which is inefficient. The other judges credibility from the user comments in the knowledge graph system or knowledge base; it applies only to knowledge that users have commented on, so its range of application is narrow.

Disclosure of Invention

The invention aims to provide a method for automatically labeling the credibility of knowledge corpora for a government hotline, so as to solve the problems described in the background.

To achieve this purpose, the invention provides the following technical solution:

A method for automatically labeling the credibility of knowledge corpora for a government hotline comprises the following steps:

S1, preparing policy documents: for the public service information consultation provided by the government hotline, preparing the current policies and regulations, department responsibilities, business matters and other public service information;

S2, structuring the policy documents: structuring each document according to its file name, release date, execution deadline, issuing department, jurisdiction of effect and document text;

S3, knowledge information collection: collecting the generation time, revision count, access count, user evaluations and user scores of each piece of knowledge in the government hotline's knowledge graph and knowledge base system;

S4, information data processing: normalizing the generation time, revision count and access count of each piece of knowledge;

S5, user score processing: calculating the average of the user scores of each piece of knowledge and normalizing it;

S6, sentiment analysis of user comments: for knowledge corpus entries that have user comments, by default the first 100 Chinese characters of each comment are taken as the sentiment analysis input, and if a comment has fewer than 100 Chinese characters, all of them are taken; sentiment analysis is performed on each comment based on the HowNet sentiment lexicon to obtain a sentiment score, and the sentiment score is then processed;

S7, manual labeling of a small sample: randomly extracting 500 knowledge corpus entries from the government hotline system for manual credibility assessment and labeling, an entry being labeled 1 when it is assessed as credible and 0 when it is assessed as not credible; the assessment checks whether the file name cited in the entry is correct, whether the knowledge falls within the interval between the document's release date and its execution deadline, whether the departments involved in the knowledge are consistent with the document content, whether the jurisdiction to which the knowledge applies is consistent with the document content, whether the classification of the knowledge in the service list is correct, whether the knowledge leaks personal privacy or other confidential information, and whether the logic of the knowledge is consistent with the document text; if any checked item is in error, the credibility assessment is labeled 0;

S8, training the credibility classifier model: using a logistic regression algorithm, the normalized generation time, revision count, access count, user evaluation and user score of each piece of knowledge obtained in steps S4, S5 and S6 are taken as the input tensor of the logistic regression model and, together with the manually assigned labels, used to train the credibility classifier model;

S9, model optimization: optimizing the sigmoid activation function in the logistic regression model by adjusting the regularization coefficient, so as to improve the confusion matrix results obtained when training the credibility classifier model;

S10, automatic label generation: the normalized generation time, revision count, access count, user evaluation and user score of all collected and processed knowledge are input as tensors into the credibility classifier model, which automatically infers the credibility value of each piece of knowledge, thereby labeling the knowledge corpus automatically; the credibility value lies in the interval [0, 1].
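The input tensor described in steps S8 and S10 can be pictured as a fixed-order vector of five normalized features. The following is a minimal Python sketch, not the patent's implementation: the class name `KnowledgeFeatures`, its field names and the feature order are illustrative assumptions. Filling a missing feature with 0 follows the defaults the method gives for missing data; applying the same default to the user evaluation feature is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KnowledgeFeatures:
    """Normalized features of one knowledge entry (names are illustrative)."""
    generation_time: Optional[float] = None   # from S4
    revision_count: Optional[float] = None    # from S4
    access_count: Optional[float] = None      # from S4
    user_evaluation: Optional[float] = None   # from S6 (sentiment of comments)
    user_score: Optional[float] = None        # from S5

    def to_tensor(self) -> List[float]:
        """Return the five features in a fixed order, using 0 for missing data."""
        values = [
            self.generation_time,
            self.revision_count,
            self.access_count,
            self.user_evaluation,
            self.user_score,
        ]
        return [0.0 if v is None else float(v) for v in values]


full = KnowledgeFeatures(0.7, 0.9, 0.5, 1.0, 0.8)
no_feedback = KnowledgeFeatures(0.7, 0.9, 0.5)  # no comments, no scores
print(full.to_tensor())         # [0.7, 0.9, 0.5, 1.0, 0.8]
print(no_feedback.to_tensor())  # [0.7, 0.9, 0.5, 0.0, 0.0]
```

Keeping the order fixed matters because the classifier learns one weight per position.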

As a still further scheme of the invention: the formula of the normalization processing in step S4 is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

In formula (1), $x'$ is the value after normalization, $x$ is the value before normalization, and $x_{\min}$ and $x_{\max}$ are respectively the minimum and maximum values in each type of data; if a piece of knowledge has no relevant data, $x'$ defaults to 0.
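The S4 normalization can be sketched in a few lines of Python, assuming formula (1) is the usual min-max scaling, which its minimum/maximum definitions and the embodiment's worked example suggest. The function name `min_max_normalize` is hypothetical, and the handling of the case where all values are equal is an assumption, since the text does not cover it.

```python
from typing import Optional


def min_max_normalize(value: Optional[float], minimum: float, maximum: float) -> float:
    """Scale `value` into [0, 1] relative to the observed minimum and maximum."""
    if value is None:
        # No relevant data for this knowledge entry: default to 0.
        return 0.0
    if maximum == minimum:
        # Assumption: the text does not say what to do when all values are equal.
        return 0.0
    return (value - minimum) / (maximum - minimum)


# An entry accessed 80 times, where the least- and most-accessed entries
# were accessed 20 and 140 times respectively.
print(min_max_normalize(80, 20, 140))  # 0.5
```

To apply the same function to the generation time, the timestamp would first have to be converted to a number (for example seconds since an epoch); that conversion is not specified in the text.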

As a still further scheme of the invention: the formula of the normalization processing in step S5 is as follows:

$$s' = \frac{\sum_{i=1}^{n} s_i}{n \cdot s_{\mathrm{full}}} \tag{2}$$

where $s'$ is the normalized value of the average of all user scores for a piece of knowledge, $s_i$ is the score given by the $i$-th user, $n$ is the total number of times users have scored the knowledge, and $s_{\mathrm{full}}$ is the full-score value; if a piece of knowledge has no user score, $s'$ defaults to 0.
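The S5 score normalization divides the sum of the user scores by the number of scores times the full-score value. A minimal Python sketch follows; the function name `normalize_user_score` and its parameter names are hypothetical.

```python
from typing import Sequence


def normalize_user_score(scores: Sequence[float], full_score: float) -> float:
    """Average user score divided by the full score; 0 if nobody has scored."""
    if not scores:
        return 0.0
    return sum(scores) / (len(scores) * full_score)


# Twelve scores that sum to 96 on a 10-point scale.
scores = [8] * 12
print(normalize_user_score(scores, full_score=10))  # 0.8
```

The numbers in the usage line match the worked example in the detailed description (sum 96, 12 scores, full score 10).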

As a still further scheme of the invention: the sentiment analysis formula in step S6 is as follows:

$$E = \sum_{i=1}^{m} e_i \tag{3}$$

In formula (3), $E$ is the sentiment score: when $E$ is greater than 0, the user comment is a positive evaluation; when $E$ equals 0, the user comment is a neutral evaluation; when $E$ is less than 0, the user comment is a negative evaluation. $e_i$ is the sentiment score of the $i$-th Chinese character, which is 1 when the sentiment lexicon classifies the character's emotion as 'good' or 'happy', 0 when it classifies it as 'surprise', and -1 for all remaining emotion categories. $m$ is the number of Chinese characters input.

The sentiment score processing formula in step S6 is formula (4) [expression not preserved in this text], in which $E'$ is the normalized sentiment analysis value of the user comment and $E$ is the sentiment score.
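The per-character scoring of step S6 can be illustrated with a short Python sketch. Everything named here is an assumption for illustration: `TOY_LEXICON` is a made-up four-character lexicon standing in for the real sentiment lexicon, the category labels are English placeholders, and treating characters absent from the lexicon as scoring 0 is an assumption the text does not state. Only the score itself is computed; the subsequent processing by formula (4) is not sketched.

```python
from typing import Dict

# Hypothetical mini-lexicon mapping a character to an emotion category.
# A real system would load the categories from the sentiment lexicon it uses.
TOY_LEXICON: Dict[str, str] = {
    "谢": "good",
    "便": "good",
    "省": "happy",
    "难": "bad",
}

# 'good' and 'happy' score 1, 'surprise' scores 0, every other category scores -1.
CATEGORY_SCORE = {"good": 1, "happy": 1, "surprise": 0}


def is_chinese_char(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def sentiment_score(comment: str, lexicon: Dict[str, str], limit: int = 100) -> int:
    """Sum per-character scores over the first `limit` Chinese characters."""
    chars = [ch for ch in comment if is_chinese_char(ch)][:limit]
    total = 0
    for ch in chars:
        # Assumption: characters missing from the lexicon are scored 0.
        category = lexicon.get(ch, "surprise")
        total += CATEGORY_SCORE.get(category, -1)
    return total


print(sentiment_score("谢谢，很方便！难", TOY_LEXICON))  # 2
```

Punctuation is dropped before the 100-character limit is applied, so only Chinese characters count towards it.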

As a still further scheme of the invention: the manual credibility assessment in step S7 is labeled with reference to the current policies and regulations, administrative functions and duties, work procedures and other public service information for which document structuring has been completed.
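The labeling rule of step S7 (label 0 as soon as any checked item is in error, otherwise 1) can be written down as a small Python sketch. The class `ManualChecks` and its field names are hypothetical and simply mirror the seven checks listed in S7; the judgments themselves are made by a human reviewer.

```python
from dataclasses import dataclass, fields


@dataclass
class ManualChecks:
    """Outcome of each S7 check for one entry (field names are illustrative)."""
    file_name_correct: bool
    within_validity_period: bool
    departments_match: bool
    jurisdiction_matches: bool
    classification_correct: bool
    no_confidential_leak: bool
    logic_matches_text: bool


def credibility_label(checks: ManualChecks) -> int:
    """Return 1 only if every check passed, otherwise 0."""
    return int(all(getattr(checks, f.name) for f in fields(checks)))


ok = ManualChecks(True, True, True, True, True, True, True)
bad = ManualChecks(True, False, True, True, True, True, True)
print(credibility_label(ok), credibility_label(bad))  # 1 0
```

How a reviewer should record a check that cannot be verified is not fixed by this sketch; a boolean per check is the simplest assumption.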

As a still further scheme of the invention: the sigmoid activation function optimization method in step S9 is as follows:

S91, creating a list of regularization coefficients [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9];

S92, training the model with each coefficient in the list in turn and identifying the coefficients that give a high recall;

S93, selecting the coefficient with the highest recall as the optimal parameter of the logistic regression model.
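Steps S91 to S93 amount to a small grid search over the regularization coefficient, scored by recall. The sketch below is one way to illustrate that in plain Python and is not the patent's implementation: the gradient-descent trainer, the L2 form of the penalty, the learning rate, the epoch count and all function names are assumptions, as is evaluating recall on a separate validation split.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(X: Sequence[Vector], y: Sequence[int], coeff: float,
          lr: float = 0.5, epochs: int = 500) -> List[float]:
    """Fit weights (bias first) by gradient descent with an L2 penalty `coeff`."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w[0] -= lr * grad[0] / n  # the bias is not penalized
        for j in range(1, len(w)):
            w[j] -= lr * (grad[j] / n + coeff * w[j])
    return w


def predict(w: Sequence[float], x: Vector) -> int:
    return int(sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))) >= 0.5)


def recall(w: Sequence[float], X: Sequence[Vector], y: Sequence[int]) -> float:
    """Share of truly credible entries (label 1) that the model labels 1."""
    positives = [x for x, label in zip(X, y) if label == 1]
    if not positives:
        return 0.0
    return sum(predict(w, x) for x in positives) / len(positives)


def select_coefficient(X_train: Sequence[Vector], y_train: Sequence[int],
                       X_val: Sequence[Vector], y_val: Sequence[int],
                       coefficients: Sequence[float]) -> Tuple[float, float]:
    """Return (best coefficient, its recall); ties keep the earlier coefficient."""
    best = (coefficients[0], -1.0)
    for c in coefficients:
        r = recall(train(X_train, y_train, c), X_val, y_val)
        if r > best[1]:
            best = (c, r)
    return best


COEFFICIENTS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
```

When a library is used instead, the meaning of the list values depends on its convention: in scikit-learn's `LogisticRegression`, for instance, the parameter `C` is the inverse of the regularization strength, so larger values mean weaker regularization.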

Compared with the prior art, the invention has the following beneficial effects:

Only a small number of manual judgments are needed as the training data set for the model, after which the credibility of massive amounts of knowledge can be labeled automatically and efficiently; efficiency is higher, the error rate is lower, the tedious work of manual labeling is removed, and the accuracy of credibility evaluation of knowledge corpora is improved.

The method considers five features: generation time, revision count, access count, user evaluation and user score. User evaluation is only one of these features, so when it is missing the overall influence on the evaluation result is small; knowledge without user evaluation can therefore also be evaluated, and there are fewer restrictions on what can be evaluated.

Because the method concerns only the knowledge corpus and its related information, it is not affected by the system platform, database type, interfaces and the like; it can be applied to any government hotline system and has a wide range of application.

The credibility classifier model is trained and optimized with a logistic regression model, and its performance can improve as the training data set grows; after incremental training of the model, the accuracy of automatic labeling can continue to improve.

Drawings

FIG. 1 is a flow diagram of the method for automatically labeling the credibility of knowledge corpora for a government hotline;

FIG. 2 is a schematic flow diagram of model training and optimization in the method for automatically labeling the credibility of knowledge corpora for a government hotline.

Detailed Description

Referring to FIGS. 1-2, in an embodiment of the present invention, a method for automatically labeling the credibility of knowledge corpora for a government hotline includes the following steps:

S1, preparing policy documents: for the public service information consultation provided by the government hotline, preparing the current policies and regulations, department responsibilities, business matters and other public service information; for example:

policies and regulations include laws, administrative regulations, judicial interpretations, local regulations, department rules and other normative documents;

department responsibilities include institutional functions, scope of service and internal organization;

business matters include administrative law enforcement, criminal investigation and approval, and public services;

other public service information includes convenience solutions and explanations of specialized terms used in government affairs work;

S3, knowledge information collection: collecting the generation time, revision count, access count, user evaluations and user scores of each piece of knowledge in the government hotline's knowledge graph and knowledge base system; wherein:

the generation time is the time at which the knowledge corpus entry was stored in the knowledge base;

the revision count is the number of times each knowledge corpus entry has been revised, which can be obtained from the knowledge base log file;

the access count is the number of times each knowledge corpus entry has been accessed, which can be obtained from the log files of the business systems supported by the knowledge base;

user evaluations and user scores can be obtained from the evaluation and score fields corresponding to each corpus entry in the knowledge base;

S4, information data processing: normalizing the generation time, revision count and access count of each piece of knowledge; the formula of the normalization processing is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

In formula (1), $x'$ is the value after normalization, $x$ is the value before normalization, and $x_{\min}$ and $x_{\max}$ are respectively the minimum and maximum values in each type of data; if a piece of knowledge has no relevant data, $x'$ defaults to 0. For example: when knowledge corpus entry A has been accessed 80 times, the most frequently accessed entry B in the knowledge base has been accessed 140 times and the least frequently accessed entry C has been accessed 20 times, the normalized access count of A is 0.5 (the normalized value has no unit), calculated as follows:

$$x'_A = \frac{80 - 20}{140 - 20} = 0.5$$

S5, user score processing: calculating the average of the user scores of each piece of knowledge and normalizing it; the formula of the normalization processing is as follows:

$$s' = \frac{\sum_{i=1}^{n} s_i}{n \cdot s_{\mathrm{full}}} \tag{2}$$

where $s'$ is the normalized value of the average of all user scores for a piece of knowledge, $s_i$ is the score given by the $i$-th user, $n$ is the total number of times users have scored the knowledge, and $s_{\mathrm{full}}$ is the full-score value; if a piece of knowledge has no user score, $s'$ defaults to 0. For example: when knowledge corpus entry A has been scored 12 times in total, the sum of the scores, 96, is the numerator; the number of scores, 12, multiplied by the full-score value, 10, gives the denominator 12 × 10 = 120; and the normalized user score is 0.8, calculated as follows:

$$s'_A = \frac{96}{12 \times 10} = 0.8$$

S6, sentiment analysis of user comments: for knowledge corpus entries that have user comments, by default the first 100 Chinese characters of each comment (punctuation marks are not counted as Chinese characters) are taken as the sentiment analysis input, and if a comment has fewer than 100 Chinese characters, all of them are taken; sentiment analysis is performed on each comment based on the HowNet sentiment lexicon to obtain a sentiment score, and the sentiment score is then processed; the sentiment analysis formula is as follows:

$$E = \sum_{i=1}^{m} e_i \tag{3}$$

In formula (3), $E$ is the sentiment score; $e_i$ is the sentiment score of the $i$-th Chinese character, which is 1 when the sentiment lexicon classifies the character's emotion as 'good' or 'happy', 0 when it classifies it as 'surprise', and -1 for the remaining categories 'grief', 'anger', 'fear' and 'badness'; $m$ is the number of Chinese characters input. For example, consider the following user comment:

"Thanks to the convenient platform provided by the government hotline, I could easily find a way to get my question answered; I was not clear about how to handle the cross-region transfer of my housing provident fund, and through the platform I found the specific procedure and the materials I needed to prepare, which saved me a great deal of time and energy; thank you once again!"

The sentiment score of this comment is calculated as follows: characters in the 'good' or 'happy' categories appear 5 times and contribute 5; characters in the 'surprise' category appear 47 times and contribute 0; characters in the remaining categories appear once and contribute -1; the sentiment score of this example comment is therefore 4:

$$E = 5 \times 1 + 47 \times 0 + 1 \times (-1) = 4$$

The sentiment score processing formula is formula (4) [expression not preserved in this text], in which $E'$ is the normalized sentiment analysis value of the user comment and $E$ is the sentiment score;

when the sentiment score $E$ is 4, the normalized sentiment analysis value $E'$ obtained from formula (4) is 1;

S7, manual labeling of a small sample: randomly extracting 500 knowledge corpus entries from the government hotline system for manual credibility assessment and labeling, an entry being labeled 1 when it is assessed as credible and 0 when it is assessed as not credible; with reference to the current policies and regulations, administrative functions and duties, work procedures and other public service information for which document structuring has been completed, the manual labeling checks whether the file name cited in the entry is correct, whether the knowledge falls within the interval between the document's release date and its execution deadline, whether the departments involved are consistent with the document content, whether the jurisdiction to which the knowledge applies is consistent with the document content, whether the classification of the knowledge in the service list is correct, whether the knowledge leaks personal privacy or other confidential information, and whether the logic of the knowledge is consistent with the document text; if any checked item is in error, the credibility assessment is labeled 0; because the method always concerns only the knowledge corpus and its related information, it is not affected by the system platform, database type, interfaces and the like, and can be applied to any government hotline system; corpus example:

"According to Document No. 8 [2022], 'Notice of the XX City People's Government Office on forwarding the notice of the X Administration Office on stabilizing employment for groups such as college graduates', issued by the XX City Government Office on 27 May 2022, the 'Employment Registration Certificate for Undergraduate and Junior College Graduates of Regular Institutions of Higher Education' and the 'Employment Registration Certificate for Graduate Students Nationwide' will no longer be issued from 2023."

Manual verification:

manual labeling needs to check the date information involved in the knowledge corpus entry, namely the release date of the notice and the specified start date, verifying that the date information is correct and that the entry is within its validity period; the release date is 27 May 2022 and execution starts from 2023, so the date information is correct;

manual labeling needs to check whether the policy-issuing department involved in the entry is correct; the policy was issued by the XX City Government Office, so the information is correct;

manual labeling needs to check whether the policy file name involved in the entry is correct; the file name is correct;

manual labeling needs to check whether the objects and jurisdiction to which the entry applies are consistent with the policy document; the entry contains no such information, so this cannot be verified;

manual labeling needs to check whether the knowledge type of the entry is accurate; (omitted);

manual labeling needs to check whether the policy text information involved in the entry is correct and complete and whether privacy is leaked; the information is correct;

Labeling result:

following the above verification, the knowledge corpus entry is manually labeled 1;

S8, training the credibility classifier model: because the generation time, revision count, access count, user evaluation and user score of each piece of knowledge are not strongly correlated with one another, a logistic regression algorithm is used: the normalized results of these five features obtained in steps S4, S5 and S6 are taken as the input tensor of the logistic regression model and, together with the manually assigned labels, used to train the credibility classifier model; user evaluation is only one of the features, so when it is missing the overall influence on the evaluation result is small, knowledge without user evaluation can therefore also be evaluated, and there are fewer restrictions on what can be evaluated; an example input tensor for one knowledge corpus entry is [0.7, 0.9, 0.5, 1, 0.8];

S10, automatic label generation: the normalized generation time, revision count, access count, user evaluation and user score of all collected and processed knowledge are input as tensors into the credibility classifier model, which automatically infers the credibility value of each piece of knowledge, thereby labeling the knowledge corpus automatically; the credibility value lies in the interval [0, 1], and the closer it is to 1, the higher the credibility; because the final credibility label is a value in [0, 1], users of the knowledge can apply a custom threshold, sort by credibility value, and so on, to obtain the credible knowledge corpus they most want.
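The custom threshold screening and sorting mentioned in step S10 might look like the following Python sketch. The function name `screen_and_rank`, the entry identifiers and the credibility values are all hypothetical.

```python
from typing import Dict, List, Tuple


def screen_and_rank(credibility: Dict[str, float], threshold: float) -> List[Tuple[str, float]]:
    """Keep entries whose credibility is at least `threshold`, most credible first."""
    kept = [(kid, value) for kid, value in credibility.items() if value >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical credibility values produced by the classifier.
labelled = {"k1": 0.92, "k2": 0.35, "k3": 0.78}
print(screen_and_rank(labelled, threshold=0.7))  # [('k1', 0.92), ('k3', 0.78)]
```

Because the label is a continuous value rather than a hard 0/1 decision, each consumer of the knowledge base can choose its own threshold.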

Preferably, the manual credibility assessment in step S7 is labeled with reference to the current policies and regulations, administrative functions and duties, work procedures and other public service information for which document structuring has been completed.

Preferably, the sigmoid activation function optimization method in the step S9 is as follows:

The above are only preferred embodiments of the present invention, and the scope of protection of the present invention is not limited thereto; any equivalent replacement or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A method for automatically labeling the credibility of knowledge corpora for a government hotline, characterized by comprising the following steps:

S3, knowledge information collection: collecting the generation time, revision count, access count, user evaluations and user scores of each piece of knowledge in the government hotline's knowledge graph and knowledge base system;

S4, information data processing: normalizing the generation time, revision count and access count of each piece of knowledge;

S6, sentiment analysis of user comments: for knowledge corpus entries that have user comments, by default the first 100 Chinese characters of each comment are taken as the sentiment analysis input, and if a comment has fewer than 100 Chinese characters, all of them are taken; sentiment analysis is performed on each comment based on the HowNet sentiment lexicon to obtain a sentiment score, and the sentiment score is then processed;

S7, manual labeling of a small sample: randomly extracting 500 knowledge corpus entries from the government hotline system for manual credibility assessment and labeling, an entry being labeled 1 when it is assessed as credible and 0 when it is assessed as not credible; the assessment checks whether the file name cited in the entry is correct, whether the knowledge falls within the interval between the document's release date and its execution deadline, whether the departments involved in the knowledge are consistent with the document content, whether the jurisdiction to which the knowledge applies is consistent with the document content, whether the classification of the knowledge in the service list is correct, whether the knowledge leaks personal privacy or other confidential information, and whether the logic of the knowledge is consistent with the document text; if any checked item is in error, the credibility assessment is labeled 0;

S10, automatic label generation: the normalized generation time, revision count, access count, user evaluation and user score of all collected and processed knowledge are input as tensors into the credibility classifier model, which automatically infers the credibility value of each piece of knowledge, thereby labeling the knowledge corpus automatically; the credibility value lies in the interval [0, 1].

2. The method for automatically labeling the credibility of knowledge corpora for a government hotline according to claim 1, wherein the formula of the normalization processing in step S4 is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

In formula (1), $x'$ is the value after normalization, $x$ is the value before normalization, and $x_{\min}$ and $x_{\max}$ are respectively the minimum and maximum values in each type of data; if a piece of knowledge has no relevant data, $x'$ defaults to 0.

3. The method for automatically labeling the credibility of knowledge corpora for a government hotline according to claim 1, wherein the formula of the normalization processing in step S5 is as follows:

$$s' = \frac{\sum_{i=1}^{n} s_i}{n \cdot s_{\mathrm{full}}} \tag{2}$$

where $s'$ is the normalized value of the average of all user scores for a piece of knowledge, $s_i$ is the score given by the $i$-th user, $n$ is the total number of times users have scored the knowledge, and $s_{\mathrm{full}}$ is the full-score value; if a piece of knowledge has no user score, $s'$ defaults to 0.

4. The method for automatically labeling the credibility of knowledge corpora for a government hotline according to claim 1, wherein the sentiment analysis formula in step S6 is as follows:

$$E = \sum_{i=1}^{m} e_i \tag{3}$$

In formula (3), $E$ is the sentiment score: when $E$ is greater than 0, the user comment is a positive evaluation; when $E$ equals 0, the user comment is a neutral evaluation; when $E$ is less than 0, the user comment is a negative evaluation. $e_i$ is the sentiment score of the $i$-th Chinese character, which is 1 when the sentiment lexicon classifies the character's emotion as 'good' or 'happy', 0 when it classifies it as 'surprise', and -1 for all remaining emotion categories. $m$ is the number of Chinese characters input.

The sentiment score processing formula in step S6 is formula (4) [expression not preserved in this text], in which $E'$ is the normalized sentiment analysis value of the user comment and $E$ is the sentiment score.

5. The method for automatically labeling the credibility of knowledge corpora for a government hotline according to claim 1, wherein the manual credibility assessment in step S7 is labeled with reference to the current policies and regulations, administrative functions and duties, work procedures and other public service information for which document structuring has been completed.

6. The method for automatically labeling the credible knowledge corpus facing the government hotline according to claim 1, wherein the sigmoid activation function optimization method in the step S9 is as follows: