CN115309902A - Credible knowledge corpus automatic labeling method facing government hotline - Google Patents


Info

Publication number
CN115309902A
Authority
CN
China
Prior art keywords
knowledge
credible
user
government
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211002539.1A
Other languages
Chinese (zh)
Inventor
高永超
张单
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202211002539.1A
Publication of CN115309902A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of data processing and discloses a method for automatically labeling credible knowledge corpora for a government hotline, comprising the following steps: S1, preparing policy documents; S2, structuring the policy documents; S3, collecting knowledge information; S4, processing the information data; S5, processing user scores; S6, analyzing the emotion of user comments; S7, manually labeling a small sample; S8, training a credibility classifier model; S9, optimizing the model; and S10, generating automatic labels. The method needs only a small number of manual judgments as the training data set for the model to achieve efficient automatic credibility labeling of massive knowledge; it is more efficient and less error-prone, relieves annotators of tedious manual labeling, and improves the accuracy of credibility evaluation of the knowledge corpus; knowledge without user evaluation can also be evaluated, so there are fewer restrictions on what can be evaluated; and the method can be applied to any government hotline system, giving it a wide range of application.

Description

Credible knowledge corpus automatic labeling method facing government hotline
Technical Field
The invention relates to the technical field of data processing, and in particular to a method for automatically labeling credible knowledge corpora for a government hotline.
Background
Updating the knowledge graph and knowledge base of a government hotline requires knowing how credible the existing knowledge is: incrementally updated knowledge must be associated with high-credibility knowledge, and low-credibility knowledge must be deleted during a full update. Keeping the knowledge highly credible allows the government hotline to provide better service. The knowledge in the hotline's knowledge graph system and knowledge base therefore needs credibility labels, so that it can serve as the existing corpus on which subsequent updates rely.
At present there are two common ways to label the credibility of knowledge. One is fully manual: every item is judged and then labeled by a person, which is inefficient. The other judges credibility from the user comments in the knowledge graph system or knowledge base; it works only for knowledge that users have commented on, so its range of application is narrow.
Disclosure of Invention
The invention aims to provide a method for automatically labeling credible knowledge corpora for a government hotline, so as to solve the problems described in the Background.
To achieve this purpose, the invention provides the following technical solution:
A method for automatically labeling credible knowledge corpora for a government hotline, comprising the following steps:
S1, preparing policy documents: for the public service information consultation provided by the government hotline, prepare the current policies and regulations, department responsibilities, business matters and other public service information;
S2, structuring the policy documents: structure each document according to its document name, release date, execution deadline, issuing department, jurisdiction of application and document text;
S3, collecting knowledge information: collect the generation time, number of revisions, number of accesses, user evaluation and user score of each piece of knowledge in the knowledge graph and knowledge base system of the government hotline;
S4, processing the information data: normalize the generation time, number of revisions and number of accesses of each piece of knowledge;
S5, processing user scores: calculate the average of the user scores of each piece of knowledge and normalize it;
S6, analyzing the emotion of user comments: for knowledge corpora with user comments, by default the first 100 Chinese characters of each comment are used as the input for emotion analysis; if a comment has fewer than 100 Chinese characters, all of them are used; emotion analysis is performed on each comment using the HowNet emotion analysis lexicon to obtain an emotion score, which is then processed;
S7, manually labeling a small sample: randomly select 500 knowledge corpora from the government hotline system for manual credibility assessment and labeling; a corpus assessed as credible is labeled 1, and a corpus assessed as not credible is labeled 0; the assessment checks whether the document name cited in the knowledge corpus is correct, whether the knowledge falls within the interval between the document's release date and its execution deadline, whether the departments involved in the knowledge match the document content, whether the jurisdiction of the knowledge matches the document content, whether the knowledge is correctly classified in the service list, whether the knowledge leaks personal privacy or other confidential information, and whether the logic of the knowledge is consistent with the document text; if any checked item is wrong, the credibility assessment is labeled 0;
S8, training a credibility classifier model: using a logistic regression algorithm, the normalized generation time, number of revisions, number of accesses, user evaluation and user score of each piece of knowledge from steps S4, S5 and S6 are taken as the input tensor of the logistic regression model, and the credibility classifier model is trained against the manually assigned labels;
S9, optimizing the model: the logistic regression model, with its sigmoid activation function, is optimized by adjusting the regularization coefficient, so as to improve the confusion matrix results obtained during training of the credibility classifier model;
S10, generating automatic labels: the normalized generation time, number of revisions, number of accesses, user evaluation and user score of all collected and processed knowledge are input into the credibility classifier model as tensors, and the model infers the credibility value of each piece of knowledge, achieving automatic credibility labeling of the knowledge corpus; the credibility value lies in the interval [0, 1].
As a still further scheme of the invention: the formula of the normalization processing in step S4 is as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$
In formula (1), $x'$ is the normalized value, $x$ is the value before normalization, and $x_{\min}$ and $x_{\max}$ are respectively the minimum and maximum values in each type of data; if the knowledge has no relevant data, $x'$ defaults to 0.
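The min-max normalization of step S4 can be sketched in Python as follows. This is only an illustrative sketch: the function name `min_max_normalize` is hypothetical, and returning 0 when all values are equal is an assumption, since the source does not cover that case. The numbers used are those of the access-count example in the detailed description (corpus A accessed 80 times, maximum 140, minimum 20).

```python
from typing import Optional, Sequence


def min_max_normalize(value: Optional[float], values: Sequence[float]) -> float:
    """Scale `value` into [0, 1] using the minimum and maximum of `values`."""
    if value is None or not values:
        return 0.0  # from the source: knowledge with no relevant data defaults to 0
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # assumption: the source does not say what to do when max == min
    return (value - lo) / (hi - lo)


# Access counts of corpora A, B and C from the worked example.
access_counts = [80, 140, 20]
normalized_a = min_max_normalize(80, access_counts)
print(normalized_a)  # 0.5
```

The same function would apply unchanged to revision counts, and to generation times once they are expressed as numbers (for instance as timestamps, which is again an assumption).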
As a still further scheme of the invention: the formula of the normalization processing in step S5 is as follows:
$$x' = \frac{\sum_{i=1}^{n} x_i}{n \cdot x_{\max}} \tag{2}$$
where $x'$ is the normalized average of all user scores for a given piece of knowledge, $x_i$ is the score given by the $i$-th user, $n$ is the total number of user scores, and $x_{\max}$ is the full-score value; if the knowledge has no user score, $x'$ defaults to 0.
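A minimal Python sketch of the score normalization of step S5 is given below. The function name `normalize_user_scores` is hypothetical. The totals match the example in the detailed description (12 scores summing to 96, full score 10); the individual scores of 8 are an assumption, because the source states only their sum.

```python
from typing import Sequence


def normalize_user_scores(scores: Sequence[float], full_score: float) -> float:
    """Return the average user score divided by the full-score value."""
    if not scores:
        return 0.0  # from the source: knowledge without user scores defaults to 0
    return sum(scores) / (len(scores) * full_score)


# Assumed individual scores: 12 scores of 8 give the sum of 96 used in the source.
example_scores = [8] * 12
normalized_score = normalize_user_scores(example_scores, full_score=10)
print(normalized_score)  # 0.8
```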
As a still further scheme of the invention: the emotion analysis formula in step S6 is as follows:
$$S = \sum_{i=1}^{n} x_i \tag{3}$$
In formula (3), $S$ is the emotion score: when the score is greater than 0, the user comment is a positive evaluation; when the score equals 0, the user comment is a neutral evaluation; when the score is less than 0, the user comment is a negative evaluation; $x_i$ is the emotion score of the $i$-th Chinese character according to the emotion analysis lexicon, which is 1 when the character's emotion category in the lexicon is 'good' or 'happy', 0 when the category is 'surprise', and -1 for all other categories; $n$ is the number of input Chinese characters;
The emotion score processing formula in step S6 is as follows:
[Formula (4): given as an image in the original publication]
In formula (4), $E$ is the normalized emotion analysis value of the user comment, and $S$ is the emotion score.
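The character-level scoring of formula (3) can be sketched in Python as follows. This is a sketch under several assumptions: `toy_lexicon` is a made-up stand-in and not the HowNet lexicon; Chinese characters are detected by the CJK Unified Ideographs range U+4E00 to U+9FFF; and characters missing from the lexicon contribute 0, which the source does not specify. The normalization of formula (4) is not sketched here.

```python
from typing import Mapping

# From the source: 'good' and 'happy' score 1, 'surprise' scores 0,
# and every other emotion category scores -1.
CATEGORY_SCORE = {"good": 1, "happy": 1, "surprise": 0}


def emotion_score(comment: str, lexicon: Mapping[str, str], limit: int = 100) -> int:
    """Sum per-character emotion scores over the first `limit` Chinese characters."""
    # Assumption: a "Chinese character" is a code point in U+4E00..U+9FFF,
    # which also drops punctuation such as full-width commas.
    chars = [c for c in comment if "\u4e00" <= c <= "\u9fff"][:limit]
    total = 0
    for c in chars:
        category = lexicon.get(c)
        if category is None:
            continue  # assumption: characters absent from the lexicon add nothing
        total += CATEGORY_SCORE.get(category, -1)
    return total


# Hypothetical lexicon entries for illustration only.
toy_lexicon = {"好": "good", "谢": "good", "乐": "happy", "惊": "surprise", "怒": "anger"}
print(emotion_score("谢谢，很好！", toy_lexicon))  # 3
```

A positive result corresponds to a positive evaluation, zero to a neutral one, and a negative result to a negative evaluation, as described for formula (3).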
As a still further scheme of the invention: the manual credibility assessment in step S7 is labeled with reference to the current policies and regulations, administrative responsibilities, work flows and other public service information for which document structuring has been completed.
As a still further scheme of the invention: the sigmoid activation function optimization method in step S9 is as follows:
S91, create a list of regularization coefficients [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9];
S92, train the model with each coefficient in the list in turn, and check which coefficients give a high recall rate;
S93, select the coefficient with the highest recall rate as the optimal parameter of the logistic regression model.
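Steps S91 to S93 amount to a small parameter search, which can be sketched in pure Python as follows. Everything beyond the coefficient list and the "highest recall" criterion is an assumption made for illustration: the regularization coefficient is treated as the weight of an L2 penalty, the model is fitted by plain batch gradient descent, recall is measured on a held-out split, and the data are synthetic stand-ins for the manually labeled corpora rather than real data.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

COEFFICIENTS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # list from step S91


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Credibility value in [0, 1] for one five-feature input vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train(X: List[List[float]], y: List[int], reg: float,
          epochs: int = 100, lr: float = 0.5) -> Tuple[List[float], float]:
    """Batch gradient descent on log-loss with an assumed L2 penalty weight `reg`."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        errors = [predict(w, b, xi) - yi for xi, yi in zip(X, y)]
        grad_w = [sum(e * xi[j] for e, xi in zip(errors, X)) / n + reg * w[j]
                  for j in range(len(w))]
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * sum(errors) / n
    return w, b


def recall(w: Sequence[float], b: float, X: List[List[float]], y: List[int]) -> float:
    """Share of truly credible samples (label 1) predicted as credible."""
    positives = [xi for xi, yi in zip(X, y) if yi == 1]
    if not positives:
        return 0.0
    hits = sum(1 for xi in positives if predict(w, b, xi) >= 0.5)
    return hits / len(positives)


# Synthetic five-feature samples with an arbitrary labeling rule (not real data).
rng = random.Random(0)
X = [[rng.random() for _ in range(5)] for _ in range(200)]
y = [1 if sum(x) / 5 > 0.5 else 0 for x in X]
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

results: Dict[float, float] = {}
for c in COEFFICIENTS:  # step S92: train once per coefficient and record recall
    w, b = train(X_train, y_train, c)
    results[c] = recall(w, b, X_val, y_val)

# Step S93: keep the coefficient with the highest recall (first one on ties).
best_coefficient = max(COEFFICIENTS, key=lambda c: results[c])
print(best_coefficient, results[best_coefficient])
```

Selecting on recall alone favors models that rarely miss credible knowledge; a reader applying this sketch may also want to inspect precision, since a model that labels everything as credible also reaches a recall of 1.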
Compared with the prior art, the invention has the following beneficial effects:
The invention needs only a small number of manual judgments as the training data set for the model to achieve efficient automatic credibility labeling of massive knowledge; it is more efficient and less error-prone, relieves annotators of tedious manual labeling, and improves the accuracy of credibility evaluation of the knowledge corpus;
The method considers five features: generation time, number of revisions, number of accesses, user evaluation and user score. User evaluation is only one of these features, so when it is missing the overall influence on the evaluation result is small; knowledge without user evaluation can therefore still be evaluated, and there are fewer restrictions on what can be evaluated;
Because the method works only on the knowledge corpus and its related information, it is not affected by the system platform, database type, interfaces and the like, can be applied to any government hotline system, and has a wide range of application;
The credibility classifier model is trained and optimized with a logistic regression model, and its performance can improve as the training data set grows; after incremental training of the model, the accuracy of automatic labeling can continue to improve.
Drawings
FIG. 1 is a flow diagram of a method for automatically labeling a credible knowledge corpus facing a government hotline;
FIG. 2 is a schematic flow diagram of model training and optimization in the method for automatically labeling a credible knowledge corpus facing a government hotline.
Detailed Description
Referring to FIGS. 1 and 2, in an embodiment of the invention, a method for automatically labeling credible knowledge corpora for a government hotline comprises the following steps:
S1, preparing policy documents: for the public service information consultation provided by the government hotline, prepare the current policies and regulations, department responsibilities, business matters and other public service information; for example: policies and regulations include laws and administrative regulations, judicial interpretations, local regulations, departmental rules and other normative documents;
department responsibilities include institutional functions, scope of services and internal organization;
business matters include administrative law enforcement, criminal investigation and approval, and public services;
other public service information includes convenience-service answers and explanations of specialized vocabulary used in government affairs work;
S2, structuring the policy documents: structure each document according to its document name, release date, execution deadline, issuing department, jurisdiction of application and document text;
S3, collecting knowledge information: collect the generation time, number of revisions, number of accesses, user evaluation and user score of each piece of knowledge in the knowledge graph and knowledge base system of the government hotline; wherein:
the generation time is the time at which the knowledge corpus was entered into the knowledge base;
the number of revisions is the number of times each knowledge corpus has been revised, which can be obtained from the knowledge base log file;
the number of accesses is the number of times each knowledge corpus has been accessed, which can be obtained from the log files of the service systems supported by the knowledge base;
the user evaluation and user score can be obtained from the evaluation and scoring fields corresponding to each corpus in the knowledge base;
S4, processing the information data: normalize the generation time, number of revisions and number of accesses of each piece of knowledge; the normalization formula is as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$
In formula (1), $x'$ is the normalized value, $x$ is the value before normalization, and $x_{\min}$ and $x_{\max}$ are respectively the minimum and maximum values in each type of data; if the knowledge has no relevant data, $x'$ defaults to 0; for example: when knowledge corpus A has been accessed 80 times, the most-accessed corpus B in the knowledge base has been accessed 140 times, and the least-accessed corpus C has been accessed 20 times, the normalized number of accesses of A is 0.5 (normalized values have no unit), calculated as follows:
$$x' = \frac{80 - 20}{140 - 20} = 0.5$$
S5, processing user scores: calculate the average of the user scores of each piece of knowledge and normalize it; the normalization formula is as follows:
$$x' = \frac{\sum_{i=1}^{n} x_i}{n \cdot x_{\max}} \tag{2}$$
where $x'$ is the normalized average of all user scores for a given piece of knowledge, $x_i$ is the score given by the $i$-th user, $n$ is the total number of user scores, and $x_{\max}$ is the full-score value; if the knowledge has no user score, $x'$ defaults to 0; for example: when knowledge corpus A has been scored 12 times in total, the sum of the scores, 96, is the numerator; the full-score value is 10, so the number of scores multiplied by the full-score value, 12 × 10 = 120, is the denominator; the normalized user score is 0.8, calculated as follows:
$$x' = \frac{96}{12 \times 10} = 0.8$$
S6, analyzing the emotion of user comments: for knowledge corpora with user comments, by default the first 100 Chinese characters of each comment, not counting punctuation marks, are used as the input for emotion analysis; if a comment has fewer than 100 Chinese characters, all of them are used; emotion analysis is performed on each comment using the HowNet emotion analysis lexicon to obtain an emotion score, which is then processed; the emotion analysis formula is as follows:
$$S = \sum_{i=1}^{n} x_i \tag{3}$$
In formula (3), $S$ is the emotion score: when the score is greater than 0, the user comment is a positive evaluation; when the score equals 0, the user comment is a neutral evaluation; when the score is less than 0, the user comment is a negative evaluation; $x_i$ is the emotion score of the $i$-th Chinese character according to the emotion analysis lexicon, which is 1 when the character's emotion category in the lexicon is 'good' or 'happy', 0 when the category is 'surprise', and -1 for the remaining categories, namely 'grief', 'anger', 'fear' and 'badness'; $n$ is the number of input Chinese characters; for example, consider the following user comment:
"Thanks to the convenient platform provided by the government hotline, I could easily find answers to my questions; I was not clear on how to handle the cross-region transfer of my housing provident fund, but through the platform I found the specific procedure and the materials I needed to prepare, which saved me a great deal of time and energy; thank you again!"
To calculate the emotion score of this comment: characters in the 'good' or 'happy' categories appear 5 times, scoring 5; characters in the 'surprise' category appear 47 times, scoring 0; characters in the remaining categories appear 1 time, scoring -1; the emotion score of this example comment is therefore 4, calculated as follows:
$$S = 5 \times 1 + 47 \times 0 + 1 \times (-1) = 4$$
The emotion score processing formula is as follows:
[Formula (4): given as an image in the original publication]
In formula (4), $E$ is the normalized emotion analysis value of the user comment, and $S$ is the emotion score;
when the emotion score is 4, it satisfies $S > 0$, and after the normalization of formula (4) the normalized emotion analysis value $E$ is 1;
S7, manually labeling a small sample: randomly select 500 knowledge corpora from the government hotline system for manual credibility assessment and labeling; a corpus assessed as credible is labeled 1, and a corpus assessed as not credible is labeled 0; with reference to the current policies and regulations, administrative responsibilities, work flows and other public service information for which document structuring has been completed, the manual labeling checks whether the document name cited in the knowledge corpus is correct, whether the knowledge falls within the interval between the document's release date and its execution deadline, whether the departments involved in the knowledge match the document content, whether the jurisdiction of the knowledge matches the document content, whether the knowledge is correctly classified in the service list, whether the knowledge leaks personal privacy or other confidential information, and whether the logic of the knowledge is consistent with the document text; if any checked item is wrong, the credibility assessment is labeled 0; because the method always works only on the knowledge corpus and its related information, it is not affected by the system platform, database type, interfaces and the like, and can be applied to any government hotline system; corpus example:
"According to document No. 8 [2022], 'Notice of the XX Municipal People's Government Office on Forwarding the X Administration Office's Notice on Stabilizing Employment for Key Groups such as College Graduates', issued by the XX municipal government office on 27 May 2022, the 'Employment Registration Certificate for Undergraduate and Junior College Graduates of Regular Colleges and Universities' and the 'Employment Registration Certificate for Graduate Students Nationwide' will no longer be issued from 2023."
Manual verification:
manual labeling checks the date information involved in the knowledge corpus, namely the release date of the notice and the specified start date, verifying that the date information is correct and that the knowledge corpus is within its validity period; the release date is 27 May 2022 and execution starts from 2023, so the date information is correct;
manual labeling checks whether the policy-issuing department involved in the knowledge corpus is correct; the policy was issued by the XX municipal government office, so the information is correct;
manual labeling checks whether the policy document name cited in the knowledge corpus is correct; the document name is correct;
manual labeling checks whether the objects and jurisdiction to which the knowledge corpus applies are consistent with the policy document; the corpus contains no such information, so this cannot be verified;
manual labeling checks whether the knowledge type of the knowledge corpus is accurate; omitted here;
manual labeling checks whether the policy text information in the knowledge corpus is correct and complete, and whether any private information is leaked; the information is correct;
Labeling result:
based on the above verification, the knowledge corpus is manually labeled 1;
S8, training a credibility classifier model: because there is no strong correlation among the generation time, number of revisions, number of accesses, user evaluation and user score of each piece of knowledge, a logistic regression algorithm is used: the normalized results for these five features from steps S4, S5 and S6 are taken as the input tensor of the logistic regression model, and the credibility classifier model is trained against the manually assigned labels; user evaluation is only one of the features, so when it is missing the overall influence on the evaluation result is small, and knowledge without user evaluation can therefore still be evaluated, with fewer restrictions on what can be evaluated; example input tensor for one knowledge corpus: [0.7, 0.9, 0.5, 1, 0.8];
S9, optimizing the model: the logistic regression model, with its sigmoid activation function, is optimized by adjusting the regularization coefficient, so as to improve the confusion matrix results obtained during training of the credibility classifier model;
S10, generating automatic labels: the normalized generation time, number of revisions, number of accesses, user evaluation and user score of all collected and processed knowledge are input into the credibility classifier model as tensors, and the model infers the credibility value of each piece of knowledge, achieving automatic credibility labeling of the knowledge corpus; the credibility value lies in the interval [0, 1], and the closer it is to 1, the higher the credibility; because the final credibility assessment is labeled with a value in [0, 1], users of the knowledge can apply custom threshold screening, sorting and so on according to the labeled credibility values to obtain the credible knowledge corpora they most need.
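The threshold screening and sorting mentioned in step S10 can be sketched in Python as follows. The function name `screen_by_credibility`, the knowledge identifiers and the credibility values are all hypothetical and serve only to illustrate how a user might filter labeled knowledge.

```python
from typing import Dict, List, Tuple


def screen_by_credibility(values: Dict[str, float],
                          threshold: float) -> List[Tuple[str, float]]:
    """Keep knowledge whose credibility value meets the threshold, highest first."""
    kept = [(key, value) for key, value in values.items() if value >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical credibility values produced by the classifier for three items.
labeled_knowledge = {"k1": 0.92, "k2": 0.35, "k3": 0.78}
print(screen_by_credibility(labeled_knowledge, threshold=0.7))
# [('k1', 0.92), ('k3', 0.78)]
```

Because the labels are continuous values rather than a hard 0 or 1, each user can choose a threshold that suits how strictly the knowledge must be vetted.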
Preferably, the manual credibility assessment in step S7 is labeled with reference to the current policies and regulations, administrative responsibilities, work flows and other public service information for which document structuring has been completed.
Preferably, the sigmoid activation function optimization method in step S9 is as follows:
S91, create a list of regularization coefficients [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9];
S92, train the model with each coefficient in the list in turn, and check which coefficients give a high recall rate;
S93, select the coefficient with the highest recall rate as the optimal parameter of the logistic regression model.
The above are only preferred embodiments of the invention, but the scope of protection of the invention is not limited to them; any equivalent replacement or modification made by a person skilled in the art, within the technical scope disclosed by the invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the invention.

Claims (6)

1. A method for automatically labeling credible knowledge corpora for a government hotline, characterized by comprising the following steps:
S1, preparing policy documents: for the public service information consultation provided by the government hotline, prepare the current policies and regulations, department responsibilities, business matters and other public service information;
S2, structuring the policy documents: structure each document according to its document name, release date, execution deadline, issuing department, jurisdiction of application and document text;
S3, collecting knowledge information: collect the generation time, number of revisions, number of accesses, user evaluation and user score of each piece of knowledge in the knowledge graph and knowledge base system of the government hotline;
S4, processing the information data: normalize the generation time, number of revisions and number of accesses of each piece of knowledge;
S5, processing user scores: calculate the average of the user scores of each piece of knowledge and normalize it;
S6, analyzing the emotion of user comments: for knowledge corpora with user comments, by default the first 100 Chinese characters of each comment are used as the input for emotion analysis; if a comment has fewer than 100 Chinese characters, all of them are used; emotion analysis is performed on each comment using the HowNet emotion analysis lexicon to obtain an emotion score, which is then processed;
S7, manually labeling a small sample: randomly select 500 knowledge corpora from the government hotline system for manual credibility assessment and labeling; a corpus assessed as credible is labeled 1, and a corpus assessed as not credible is labeled 0; the assessment checks whether the document name cited in the knowledge corpus is correct, whether the knowledge falls within the interval between the document's release date and its execution deadline, whether the departments involved in the knowledge match the document content, whether the jurisdiction of the knowledge matches the document content, whether the knowledge is correctly classified in the service list, whether the knowledge leaks personal privacy or other confidential information, and whether the logic of the knowledge is consistent with the document text; if any checked item is wrong, the credibility assessment is labeled 0;
S8, training a credibility classifier model: using a logistic regression algorithm, the normalized generation time, number of revisions, number of accesses, user evaluation and user score of each piece of knowledge from steps S4, S5 and S6 are taken as the input tensor of the logistic regression model, and the credibility classifier model is trained against the manually assigned labels;
S9, optimizing the model: the logistic regression model, with its sigmoid activation function, is optimized by adjusting the regularization coefficient, so as to improve the confusion matrix results obtained during training of the credibility classifier model;
S10, generating automatic labels: the normalized generation time, number of revisions, number of accesses, user evaluation and user score of all collected and processed knowledge are input into the credibility classifier model as tensors, and the model infers the credibility value of each piece of knowledge, achieving automatic credibility labeling of the knowledge corpus; the credibility value lies in the interval [0, 1].
2. The method for automatically labeling the credible knowledge corpus of the government hotline according to claim 1, wherein the normalization processing in the step S4 is as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$
in the above formula (1), $x'$ is the value after normalization, $x$ is the value before normalization, and $x_{\min}$ and $x_{\max}$ are respectively the minimum value and the maximum value in each type of data; if the knowledge has no relevant data, $x'$ defaults to 0.
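The min-max normalization of claim 2 can be sketched as follows. The function name `min_max_normalize`, the use of `None` to mark a knowledge item without data, and the handling of a column whose values are all equal are assumptions; the claim itself does not address the last case.

```python
def min_max_normalize(values):
    """Scale one type of data to [0, 1]; items with no data (None) default to 0."""
    present = [v for v in values if v is not None]
    if not present:
        return [0.0 for _ in values]
    lo, hi = min(present), max(present)
    if hi == lo:
        # Assumption: the claim does not cover a constant column; 0.0 is used here.
        return [0.0 for _ in values]
    return [0.0 if v is None else (v - lo) / (hi - lo) for v in values]

# Hypothetical access counts for four knowledge items; the third has no data.
access_counts = [120, 30, None, 75]
normalized = min_max_normalize(access_counts)
print(normalized)  # [1.0, 0.0, 0.0, 0.5]
```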
3. The method for automatically labeling the credible knowledge corpus of the government hotline according to claim 1, wherein the normalization processing in the step S5 is as follows:
$$x' = \frac{1}{x_{\max}} \cdot \frac{1}{n} \sum_{i=1}^{n} x_i \tag{2}$$
wherein $x'$ is the normalized value of the mean of all user scores for a given knowledge item, $x_i$ is the score given by the $i$-th user, $n$ is the total number of user scores, and $x_{\max}$ is the full score; if the knowledge has no user score, $x'$ defaults to 0.
4. The method for automatically labeling the credible knowledge corpus of the government hotline according to claim 1, wherein the emotion analysis formula in the step S6 is as follows:
[Formula (3): equation image not reproduced in the extracted text]
in the above formula (3), score is the emotion score: when the score is greater than 0, the user review is a positive evaluation; when the score is equal to 0, the user review is a neutral evaluation; when the score is less than 0, the user review is a negative evaluation; $x_i$ is the emotion score of the $i$-th Chinese character according to the emotion analysis lexicon, taking the value 1 when the emotion of the character is classified as 'good' or 'happy', 0 when it is classified as 'surprise', and -1 for all other emotion classes; $n$ is the number of input Chinese characters;
and the emotion score processing formula in the step S6 is as follows:
[Formula (4): equation image not reproduced in the extracted text]
in the above formula (4), the left-hand side is the normalized sentiment-analysis value of the user review, and score is the emotion score.
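The per-character scoring rule of claim 4 (1 for 'good' or 'happy', 0 for 'surprise', -1 otherwise) can be sketched as below. Formulas (3) and (4) themselves are not available in this text, so two parts of the sketch are assumptions rather than the patent's formulas: the emotion score is taken as the sum of per-character scores divided by the number of input characters, and it is mapped to [0, 1] with (score + 1) / 2. The toy lexicon, the 'anger' category, the treatment of characters missing from the lexicon (counted in n, contributing 0) and the value returned for an empty review are hypothetical as well.

```python
# Scores per emotion class, as stated in claim 4; any other class scores -1.
CATEGORY_SCORE = {"good": 1, "happy": 1, "surprise": 0}

# Hypothetical toy lexicon mapping Chinese characters to emotion classes.
LEXICON = {"好": "good", "乐": "happy", "惊": "surprise", "怒": "anger"}

def emotion_score(text, lexicon):
    """Assumed form of formula (3): mean per-character emotion score."""
    if not text:
        return 0.0  # assumption: an empty review is treated as neutral
    total = 0
    for ch in text:
        if ch in lexicon:
            total += CATEGORY_SCORE.get(lexicon[ch], -1)
    return total / len(text)

def polarity(score):
    """Positive, neutral or negative evaluation, per the rule in claim 4."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def normalize_emotion(score):
    """Assumed form of formula (4): map a score in [-1, 1] onto [0, 1]."""
    return (score + 1.0) / 2.0

review = "好乐怒"  # scores 1, 1 and -1
s = emotion_score(review, LEXICON)
label = polarity(s)
normalized = normalize_emotion(s)
```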
5. The method for automatically labeling the credible knowledge corpus of the government hotline according to claim 1, wherein the manual credibility assessment and labeling in the step S7 is based on current policies and regulations, administrative functions and duties, workflows and other public service information that has been structured.
6. The method for automatically labeling the credible knowledge corpus facing the government hotline according to claim 1, wherein the sigmoid activation function optimization method in the step S9 is as follows:
S91, creating a list of regularization coefficients [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9];
S92, training the model with each coefficient in the list in turn and identifying the coefficients that yield a high recall rate;
S93, selecting the coefficient with the highest recall rate as the optimal parameter of the logistic regression algorithm model.
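Steps S91 to S93 amount to a small grid search, sketched below under stated assumptions. The regularization coefficient is interpreted here as an L2 penalty strength (a library such as scikit-learn instead exposes the inverse strength), recall is measured on a held-out split, and the trainer, the 0.5 decision threshold and the synthetic data are all hypothetical; the claim specifies none of these details.

```python
import math
import random

def sigmoid(z):
    # Clamp the argument to avoid overflow in math.exp.
    return 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, z))))

def predict(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train(X, y, l2, lr=0.5, epochs=200):
    """Gradient descent on L2-regularized log loss (hypothetical trainer)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errs = [predict(w, b, xi) - yi for xi, yi in zip(X, y)]
        w = [wj - lr * (sum(e * xi[j] for e, xi in zip(errs, X)) / n + l2 * wj)
             for j, wj in enumerate(w)]
        b -= lr * sum(errs) / n
    return w, b

def recall(w, b, X, y, threshold=0.5):
    """True positives divided by all actual positives."""
    tp = fn = 0
    for xi, yi in zip(X, y):
        if yi == 1:
            if predict(w, b, xi) >= threshold:
                tp += 1
            else:
                fn += 1
    return tp / (tp + fn) if tp + fn else 0.0

# Synthetic labeled data (assumption), split into training and held-out parts.
random.seed(1)
data = []
for _ in range(160):
    x = [random.random() for _ in range(5)]
    data.append((x, 1 if sum(x) / 5 > 0.5 else 0))
X_tr, y_tr = [x for x, _ in data[:120]], [t for _, t in data[:120]]
X_val, y_val = [x for x, _ in data[120:]], [t for _, t in data[120:]]

# S91: coefficient list; S92: train with each and record recall; S93: pick the best.
grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
recalls = {c: recall(*train(X_tr, y_tr, l2=c), X_val, y_val) for c in grid}
best = max(grid, key=recalls.get)
```

When several coefficients tie on recall, `max` returns the first one in the list; the claim does not say how ties are resolved, so this is a design choice of the sketch.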
CN202211002539.1A 2022-08-22 2022-08-22 Credible knowledge corpus automatic labeling method facing government hotline Pending CN115309902A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211002539.1A CN115309902A (en) 2022-08-22 2022-08-22 Credible knowledge corpus automatic labeling method facing government hotline

Publications (1)

Publication Number Publication Date
CN115309902A true CN115309902A (en) 2022-11-08

Family

ID=83862365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211002539.1A Pending CN115309902A (en) 2022-08-22 2022-08-22 Credible knowledge corpus automatic labeling method facing government hotline

Country Status (1)

Country Link
CN (1) CN115309902A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117131161A (en) * 2023-10-24 2023-11-28 北京社会管理职业学院(民政部培训中心) Electric wheelchair user demand extraction method and system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination