CN115357769B - Government affair data labeling system and method - Google Patents

Government affair data labeling system and method

Info

Publication number
CN115357769B
CN115357769B
Authority
CN
China
Prior art keywords
data
layer
training
sample
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210922987.7A
Other languages
Chinese (zh)
Other versions
CN115357769A (en)
Inventor
严洪涛
张军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Zhiwei Digital Technology Co ltd
Original Assignee
Wuxi Zhiwei Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Zhiwei Digital Technology Co ltd filed Critical Wuxi Zhiwei Digital Technology Co ltd
Priority to CN202210922987.7A
Publication of CN115357769A
Application granted
Publication of CN115357769B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/906: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a government affair data labeling system and method. With the development of the Internet of Things and big data technology, the information volume of government affair data grows ever larger, and the data formats and types produced by different business systems differ, so that manual identification is almost impossible to complete. According to the invention, the original data is sampled, a preset type mark is attached, the result is input into a training model, and the data category is output after the training model classifies it; this realizes automatic classification and identification of mass data and lays a foundation for subsequent data retrieval applications. The method is equally applicable to blind classification, blind identification and data type subdivision of other kinds of data information, or of data information generated by other kinds of business systems.

Description

Government affair data labeling system and method
Technical Field
The invention relates to the technical fields of the Internet of Things and big data, and in particular to a government affair data labeling system and method.
Background
With the development of the Internet of Things and big data technology, there are ever more types of data and ever larger data volumes, so automatic classification of unknown data becomes increasingly important. Traditional data classification mainly standardizes data through interface docking, database docking and manual identification. These approaches are inefficient, intrude heavily on the original business systems, and can hardly achieve uniform standardization across the board: each business system keeps its own standard, temporary conversion is needed whenever systems are docked, and effective global data governance is impossible.
For government affair data, the number of actual types is not large, but because each government affair information system is developed independently, the same item of government affair data is marked inconsistently across places, time periods and systems; standards are not uniform and formats may also differ. For example, system A may name the identity card number field sfz, system B userID, and system C "old identity card number"; likewise, systems A, B and C may each represent a date of birth in a different format. A unified classification mark is lacking. When a system D needs related information, unless it knows exactly how the different systems represent that information, it can only check item by item whether useful information exists, and it cannot determine whether similar information exists in other systems; a means of blind and automatic identification of unknown data is therefore missing. In addition, with the development of informatization, the huge mass of data lacks marks, and relying entirely on the huge workload of manual marking makes uniform standardization across the board nearly impossible.
Disclosure of Invention
In view of the above, the present invention provides a system and a method for labeling government affair data, which can effectively solve the above-mentioned problems in the prior art.
The government affair data labeling system and method of the invention sample the original data, attach a preset type mark, feed the result into a training model, and output the data category after the training model classifies it. The data labels comprise the data categories output by the training model. Original data sampling includes intercepting a number of segments of the original data. The preset types comprise numbers, Chinese characters, English characters, mixtures of numbers and characters, pictures, videos, texts and other types, where "other types" means every type other than numbers, Chinese characters, English characters, mixtures of numbers and characters, pictures, videos and texts. The training model comprises a neural network model, and the government affair data labeling method comprises the following steps:
S0: training model weight parameter estimation, by empirical estimation or by training model self-learning estimation; empirical estimation means manually estimating each weight parameter value of the model from personal experience. This step is completed before step S1 is executed, but it need not be executed before every run of step S1: once the weight parameter estimation is completed it can support many subsequent rounds of data labeling, and all that is required before labeling is a suitable set of weight parameter estimates;
S1: collecting data, namely collecting service system data to obtain various data, wherein the service system data comprises streaming or non-streaming data, structured or unstructured data, document data, Internet data and the like;
S2: data preprocessing, including sampling the original data and attaching a preset type mark; the original data sampling includes intercepting the first d consecutive bits of an original data unit. An original data unit comprises any one or more of a data file, a document, a database, a data table, similar fields of a data table (whose class may be unknown), and the like. Attaching the preset type mark means prepending a preset type identification code of several bits to the sampled bits; for example, 8 preset types can be represented by a 3-bit binary number. Since a typical data file occupies several KB, several MB or even more, performing feature recognition on all the data of a file would give the most accurate feature information, but the computation and time required could be astronomical. The value of d therefore balances the computational load and efficiency against the effectiveness of feature extraction: it should be neither too large nor too small. Too large a value makes the computation excessive; too small a value, while it may represent the data features of part of the original data unit, may fail to capture valid features of data items with longer format fields or of image data;
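The preprocessing of step S2 can be illustrated with a short Python sketch (the helper names, the padding of units shorter than d bits, and the example codes for 8 preset types are illustrative assumptions, not fixed by the invention):

```python
# Sketch of step S2: take the first d consecutive bits of an original
# data unit and prepend a b-bit preset type identification code.
TYPE_CODES = {  # illustrative 3-bit codes for 8 preset types
    "number": "000", "chinese": "001", "english": "010", "mixed": "011",
    "picture": "100", "video": "101", "text": "110", "other": "111",
}

def preprocess(raw: bytes, preset_type: str, d: int = 800) -> list[int]:
    """Return the m = d + b bit vector that is fed to the training model."""
    bits = "".join(f"{byte:08b}" for byte in raw)[:d]
    bits = bits.ljust(d, "0")  # pad units shorter than d bits (an assumption)
    return [int(c) for c in TYPE_CODES[preset_type] + bits]

x = preprocess(b"1990-05-17", "number")  # len(x) == 803 when d = 800, b = 3
```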
S3: data analysis, calculating the classification information of each group of sampling data according to the weight parameters estimated by the training model parameter estimation method;
For each group of input preprocessed sampling data, the total number of data bits is m, with m = d + b, where b is the number of bits of the preset type mark. For each group of inputs, let x_i be the i-th bit value of the input data after the preset type mark has been attached; x_i takes the binary value 0 or 1, 1 ≤ i ≤ m (here i indexes the input data only). The neural network training model has L training layers, where the first layer is the input layer and the last layer is connected to the output layer; the output layer is not counted in the total number of layers. Layer l has s_l activation items, 1 ≤ l ≤ L, and the number of activation items of the last layer equals the expected total number of classes K, i.e. s_1 = m, s_L = K. The weight parameter transferred by training from the j-th activation of layer l to the i-th activation of layer l+1 is θ_{ij}^{(l)}, 1 ≤ j ≤ s_l, 1 ≤ i ≤ s_{l+1}.

The i-th input value of the first layer is x_i^{(1)} = x_i, and the i-th intermediate training value of the second layer is x_i^{(2)} = g((Θ_i^{(1)})^T X^{(1)}). In general, the i-th intermediate training value of layer l+1 is x_i^{(l+1)}, 1 ≤ i ≤ s_{l+1}, with

x_i^{(l+1)} = g((Θ_i^{(l)})^T X^{(l)}),

where Θ_i^{(l)} = [θ_{i1}^{(l)}, θ_{i2}^{(l)}, …, θ_{i s_l}^{(l)}]^T is the weight parameter vector by which all intermediate training values of layer l are transferred to the i-th training value of layer l+1, the superscript T denotes matrix transposition, X^{(l)} = [x_1^{(l)}, x_2^{(l)}, …, x_{s_l}^{(l)}]^T is the vector of layer-l intermediate training values (also called intermediate quantities), and g is the activation function, taken as the sigmoid g(z) = 1/(1 + e^{-z}), consistent with the probability thresholds of step S4.

Computing layer by layer yields the final-layer output values x_i^{(L+1)}, 1 ≤ i ≤ K;
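The layer-by-layer computation of step S3 can be sketched in a few lines of Python (a minimal illustration assuming NumPy; the function names are ours, not the patent's):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Step S3: layer-by-layer computation x^(l+1) = g(Theta^(l) X^(l)).

    x      : length-m 0/1 input vector X^(1)
    thetas : weight matrices; thetas[l-1] has shape (s_{l+1}, s_l),
             its i-th row being (Theta_i^(l))^T
    Returns the K final-layer output values x^(L+1).
    """
    a = np.asarray(x, dtype=float)
    for theta in thetas:
        a = sigmoid(theta @ a)
    return a
```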
S4: outputting the classification result, namely the data label h. When x_i^{(L+1)} ≤ K_1, take h_i = 0; when x_i^{(L+1)} ≥ K_2, take h_i = 1; when K_1 < x_i^{(L+1)} < K_2, the data is abnormal: it does not belong to any trained data category, or the classification model needs to be retrained. Here K_1 is the error range for probability 0 and K_2 is the error range for probability 1. The classification result is h = [h_1, h_2, …, h_K]; relative to the preset types, the classification result is also called the subdivided data type;
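The thresholding of step S4 translates directly into code (the default K_1 = 0.1 and K_2 = 0.9 are the values used in Embodiment 1; raising an exception for the abnormal case is our choice of error handling):

```python
def to_label(outputs, k1=0.1, k2=0.9):
    """Step S4: output <= K1 gives h_i = 0, output >= K2 gives h_i = 1;
    anything strictly between flags the data as abnormal."""
    h = []
    for o in outputs:
        if o <= k1:
            h.append(0)
        elif o >= k2:
            h.append(1)
        else:
            raise ValueError("abnormal data: not a trained category, "
                             "or the classification model needs retraining")
    return h
```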
S5: labeling and archiving the data, labeling the original data unit corresponding to the sampled data with the classification result h, and archiving and storing the labeled original data unit.
Steps S1 to S5 are repeated for each group of original data units, completing the labeling of all the original data.
Preferably, the training model self-learning estimation in step S0 includes a neural network training model parameter estimation method, including the following steps:
S01: acquiring sample data. Let the total number of samples be N, and let the data label of the n-th sample, y^{(n)} = [y_1^{(n)}, y_2^{(n)}, …, y_K^{(n)}], be known: y_k^{(n)} = 1 when the sample belongs to the k-th class, and y_k^{(n)} = 0 otherwise, where 1 ≤ n ≤ N and 1 ≤ k ≤ K;
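The label convention of step S01 is one-hot encoding; a small illustration (classes are 1-indexed as in the text):

```python
import numpy as np

def one_hot(k: int, K: int) -> np.ndarray:
    """y^(n) for a sample of class k: y_k^(n) = 1, all other entries 0."""
    y = np.zeros(K)
    y[k - 1] = 1.0
    return y

y = one_hot(3, 20)  # a sample of the 3rd of K = 20 classes
```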
S02: sample data preprocessing, including sampling the sample data and attaching a preset type mark; the sample data sampling includes intercepting the first d consecutive bits of a sample data unit. A sample data unit comprises any one or more of a data file, a document, a database, a data table, similar fields of a data table (whose class may be unknown), and the like. Attaching the preset type mark means prepending a preset type identification code of several bits to the sampled bits;
S03: sample training, calculating the training values of each layer of the neural network model.

For each group of input preprocessed sample sampling data, the number of data bits is m, with m = d + b, where b is the number of bits of the preset type mark; x_i is the i-th bit value of the input data after the preset type mark has been attached, taking the binary value 0 or 1, 1 ≤ i ≤ m (here i indexes the input data only). The neural network training model has L training layers, where the first layer is the input layer and the last layer is connected to the output layer; the output layer is not counted in the total number of layers. Layer l has s_l activation items, 1 ≤ l ≤ L, and the number of activation items of the last layer equals the expected total number of classes K, i.e. s_1 = m, s_L = K. The weight parameter transferred by training from the j-th activation of layer l to the i-th activation of layer l+1 is θ_{ij}^{(l)}, 1 ≤ j ≤ s_l, 1 ≤ i ≤ s_{l+1}.

Input the sample data and the initial weight parameter values, with x_i^{(1)} = x_i. The i-th intermediate value of layer l+1 is x_i^{(l+1)}, 1 ≤ i ≤ s_{l+1}, with

x_i^{(l+1)} = g((Θ_i^{(l)})^T X^{(l)}),

where Θ_i^{(l)} = [θ_{i1}^{(l)}, θ_{i2}^{(l)}, …, θ_{i s_l}^{(l)}]^T is the weight parameter vector by which all intermediate training values of layer l are transferred to the i-th training value of layer l+1, the superscript T denotes matrix transposition, and X^{(l)} = [x_1^{(l)}, x_2^{(l)}, …, x_{s_l}^{(l)}]^T is the vector of layer-l intermediate training values (also called intermediate quantities).

Compute layer by layer until the final-layer output values x_i^{(L+1)}, 1 ≤ i ≤ K, are obtained, and record the training result of the n-th sample as h^{(n)} = [h_1^{(n)}, h_2^{(n)}, …, h_K^{(n)}], 1 ≤ n ≤ N.

Step S03 is repeated for each group of input preprocessed sample sampling data;
S04: calculating the cost function. For all training results and weight parameters, calculate the cost function J(Θ), where Θ is the set of all θ_{ij}^{(l)}:

J(Θ) = -(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} [ y_k^{(n)} ln h_k^{(n)} + (1 - y_k^{(n)}) ln(1 - h_k^{(n)}) ] + (λ/(2N)) Σ_{l=1}^{L} Σ_{i=1}^{s_{l+1}} Σ_{j=1}^{s_l} (θ_{ij}^{(l)})²,

where λ is the deviation penalty parameter, 0 ≤ λ ≤ 1; λ is an empirical parameter and is generally small. The cost function is essentially the sum of three terms: the positive deviation expectation and the negative deviation expectation of the probability, and the deviation penalty term;
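With J(Θ) in the form above, step S04 can be sketched as follows (H holds the N×K training results h_k^{(n)}, Y the labels y_k^{(n)}; the default lam = 0.001 is the embodiment's value, and the small eps guard against log 0 is our addition):

```python
import numpy as np

def cost(H, Y, thetas, lam=0.001):
    """Step S04: cross-entropy data term plus the lambda-weighted penalty."""
    N = H.shape[0]
    eps = 1e-12  # numerical guard, an implementation choice
    data_term = -np.sum(Y * np.log(H + eps)
                        + (1.0 - Y) * np.log(1.0 - H + eps)) / N
    penalty = lam / (2.0 * N) * sum(np.sum(t ** 2) for t in thetas)
    return data_term + penalty
```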
S05: optimizing the weight parameters. Over all weight parameters θ_{ij}^{(l)}, optimize J(Θ) so that it is minimized, i.e.

Θ = argmin J(Θ).

The optimization includes automatic optimization by general machine learning software, or traversal optimization over weight parameter combinations: for each θ_{ij}^{(l)} in turn, steps S03 and S04 are executed repeatedly over its range of values with a designated step length, and the θ_{ij}^{(l)} minimizing J(Θ) is taken as the optimization result.
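The traversal variant of step S05 amounts to a coordinate-wise grid search; a sketch follows (the value range and step length are illustrative, H_of is an assumed callback that re-runs step S03 for the current weights, and cost is the step S04 sketch above):

```python
import numpy as np

def traverse_optimize(thetas, H_of, Y, grid=np.arange(-1.0, 1.01, 0.1)):
    """Step S05 traversal: vary one theta_ij^(l) at a time over the grid,
    re-run S03/S04, and keep the value that minimizes J(Theta)."""
    for theta in thetas:
        for idx in np.ndindex(theta.shape):
            best_v, best_j = theta[idx], np.inf
            for v in grid:
                theta[idx] = v
                j = cost(H_of(thetas), Y, thetas)  # S03 forward + S04 cost
                if j < best_j:
                    best_v, best_j = v, j
            theta[idx] = best_v
    return thetas
```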
In a second aspect, the government affair data labeling system comprises at least six modules connected in sequence: parameter estimation, data acquisition, data preprocessing, data analysis, label output and data labeling; the six modules execute steps S0, S1, S2, S3, S4 and S5 respectively. The government affair data labeling system is essentially a computer system.
Further, the parameter estimation module comprises five modules: sample data acquisition, sample data preprocessing, sample training, cost function calculation and weight parameter optimization; the five modules execute steps S01, S02, S03, S04 and S05 respectively.
Advantages and beneficial effects of the invention: the invention provides a government affair data labeling system and method offering automatic classification and identification of unknown-class, mass government affair data; it can reduce a large amount of manual identification workload and can even complete identification work that is infeasible to do manually. After identification, each piece of original data carries a corresponding data type label, which facilitates subsequent data retrieval and professional applications. The method is equally applicable to blind classification, blind identification and data type subdivision of other kinds of data information, or of data generated by other kinds of business systems.
Drawings
FIG. 1 is a flow chart of a method for tagging government affair data;
FIG. 2 is a flow chart of a weight parameter optimizing estimation method;
FIG. 3 is a schematic block diagram of a neural network training model;
FIG. 4 is a schematic block diagram of the government affair data labeling system.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings and examples. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Example 1: government affair data labeling method
As shown in FIG. 1, the government affair data labeling system and method sample the original data, attach a preset type mark, feed the result into the training model, and output the data category after the training model classifies it. The data labels comprise the data categories output by the training model; original data sampling includes intercepting a number of segments of the original data. The preset types comprise numbers, Chinese characters, English characters, mixtures of numbers and characters, pictures, videos, texts and other types, where "other types" means every type other than the seven just listed. The training model comprises a neural network model, and the government affair data labeling method comprises the following steps:
S0: training model weight parameter estimation, by empirical estimation or by training model self-learning estimation; empirical estimation means manually estimating each weight parameter value of the model from personal experience. In this embodiment the weight parameters are estimated by self-learning with the neural network training model; the initial values for the estimation can be empirical values, which speeds up the optimization;
S1: collecting data, namely collecting service system data to obtain various data, wherein the service system data comprises streaming or non-streaming data, structured or unstructured data, document data, Internet data and the like;
S2: data preprocessing, including sampling the original data and attaching a preset type mark; the original data sampling includes intercepting the first d consecutive bits of an original data unit. An original data unit comprises any one or more of a data file, a document, a database, a data table, similar fields of a data table (whose class may be unknown), and the like. Attaching the preset type mark means prepending a preset type identification code of several bits to the sampled bits. Since a typical data file occupies several KB, several MB or even more, performing feature recognition on all the data of a file would give the most accurate feature information, but the computation and time required could be astronomical. The value of d therefore balances the computational load and efficiency against the effectiveness of feature extraction: too large a value makes the computation excessive, while too small a value, although it may represent the data features of part of the original data unit, may fail to capture valid features of data items with longer format fields or of image data. The host computer of this embodiment is a workstation or server with relatively strong computing power, and d = 800 bits is generally taken. The preset general types number, Chinese character, English character, mixture of numbers and characters, picture, video, text and other are marked in sequence by the 3-bit binary numbers 000, 001, 010, 011, 100, 101, 110 and 111;
S3: data analysis, calculating the classification information of each group of sampling data according to the weight parameters estimated by the training model parameter estimation method;
for each set of input preprocessed sample data, the total number of data bits is m bits, m=d+b, where b is the number of bits of the preset type flag, in this embodiment b=3, d=800, m=803; for each set of inputs, record x i Taking binary 0 or 1, i is more than or equal to 1 and less than or equal to m for the ith bit value of the input data after the addition of a preset type mark (i is only used for inputting the data at the place); the neural network training model has L training layers, wherein the first layer is an input layer, the last layer is connected with an output layer, the output layer is not calculated into the total layer number, and each layer has s l The number of the activation items is equal to or more than 1 and equal to or less than L, and the number of the activation items of the last layer is consistent with the expected total classification number K, namely s 1 =m,s L =k; the present embodiment uses a 5-layer neural network training model (no output layer is calculated), i.e., l=5, and s 1 =s 2 =s 3 =s 4 =m=803,s 5 =k=20; the weight parameters of the training transfer from the jth position of the layer I to the (l+1) th position of the training model are as follows
Figure BDA0003778489240000091
1≤j≤s l ,1≤i≤s l+1
First layer ith bit input value
Figure BDA0003778489240000092
Second layer ith intermediate training value +.>
Figure BDA0003778489240000093
The ith intermediate training value of the first layer is +.>
Figure BDA0003778489240000094
1≤i≤s l There is
Figure BDA0003778489240000095
Wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0003778489240000096
for the weight parameter vector when all intermediate training values of the first layer are transferred to the ith training value of the first layer (1) layer, the superscript T is matrix transposition, and the +.>
Figure BDA0003778489240000097
X l Intermediate training values, otherwise known as intermediate quantities,
Figure BDA0003778489240000098
the final layer output value is obtained after the layer-by-layer calculation
Figure BDA0003778489240000099
1≤i≤K;
S4: outputting the classification result, namely the data label h. When x_i^{(L+1)} ≤ K_1, take h_i = 0; when x_i^{(L+1)} ≥ K_2, take h_i = 1; when K_1 < x_i^{(L+1)} < K_2, the data is abnormal: it does not belong to any trained data category, or the classification model needs to be retrained. Here K_1 is the error range for probability 0 and K_2 is the error range for probability 1. The classification result is h = [h_1, h_2, …, h_K]; relative to the preset types, the classification result is also called the subdivided data type. If the preset number type is subdivided, this embodiment further comprises three subclasses: identity information, position information and statistical information; the subdivided data classes depend mainly on the classification requirements of the actual data. This embodiment takes K_1 = 0.1 and K_2 = 0.9, and the classification result h is a 1×20 matrix;
S5: labeling and archiving the data, labeling the original data unit corresponding to the sampled data with the classification result h, and archiving and storing the labeled original data unit.
Steps S1 to S5 are repeated for each group of original data units, completing the labeling of all the original data.
Further, as shown in fig. 2 and 3, the training model self-learning estimation in step S0 includes a neural network training model parameter estimation method, which includes the following steps:
S01: acquiring sample data. Let the total number of samples be N, and let the data label of the n-th sample, y^{(n)} = [y_1^{(n)}, y_2^{(n)}, …, y_K^{(n)}], be known: y_k^{(n)} = 1 when the sample belongs to the k-th class, and y_k^{(n)} = 0 otherwise, where 1 ≤ n ≤ N and 1 ≤ k ≤ K;
S02: sample data preprocessing, including sampling the sample data and attaching a preset type mark; the sample data sampling includes intercepting the first d consecutive bits of a sample data unit. A sample data unit comprises any one or more of a data file, a document, a database, a data table, similar fields of a data table (whose class may be unknown), and the like. Attaching the preset type mark means prepending a preset type identification code of several bits to the sampled bits;
S03: sample training, calculating the training values of each layer of the neural network model.

For each group of input preprocessed sample sampling data, the number of data bits is m, with m = d + b, where b is the number of bits of the preset type mark; in this embodiment b = 3, d = 800 and m = 803. x_i is the i-th bit value of the input data after the preset type mark has been attached, taking the binary value 0 or 1, 1 ≤ i ≤ m (here i indexes the input data only). The neural network training model has L training layers, where the first layer is the input layer and the last layer is connected to the output layer; the output layer is not counted in the total number of layers. Layer l has s_l activation items, 1 ≤ l ≤ L, and the number of activation items of the last layer equals the expected total number of classes K, i.e. s_1 = m, s_L = K. This embodiment uses a 5-layer neural network training model (the output layer is not counted), i.e. L = 5, with s_1 = s_2 = s_3 = s_4 = m = 803 and s_5 = K = 20. The weight parameter transferred by training from the j-th activation of layer l to the i-th activation of layer l+1 is θ_{ij}^{(l)}, 1 ≤ j ≤ s_l, 1 ≤ i ≤ s_{l+1}.

Input the sample data and the initial weight parameter values. On the first trial all initial values are set to 0.5; during subsequent debugging, the weight parameter results of one trial are used as the initial values for the next, which is considered more reasonable and accelerates the test and debugging process. With x_i^{(1)} = x_i, the i-th intermediate value of layer l+1 is x_i^{(l+1)}, 1 ≤ i ≤ s_{l+1}, with

x_i^{(l+1)} = g((Θ_i^{(l)})^T X^{(l)}),

where Θ_i^{(l)} = [θ_{i1}^{(l)}, θ_{i2}^{(l)}, …, θ_{i s_l}^{(l)}]^T is the weight parameter vector by which all intermediate training values of layer l are transferred to the i-th training value of layer l+1, the superscript T denotes matrix transposition, and X^{(l)} = [x_1^{(l)}, x_2^{(l)}, …, x_{s_l}^{(l)}]^T is the vector of layer-l intermediate training values (also called intermediate quantities).

Compute layer by layer until the final-layer output values x_i^{(L+1)}, 1 ≤ i ≤ K, are obtained, and record the training result of the n-th sample as h^{(n)} = [h_1^{(n)}, h_2^{(n)}, …, h_K^{(n)}], 1 ≤ n ≤ N.

Step S03 is repeated for each group of input preprocessed sample sampling data;
S04: calculating the cost function. For all training results and weight parameters, calculate the cost function J(Θ), where Θ is the set of all θ_{ij}^{(l)}:

J(Θ) = -(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} [ y_k^{(n)} ln h_k^{(n)} + (1 - y_k^{(n)}) ln(1 - h_k^{(n)}) ] + (λ/(2N)) Σ_{l=1}^{L} Σ_{i=1}^{s_{l+1}} Σ_{j=1}^{s_l} (θ_{ij}^{(l)})²,

where λ is the deviation penalty parameter, 0 ≤ λ ≤ 1; λ is an empirical parameter and is generally small, taken as 0.001 in this embodiment. The cost function is essentially the sum of three terms: the positive deviation expectation and the negative deviation expectation of the probability, and the deviation penalty term;
In this embodiment the total number of samples is N = 300,000: 200,000 are used for model training and 100,000 for model testing after training; when the test result is unsatisfactory, the parameters are adjusted and training and testing are repeated;
S05: optimizing the weight parameters. Over all weight parameters θ_{ij}^{(l)}, optimize J(Θ) so that it is minimized, i.e.

Θ = argmin J(Θ).

The optimization includes automatic optimization by general machine learning software, or traversal optimization over weight parameter combinations: for each θ_{ij}^{(l)} in turn, steps S03 and S04 are executed repeatedly over its range of values with a designated step length, and the θ_{ij}^{(l)} minimizing J(Θ) is taken as the optimization result. This embodiment adopts general neural network training software (such as the TensorFlow open-source machine learning platform); the principle is shown in FIG. 3 and FIG. 4. After the number of training layers and the activation items of each layer are defined, the software automatically performs machine learning and outputs the optimized weight parameters Θ and the value of the cost function J(Θ).
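For the embodiment's model shape (an input of m = 803 bits, layers 2 to 4 of 803 activation items each, layer 5 of K = 20 outputs, sigmoid activations throughout), a TensorFlow/Keras declaration might look like the following sketch. The optimizer, the L2 factor standing in for λ, and the loss name are our assumptions, and Keras's binary cross-entropy matches the data term of J(Θ) only up to averaging constants:

```python
import tensorflow as tf

reg = tf.keras.regularizers.l2(0.001)  # stands in for the lambda penalty
model = tf.keras.Sequential([
    # layers 2-4: 803 activation items each (layer 1 is the 803-bit input)
    tf.keras.layers.Dense(803, activation="sigmoid",
                          input_shape=(803,), kernel_regularizer=reg),
    tf.keras.layers.Dense(803, activation="sigmoid", kernel_regularizer=reg),
    tf.keras.layers.Dense(803, activation="sigmoid", kernel_regularizer=reg),
    # layer 5: K = 20 outputs
    tf.keras.layers.Dense(20, activation="sigmoid", kernel_regularizer=reg),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(X_train, Y_train, validation_data=(X_test, Y_test))
```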
In this embodiment the weight parameters are estimated by self-learning with the neural network training model. In the actual estimation process the sample data are divided into two parts: one part is used to train the weight parameter estimate, the other to test the usability of the estimation result. All sample data are pre-classified and known, and the classification method and system are considered practically usable when the test pass rate exceeds 90%. When the automatic classification result of the test data disagrees with the pre-assigned class, the case is generally analyzed manually to find the cause, and the model parameters are adjusted if necessary. Inaccurate classification results are mainly caused by an insufficient number of sampled bits, which cannot clearly represent the features of the original data; other possible causes are that the original data do not match the intended subdivision classes, or that the sampled original data contain uncertain interference terms.
This embodiment generally does not retrain after the test training is completed, since retraining may mean that previous classification results need to be reclassified; this requires case-by-case analysis. To ensure the availability of the classification system, this embodiment requires a test pass rate above 99%, after which all parameters and weights can be frozen. When special cases cannot be classified, the data can be analyzed manually to find the cause; apart from a temporary adjustment of the number of sampling bits, no other parameters are changed without special need, and such cases are classified as problem data.
Example 2: government affair data tagging system
As shown in FIG. 1, the government affair data labeling system comprises at least six modules connected in sequence: parameter estimation, data acquisition, data preprocessing, data analysis, label output and data labeling; the six modules execute steps S0, S1, S2, S3, S4 and S5 respectively. The government affair data labeling system is essentially a computer system.
Further, as shown in FIG. 2, the parameter estimation module comprises five modules: sample data acquisition, sample data preprocessing, sample training, cost function calculation and weight parameter optimization; the five modules execute steps S01, S02, S03, S04 and S05 respectively.
The working principle of the system is shown in FIG. 4: model training uses sample data, and once the training results are obtained, the model is applied to actual data.
For the automatic classification method and system of government affair data in Embodiments 1 and 2, the weight parameters are generally not adjusted once put into actual application. Model training to re-estimate the weight parameters is performed mainly when a new large class of original data needs to be subdivided, whose subdivision standard may differ from that of the original government affair data. In that case not only must the model be retrained, but parameters such as the number of sampling bits, the preset types and the expected total number of classes must also be reset. The method of the invention can thus in fact be applied to the automatic subdivision of other kinds of data information.
The basic principle of the invention is as follows: determine the number of expected data labels according to the total number of possible subdivision categories of the government affair data; preset general data types according to the possible informatized representations of the original data units, such as numbers, Chinese characters, English characters, mixtures of numbers and characters, pictures, videos and texts; collect explicitly classified original data as training samples to train the model and obtain reasonable weight parameters; and classify and identify the original data units to be classified with the trained model, archiving them after labeling is completed.
Applications of the invention: classify the data with the model and establish labels for normally classifiable data to ensure usability; perform deep analysis on data that cannot be matched and attempt fuzzy matching; clean duplicated data; archive dirty or polluted data, feed back to the related business systems, and attempt to obtain standard business data a second time; and finally archive and review the data that still cannot be matched. The system background performs statistical analysis on the matched label data, can provide change monitoring and data interface services, and realizes intelligent data retrieval and supervision. Data with unsatisfactory classification results can on the one hand be classified directly as problem data or other types, and on the other hand be analyzed manually, retraining the model if necessary. For other kinds of data information, or data generated by other kinds of business systems, applying the method to further subdivide data categories mainly requires corresponding design of the model parameters, such as the number of sampling bits, the preset types and the expected total number of classes.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art can make modifications and variations, including to the selection and initialization of parameters, without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (4)

1. A government affair data labeling method, characterized in that original data is sampled, a preset type mark is attached, the result is input into a training model, and the data category is output after the training model classifies it; the data labels comprise the data categories output by the training model; the original data sampling includes intercepting a number of segments of the original data; the preset types comprise numbers, Chinese characters, English characters, mixtures of numbers and characters, pictures, videos and texts; the training model comprises a neural network model, and the government affair data labeling method comprises the following steps:
S0: training model weight parameter estimation, including empirical estimation or training model self-learning estimation, wherein the empirical estimation comprises manually estimating each weight parameter value of the model according to personal experience;
S1: collecting data, namely collecting service system data, wherein the service system data comprises streaming or non-streaming data, structured or unstructured data, document data and Internet data;
S2: data preprocessing, comprising sampling the original data and attaching a preset type mark, wherein the original data sampling comprises intercepting the first d consecutive bits of an original data unit; the original data unit comprises any one or more of data files, documents, databases, data tables and similar fields of data tables; attaching the preset type mark comprises prepending a preset type identification code of several bits to the sampled bits;
S3: data analysis, calculating the classification information of each group of sampling data according to the weight parameters estimated by the training model parameter estimation method;
for each group of input preprocessed sampling data, the total number of data bits is m, with m = d + b, where b is the number of bits of the preset type mark; for each group of inputs, x_i denotes the i-th bit value of the input data after the preset type mark has been attached; the neural network training model has L training layers, wherein the first layer is an input layer and the last layer is connected to an output layer; layer l has s_l activation items, 1 ≤ l ≤ L, and the number of activation items of the last layer equals the expected total number of classes K, i.e. s_1 = m, s_L = K; the weight parameter transferred by training from the j-th activation of layer l to the i-th activation of layer l+1 is θ_{ij}^{(l)}, 1 ≤ j ≤ s_l, 1 ≤ i ≤ s_{l+1};

the i-th input value of the first layer is x_i^{(1)} = x_i; the i-th intermediate training value of the second layer is x_i^{(2)} = g((Θ_i^{(1)})^T X^{(1)}); the i-th intermediate training value of layer l+1 is

x_i^{(l+1)} = g((Θ_i^{(l)})^T X^{(l)}),

where Θ_i^{(l)} = [θ_{i1}^{(l)}, θ_{i2}^{(l)}, …, θ_{i s_l}^{(l)}]^T is the weight parameter vector by which all intermediate training values of layer l are transferred to the i-th training value of layer l+1, the superscript T denotes matrix transposition, and X^{(l)} = [x_1^{(l)}, x_2^{(l)}, …, x_{s_l}^{(l)}]^T is the vector of layer-l intermediate training values of the model;

the final-layer output values x_i^{(L+1)}, 1 ≤ i ≤ K, are obtained after the layer-by-layer calculation;
S4: outputting the classification result, namely the data label h: when x_i^{(L+1)} ≤ K_1, take h_i = 0; when x_i^{(L+1)} ≥ K_2, take h_i = 1; when K_1 < x_i^{(L+1)} < K_2, the data is abnormal, not belonging to the trained data categories, or the classification model needs to be retrained; wherein K_1 is the error range for probability 0 and K_2 is the error range for probability 1; the classification result is h = [h_1, h_2, …, h_K];
S5: labeling and archiving the data, labeling the original data unit corresponding to the sampled data by using the classification result h, and archiving and storing the labeled original data unit.
2. The method for labeling government affair data according to claim 1, wherein the training model self-learning estimation in step S0 comprises a neural network training model parameter estimation method, comprising the steps of:
S01: acquiring sample data, the total number of samples being N, the data label of the n-th sample, y^{(n)} = [y_1^{(n)}, y_2^{(n)}, …, y_K^{(n)}], being known: y_k^{(n)} = 1 when the sample belongs to the k-th class and y_k^{(n)} = 0 otherwise, wherein 1 ≤ n ≤ N and 1 ≤ k ≤ K;
S02: sample data preprocessing, including sampling the sample data and attaching a preset type mark, wherein the sample data sampling includes intercepting the first d consecutive bits of a sample data unit; the sample data unit comprises any one or more of data files, documents, databases, data tables and similar fields of data tables; attaching the preset type mark comprises prepending a preset type identification code of several bits to the sampled bits;
S03: sample training, calculating the training values of each layer of the neural network model:

for each group of input preprocessed sample sampling data, the number of data bits is m, with m = d + b, where b is the number of bits of the preset type mark; x_i is the i-th bit value of the input data after the preset type mark has been attached; the neural network training model has L training layers, wherein the first layer is an input layer and the last layer is connected to an output layer; layer l has s_l activation items, 1 ≤ l ≤ L, and the number of activation items of the last layer equals the expected total number of classes K, i.e. s_1 = m, s_L = K; the weight parameter transferred by training from the j-th activation of layer l to the i-th activation of layer l+1 is θ_{ij}^{(l)}, 1 ≤ j ≤ s_l, 1 ≤ i ≤ s_{l+1};

the sample data and the initial weight parameter values are input, with x_i^{(1)} = x_i; the i-th intermediate value of layer l+1 is

x_i^{(l+1)} = g((Θ_i^{(l)})^T X^{(l)}),

where Θ_i^{(l)} = [θ_{i1}^{(l)}, θ_{i2}^{(l)}, …, θ_{i s_l}^{(l)}]^T is the weight parameter vector by which all intermediate training values of layer l are transferred to the i-th training value of layer l+1, the superscript T denotes matrix transposition, and X^{(l)} = [x_1^{(l)}, x_2^{(l)}, …, x_{s_l}^{(l)}]^T is the vector of layer-l intermediate training values of the model;

the calculation proceeds layer by layer until the final-layer output values x_i^{(L+1)}, 1 ≤ i ≤ K, are obtained, and the training result of the n-th sample is recorded as h^{(n)} = [h_1^{(n)}, h_2^{(n)}, …, h_K^{(n)}], 1 ≤ n ≤ N;
step S03 is repeated for each group of input preprocessed sample sampling data;
S04: calculating the cost function, calculating the cost function J(Θ) for all training results and weight parameters, Θ being the set of all θ_{ij}^{(l)}:

J(Θ) = -(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} [ y_k^{(n)} ln h_k^{(n)} + (1 - y_k^{(n)}) ln(1 - h_k^{(n)}) ] + (λ/(2N)) Σ_{l=1}^{L} Σ_{i=1}^{s_{l+1}} Σ_{j=1}^{s_l} (θ_{ij}^{(l)})²,

wherein λ is the deviation penalty parameter, 0 ≤ λ ≤ 1;
S05: optimizing the weight parameters: over all weight parameters θ_{ij}^{(l)}, optimizing J(Θ) so that J(Θ) is minimized, i.e.

Θ = argmin J(Θ);

the optimizing comprises automatic optimizing by general machine learning software or traversal optimizing over weight parameter combinations; the weight parameter combination traversal optimizing comprises, for each θ_{ij}^{(l)} one by one, repeatedly executing steps S03 and S04 over its range of values with a designated step length, and taking the θ_{ij}^{(l)} that minimizes J(Θ) as the optimizing result.
3. A government affair data labeling system, characterized by comprising at least six modules connected in sequence: parameter estimation, data acquisition, data preprocessing, data analysis, label output and data labeling, the six modules respectively executing steps S0, S1, S2, S3, S4 and S5 of claim 1 in sequence.
4. The government affair data labeling system according to claim 3, wherein the parameter estimation module comprises five modules: sample data acquisition, sample data preprocessing, sample training, cost function calculation and weight parameter optimization, the five modules respectively executing steps S01, S02, S03, S04 and S05 of claim 2 in sequence.
CN202210922987.7A 2022-08-02 2022-08-02 Government affair data labeling system and method Active CN115357769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210922987.7A CN115357769B (en) 2022-08-02 2022-08-02 Government affair data labeling system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210922987.7A CN115357769B (en) 2022-08-02 2022-08-02 Government affair data labeling system and method

Publications (2)

Publication Number Publication Date
CN115357769A CN115357769A (en) 2022-11-18
CN115357769B true CN115357769B (en) 2023-07-04

Family

ID=84033026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210922987.7A Active CN115357769B (en) 2022-08-02 2022-08-02 Government affair data labeling system and method

Country Status (1)

Country Link
CN (1) CN115357769B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110756A (en) * 2019-04-09 2019-08-09 北京中科智营科技发展有限公司 A kind of data classification optimization method and optimization device
CN113505222A (en) * 2021-06-21 2021-10-15 山东师范大学 Government affair text classification method and system based on text circulation neural network
CN114091472A (en) * 2022-01-20 2022-02-25 北京零点远景网络科技有限公司 Training method of multi-label classification model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060195782A1 (en) * 2005-02-28 2006-08-31 Microsoft Corporation Method and system for classifying and displaying tables of information
US11817993B2 (en) * 2015-01-27 2023-11-14 Dell Products L.P. System for decomposing events and unstructured data
US11334795B2 (en) * 2020-03-14 2022-05-17 DataRobot, Inc. Automated and adaptive design and training of neural networks


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research and Application of Knowledge Extraction for the Government Affairs Domain; Hu Tiantian; China Master's Theses Full-text Database, Social Sciences I; p. G110-2 *

Also Published As

Publication number Publication date
CN115357769A (en) 2022-11-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant