CN115357769A - Government affair data labeling system and method - Google Patents

Government affair data labeling system and method

Info

Publication number
CN115357769A
Authority
CN
China
Prior art keywords
data
layer
training
value
sample
Prior art date
Legal status
Granted
Application number
CN202210922987.7A
Other languages
Chinese (zh)
Other versions
CN115357769B (en)
Inventor
严洪涛
张军
Current Assignee
Wuxi Zhiwei Digital Technology Co., Ltd.
Original Assignee
Wuxi Zhiwei Digital Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Wuxi Zhiwei Digital Technology Co., Ltd.
Priority to CN202210922987.7A
Publication of CN115357769A
Application granted
Publication of CN115357769B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a government affair data labeling system and method. With the development of the Internet of Things and big data technology, the volume of government affair data keeps growing while the data formats and types produced by different business systems differ, so labeling can hardly be completed by manual identification alone. According to the invention, the original data are sampled, a preset type mark is added, and the result is input into a training model; after classification by the training model, the data category is output. Mass data can thus be classified and identified automatically, laying a foundation for subsequent data retrieval applications. The method of the invention is also suitable for blind classification, blind identification and data type subdivision of other kinds of data information or other kinds of business systems.

Description

Government affair data labeling system and method
Technical Field
The invention relates to the technical field of the Internet of Things and big data, in particular to a government affair data labeling system and method.
Background
With the development of the Internet of Things and big data technology, data types multiply and data volumes keep growing, so automatic classification of unknown data becomes ever more important. Traditional data classification mainly standardizes data through interface docking, database docking and manual identification. This approach is inefficient and intrudes heavily on the original business systems, and unified standardization across the board is difficult to achieve: each business system tends to keep its own independent standard, ad hoc conversion is needed whenever systems must be docked, and effective global data governance is impossible.
Government affair data actually fall into a limited number of types, but because each government information system is developed independently, the same government data carry different signs, standards and formats in different places, different periods and different government information systems. For example, system A names the ID-number field sfz, system B names it user ID, and system C calls it the old ID number; likewise, the systems represent dates differently, e.g. system A writes year-month-day while systems B and C use other orderings or separators. Unified classification marks are lacking. When a system D needs related information, unless the representations used by the different systems are known exactly, useful information can only be found by checking entries one by one, and it cannot be determined whether similar information exists in other systems; a means of blind and automatic identification of unknown data is therefore lacking. In addition, as information technology develops, mass data lack identification: the workload of relying entirely on manual identification is enormous, and a completely unified standard is hard to achieve.
Disclosure of Invention
In view of the above, the present invention provides a system and a method for labeling government affair data, which can effectively solve the above-mentioned problems in the prior art.
The government affair data labeling system and method are characterized in that the original data are sampled, a preset type mark is added, and the result is input into a training model; after classification by the training model, the data category of the original data is output. The data label comprises the data category output by the training model. Sampling the original data comprises intercepting one or more paragraphs of the original data. The preset types comprise numbers, Chinese characters, English characters, number-character mixtures, pictures, videos, texts and an "other" type, the "other" type covering everything except the number, Chinese character, English character, number-character mixture, picture, video and text types. The training model comprises a neural network model, and the government affair data labeling method comprises the following steps:
S0: training model weight parameter estimation, by either empirical estimation or training model self-learning estimation, where empirical estimation means manually estimating each weight parameter value of the model from personal experience. This step is completed before step S1 is executed, but it need not be executed before every run of step S1: once the weight parameters have been estimated, they can support many subsequent labeling runs, and all that is required before labeling is a suitable set of weight parameter estimates;
S1: data acquisition, i.e. collecting business system data so as to obtain data of various types, where the business system data comprise streaming or non-streaming data, structured or unstructured data, document data, Internet data and the like;
S2: data preprocessing, comprising sampling the original data and adding a preset type mark, where sampling the original data comprises intercepting the first d consecutive bits of an original data unit; the original data unit comprises any one or more of data files, documents, databases, data tables, same-type fields of a data table (possibly of unknown category) and the like; adding the preset type mark comprises prepending a preset type identification code of several bits to the sampled bits, e.g. 8 preset types can be represented by 3 binary bits. Since a typical data file runs to kilobytes, megabytes or more, feature recognition over all the data of a file would yield the most accurate feature information, but the computation and time required could be astronomical. The value of d therefore balances computational cost and efficiency against the effectiveness of feature extraction, and should be neither too long nor too short: too long and the computation becomes excessive, too short and the features of the original data unit may not be fully covered. A smaller d can still represent the data characteristics of part of the original data unit, but for data items with long format fields, or for image data, it may be difficult to extract valid features;
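For illustration, this preprocessing step can be sketched as follows; a minimal sketch in Python, in which the function name, the zero-padding of units shorter than d bits and the example values are assumptions rather than part of the claimed method:

```python
# Sketch of step S2: take the first d bits of a raw data unit and prepend
# the b-bit preset type identification code, giving an m = d + b bit vector.

def preprocess(raw: bytes, type_code: str, d: int = 800) -> list:
    # Expand the raw bytes into a bit string and keep the first d bits;
    # units shorter than d bits are zero-padded on the right (an assumption,
    # the text does not specify how short units are handled).
    bits = "".join(f"{byte:08b}" for byte in raw)[:d].ljust(d, "0")
    return [int(c) for c in type_code + bits]

x = preprocess(b"19900101", type_code="000")  # "000": an assumed number-type code
assert len(x) == 803                          # m = d + b = 800 + 3
```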
s3: data analysis, namely calculating classification information of each group of sampling data according to the weight parameters estimated by the training model parameter estimation method;
For each group of input preprocessed sampled data, the total number of data bits is m, with m = d + b, where b is the bit length of the preset type mark. For each group of inputs, let $x_i$ denote the value of the i-th bit of the type-marked input data, taking the binary value 0 or 1, with $1 \le i \le m$ (the index i here refers only to the input data). The neural network training model has L training layers, where the first layer is the input layer and the last layer is connected to the output layer; the output layer is not counted in the total number of layers. Layer $l$ has $s_l$ activation items, $1 \le l \le L$, and the number of activation items in the last layer equals the expected total number of classes K, i.e. $s_1 = m$ and $s_L = K$. The weight parameter from the j-th item of layer $l$ to the i-th item of layer $l+1$ is $\theta_{ij}^{(l)}$, with $1 \le j \le s_l$ and $1 \le i \le s_{l+1}$.

The i-th input value of the first layer is $x_i^{(1)} = x_i$, and the i-th intermediate training value of the second layer is $x_i^{(2)} = g\big((\Theta_i^{(1)})^T X^{(1)}\big)$; in general, the i-th intermediate training value of layer $l+1$, $1 \le i \le s_{l+1}$, is

$$x_i^{(l+1)} = g\big((\Theta_i^{(l)})^T X^{(l)}\big),$$

where $g(\cdot)$ is the activation function of the model, $\Theta_i^{(l)} = \big[\theta_{i1}^{(l)}, \theta_{i2}^{(l)}, \ldots, \theta_{i s_l}^{(l)}\big]^T$ is the weight parameter vector carrying all the intermediate training values of layer $l$ to the i-th training value of layer $l+1$ (the superscript T denotes matrix transposition), and $X^{(l)} = \big[x_1^{(l)}, x_2^{(l)}, \ldots, x_{s_l}^{(l)}\big]^T$ is the vector of intermediate training values, or intermediate quantities, of layer $l$.

Computing layer by layer yields the final output values $x_i^{(L+1)}$, $1 \le i \le K$;
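This layer-by-layer computation can be sketched as follows; the sketch assumes NumPy, a logistic sigmoid as the activation function $g$ (suggested by the probability thresholds of step S4, though not stated explicitly here), and the layer-$l$ weights stored as one matrix whose i-th row is $\Theta_i^{(l)}$:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))       # assumed sigmoid activation

def forward(x, thetas):
    """x: length-m 0/1 input vector; thetas: list of L weight matrices,
    thetas[l-1] having shape (s_{l+1}, s_l)."""
    X = np.asarray(x, dtype=float)        # X^(1): input-layer values
    for Theta in thetas:                  # layers l = 1, ..., L
        X = g(Theta @ X)                  # x^(l+1) = g(Theta^(l) X^(l))
    return X                              # output values x^(L+1), length K
```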
S4: output the classification result, i.e. data label h when
Figure BDA0003778489240000043
When it is, take h i =0; when in use
Figure BDA0003778489240000044
Figure BDA0003778489240000045
When it is, take h i =1; when in use
Figure BDA0003778489240000046
When the data is abnormal, the data does not belong to the trained data category or the classification model needs to be retrained; wherein K 1 Is 0 probability error range, K 2 Is 1 probability error range; classification result h = [ h ] 1 ,h 2 ,…,h K ]The classification result is also called a subdivision data type relative to a preset type;
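The decision rule of step S4 admits a direct sketch; raising an exception for the abnormal band between $K_1$ and $K_2$ is a choice made for this example:

```python
def classify(outputs, K1=0.1, K2=0.9):
    h = []
    for i, o in enumerate(outputs):    # o = x_i^(L+1)
        if o <= K1:
            h.append(0)                # confidently not class i
        elif o >= K2:
            h.append(1)                # confidently class i
        else:
            # K1 < o < K2: abnormal data, either not a trained category
            # or the classification model needs retraining.
            raise ValueError(f"abnormal output {o:.3f} at position {i}")
    return h                           # classification result [h_1, ..., h_K]
```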
s5: and labeling and archiving the data, and labeling the original data unit corresponding to the sampling data by using the classification result h and then archiving and storing.
By repeating steps S1 to S5 for each group of original data units, all of the original data can be labeled.
Preferably, the training model self-learning estimation in step S0 includes a neural network training model parameter estimation method, which includes the following steps:
S01: acquire sample data; record the total number of samples as N. The data label of the n-th sample, $y^{(n)} = \big[y_1^{(n)}, y_2^{(n)}, \ldots, y_K^{(n)}\big]^T$, is known: when the sample belongs to the k-th class, $y_k^{(n)} = 1$, otherwise $y_k^{(n)} = 0$, where $1 \le n \le N$ and $1 \le k \le K$;
S02: sample data preprocessing, comprising sampling the sample data and adding a preset type mark, where sampling comprises intercepting the first d consecutive bits of a sample data unit; the sample data unit comprises any one or more of a data file, a document, a database, a same-type field of a data table (possibly of unknown category) and the like; adding the preset type mark comprises prepending a preset type identification code of several bits to the sampled bits;
S03: train the samples, calculating the training value of each layer of the neural network model.

For each group of input preprocessed sample data, record the number of data bits as m, with m = d + b, where b is the bit length of the preset type mark; $x_i$ is the value of the i-th bit of the type-marked input data, taking the binary value 0 or 1, $1 \le i \le m$ (the index i refers only to the input data). The neural network training model has L training layers, where the first layer is the input layer and the last layer is connected to the output layer (not counted in the total number of layers); layer $l$ has $s_l$ activation items, $1 \le l \le L$, and the number of activation items in the last layer equals the expected total number of classes K, i.e. $s_1 = m$ and $s_L = K$. The weight parameter from the j-th item of layer $l$ to the i-th item of layer $l+1$ is $\theta_{ij}^{(l)}$, $1 \le j \le s_l$, $1 \le i \le s_{l+1}$.

Input the sample data and the initial values of the weight parameters; the i-th input value of the first layer is $x_i^{(1)} = x_i$. The i-th intermediate value of layer $l+1$, $1 \le i \le s_{l+1}$, is

$$x_i^{(l+1)} = g\big((\Theta_i^{(l)})^T X^{(l)}\big),$$

where $\Theta_i^{(l)} = \big[\theta_{i1}^{(l)}, \ldots, \theta_{i s_l}^{(l)}\big]^T$ is the weight parameter vector carrying all the intermediate training values of layer $l$ to the i-th training value of layer $l+1$ (the superscript T denotes matrix transposition) and $X^{(l)} = \big[x_1^{(l)}, \ldots, x_{s_l}^{(l)}\big]^T$ is the vector of intermediate training values of layer $l$.

Compute layer by layer until the output values of the final layer, $x_i^{(L+1)}$, $1 \le i \le K$, are obtained, and record the training result of the n-th sample as $h^{(n)} = \big[h_1^{(n)}, h_2^{(n)}, \ldots, h_K^{(n)}\big]^T$, $1 \le n \le N$.

Repeat this step S03 for each group of input preprocessed sample data;
S04: calculate the cost function. For all training results and weight parameters, compute the cost function J(Θ), where Θ is the set of all $\theta_{ij}^{(l)}$:

$$J(\Theta) = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K}\Big[y_k^{(n)}\log h_k^{(n)} + \big(1-y_k^{(n)}\big)\log\big(1-h_k^{(n)}\big)\Big] + \frac{\lambda}{2N}\sum_{l=1}^{L}\sum_{i=1}^{s_{l+1}}\sum_{j=1}^{s_l}\big(\theta_{ij}^{(l)}\big)^2,$$

with $s_{L+1} = K$ for the output layer, where λ is the deviation penalty parameter, $0 \le \lambda \le 1$; λ is an empirical parameter and is generally small. The cost function is essentially the sum of the positive bias expectation and the negative bias expectation of the probabilities plus a bias penalty term;
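A sketch of this cost function follows (NumPy; the small constant eps guarding the logarithms is an addition for the example, and the default λ is the embodiment value used later in the description):

```python
import numpy as np

def cost(H, Y, thetas, lam=0.001):
    """H, Y: N x K arrays of training results h^(n) and sample labels y^(n);
    thetas: list of weight matrices theta^(l)."""
    N = H.shape[0]
    eps = 1e-12                                      # avoids log(0)
    data_term = -np.sum(Y * np.log(H + eps)
                        + (1.0 - Y) * np.log(1.0 - H + eps)) / N
    penalty = lam / (2.0 * N) * sum(np.sum(T ** 2) for T in thetas)
    return data_term + penalty                       # J(Theta)
```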
S05: weight parameter optimization. Optimize over all weight parameters $\theta_{ij}^{(l)}$ so that J(Θ) is minimal, i.e.

$$\Theta = \operatorname*{arg\,min}_{\Theta} J(\Theta).$$

The optimization comprises either the automatic optimization of general machine-learning software or a traversal search over weight parameter combinations; the traversal search takes, for each $\theta_{ij}^{(l)}$, values one by one over its value range at a specified step length, repeatedly executes steps S03 and S04, and takes the $\theta_{ij}^{(l)}$ that minimize J(Θ) as the optimization result.
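Read as a coordinate-wise sweep, the traversal search can be sketched as follows; the value range [-1, 1] and the step 0.1 are assumptions for the example, and a full search over all weight combinations at once would be computationally prohibitive, which is why the automatic optimization of general machine-learning software is the usual alternative:

```python
import numpy as np

def traverse(thetas, J, lo=-1.0, hi=1.0, step=0.1):
    """thetas: list of weight matrices, modified in place;
    J: callable re-running steps S03 and S04 for the current weights."""
    grid = np.arange(lo, hi + step / 2, step)
    for Theta in thetas:
        for idx in np.ndindex(*Theta.shape):   # every theta_ij^(l)
            best_v, best_J = Theta[idx], J(thetas)
            for v in grid:                     # values one by one
                Theta[idx] = v
                j = J(thetas)                  # repeat S03 + S04
                if j < best_J:
                    best_v, best_J = v, j
            Theta[idx] = best_v                # keep the minimizer
    return thetas
```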
In a second aspect, the government affair data labeling system comprises at least six modules, parameter estimation, data acquisition, data preprocessing, data analysis, label output and data labeling, connected in sequence, and the six modules respectively execute steps S0, S1, S2, S3, S4 and S5 in sequence. The government affair data labeling system is essentially a computer system.
Further, the parameter estimation module comprises five modules of sample data acquisition, sample data preprocessing, sample training, cost function calculation and weight parameter optimization, and the five modules respectively execute the steps S01, S02, S03, S04 and S05 in sequence.
The advantages and beneficial effects of the invention are as follows: the government affair data labeling system and method provide automatic classification and identification of mass government affair data of unknown category, which reduces a large amount of manual identification work and can even complete identification work that manual effort could not finish at all. After identification, each piece of original data carries a corresponding data type label, which facilitates subsequent data retrieval and professional applications. The method of the invention is also suitable for blind classification, blind identification and data type subdivision of other kinds of data information or other kinds of business systems.
Drawings
FIG. 1 is a flow chart of the government affair data labeling method;
FIG. 2 is a flow chart of the weight parameter optimization estimation method;
FIG. 3 is a schematic block diagram of the neural network training model;
FIG. 4 is a schematic block diagram of the government affair data labeling system.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Example 1: government affair data labeling method
As shown in fig. 1, in the government affair data labeling method the original data are sampled, a preset type mark is added, and the result is input into a training model; after classification by the training model, the data category of the original data is output. The data label comprises the data category output by the training model. Sampling the original data comprises intercepting one or more paragraphs of the original data. The preset types comprise numbers, Chinese characters, English characters, number-character mixtures, pictures, videos, texts and an "other" type, the "other" type covering everything except the number, Chinese character, English character, number-character mixture, picture, video and text types. The training model comprises a neural network model, and the government affair data labeling method comprises the following steps:
S0: training model weight parameter estimation, by either empirical estimation or training model self-learning estimation, where empirical estimation means manually estimating each weight parameter value of the model from personal experience. In this embodiment the weight parameters are estimated by self-learning with the neural network training model, and empirical values can be adopted as the initial values of the estimation, which speeds up the optimization;
S1: data acquisition, i.e. collecting business system data so as to obtain data of various types, where the business system data comprise streaming or non-streaming data, structured or unstructured data, document data, Internet data and the like;
S2: data preprocessing, comprising sampling the original data and adding a preset type mark, where sampling the original data comprises intercepting the first d consecutive bits of an original data unit; the original data unit comprises any one or more of data files, documents, databases, data tables, same-type fields of a data table (possibly of unknown category) and the like; adding the preset type mark comprises prepending a preset type identification code of several bits to the sampled bits. Since a typical data file runs to kilobytes, megabytes or more, feature recognition over all the data of a file would yield the most accurate feature information, but the computation and time required could be astronomical; the value of d therefore balances computational cost and efficiency against the effectiveness of feature extraction, and should be neither too long nor too short: too long and the computation becomes excessive, too short and the features of the original data unit may not be fully covered. A smaller d can still represent the data characteristics of part of the original data unit, but for data items with long format fields, or for image data, it may be difficult to extract valid features. The host of this embodiment is a workstation or a server with relatively strong computing power, and d = 800 bits is generally taken; the preset general types are numbers, Chinese characters, English characters, number-character mixtures, pictures, videos, texts and other types, identified in order by the 3-bit binary numbers 000, 001, 010, 011, 100, 101, 110 and 111, as tabulated in the sketch below;
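The eight preset types of this embodiment and their 3-bit identification codes can be written as a lookup table; the English key names are illustrative:

```python
# Preset types and 3-bit identification codes, in the order listed above.
PRESET_TYPE_CODES = {
    "number":               "000",
    "chinese_characters":   "001",
    "english_characters":   "010",
    "number_character_mix": "011",
    "picture":              "100",
    "video":                "101",
    "text":                 "110",
    "other":                "111",
}
```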
s3: data analysis, namely calculating classification information of each group of sampling data according to the weight parameters estimated by the training model parameter estimation method;
for each set of input pre-processed sample data, the total data bit number is m bits, m = d + b, where b is the bit number of the preset type flag, and this embodiment b =3, d =800, m =803; for each set of inputs, note x i Taking binary 0 or 1,1 ≤ i ≤ m (i is only for the input data) for the ith bit value of the input data with the preset type mark; the neural network training model has L training layers, wherein the first layer is an input layer, the last layer is connected with an output layer, the output layer of the neural network training model is not calculated into the total number of layers, and each layer has s l 1 ≦ L ≦ L, and the number of last layer activation items is consistent with the desired total number of classes K, i.e. s 1 =m,s L K (= K); this embodiment employs a 5-layer neural network training model (without calculating the output layer), i.e., L =5, and s 1 =s 2 =s 3 =s 4 =m=803,s 5 = K =20; the weight parameter of the j th bit of the training model to the i th bit of the l +1 th layer is
Figure BDA0003778489240000091
1≤j≤s l ,1≤i≤s l+1
First layer ith bit input value
Figure BDA0003778489240000092
Second layer ith intermediate training value
Figure BDA0003778489240000093
The ith intermediate training value of the l layer is
Figure BDA0003778489240000094
1≤i≤s l Is provided with
Figure BDA0003778489240000095
Wherein,
Figure BDA0003778489240000096
For the weight parameter vector when all the middle training values of the l-th layer are transferred to the i-th training value of the l + 1-th layer, the superscript T is matrix transposition,
Figure BDA0003778489240000097
X l is the model first layer middle training value, or called middle quantity,
Figure BDA0003778489240000098
calculating layer by layer to obtain the final output value
Figure BDA0003778489240000099
1≤i≤K;
S4: output the classification result, i.e. data label h when
Figure BDA00037784892400000910
When it is, take h i =0; when in use
Figure BDA00037784892400000911
Figure BDA00037784892400000912
When it is, take h i =1; when in use
Figure BDA00037784892400000913
When the data is abnormal, the data does not belong to the trained data category or the classification model needs to be retrained; wherein K 1 Is 0 probability error range, K 2 Is 1 probability error range; classification result h = [ h ] 1 ,h 2 ,…,h K ]The classification result is also called a subdivision data type relative to a preset type; if the preset digital type further comprises three categories of identity information, position information and statistical information after being subdivided, the subdivided data category mainly depends on possible classification requirements of actual data; this example takes K 1 =0.1,K 2 =0.9, the classification result h is a matrix of 1 × 20;
s5: and labeling and filing the data, labeling the original data units corresponding to the sampled data by using the classification result h, and filing and storing.
By repeating steps S1 to S5 for each group of original data units, all of the original data can be labeled.
Further, as shown in fig. 2 and fig. 3, the training model self-learning estimation in step S0 includes a neural network training model parameter estimation method, which includes the following steps:
S01: obtain sample data; record the total number of samples as N. The data label of the n-th sample, $y^{(n)} = \big[y_1^{(n)}, y_2^{(n)}, \ldots, y_K^{(n)}\big]^T$, is known: when the sample belongs to the k-th class, $y_k^{(n)} = 1$, otherwise $y_k^{(n)} = 0$, where $1 \le n \le N$ and $1 \le k \le K$;
S02: sample data preprocessing, comprising sampling the sample data and adding a preset type mark, where sampling comprises intercepting the first d consecutive bits of a sample data unit; the sample data unit comprises any one or more of a data file, a document, a database, a same-type field of a data table (possibly of unknown category) and the like; adding the preset type mark comprises prepending a preset type identification code of several bits to the sampled bits;
S03: train the samples, calculating the training value of each layer of the neural network model.

For each group of input preprocessed sample data, record the number of data bits as m, with m = d + b, where b is the bit length of the preset type mark; in this embodiment b = 3, d = 800 and m = 803. $x_i$ is the value of the i-th bit of the type-marked input data, taking the binary value 0 or 1, $1 \le i \le m$ (the index i refers only to the input data). The neural network training model has L training layers, where the first layer is the input layer and the last layer is connected to the output layer (not counted in the total number of layers); layer $l$ has $s_l$ activation items, $1 \le l \le L$, and the number of activation items in the last layer equals the expected total number of classes K, i.e. $s_1 = m$ and $s_L = K$. This embodiment employs a 5-layer neural network training model (not counting the output layer), i.e. L = 5, with $s_1 = s_2 = s_3 = s_4 = m = 803$ and $s_5 = K = 20$. The weight parameter from the j-th item of layer $l$ to the i-th item of layer $l+1$ is $\theta_{ij}^{(l)}$, $1 \le j \le s_l$, $1 \le i \le s_{l+1}$.

Input the sample data and the initial values of the weight parameters. In the first test all initial values are 0.5; in subsequent debugging, a group of weight parameter results from a previous test is used as the initial values, since that group of parameters is generally considered more reasonable, which speeds up testing and debugging. The i-th input value of the first layer is $x_i^{(1)} = x_i$, and the i-th intermediate value of layer $l+1$, $1 \le i \le s_{l+1}$, is

$$x_i^{(l+1)} = g\big((\Theta_i^{(l)})^T X^{(l)}\big),$$

where $\Theta_i^{(l)} = \big[\theta_{i1}^{(l)}, \ldots, \theta_{i s_l}^{(l)}\big]^T$ is the weight parameter vector carrying all the intermediate training values of layer $l$ to the i-th training value of layer $l+1$ (the superscript T denotes matrix transposition) and $X^{(l)} = \big[x_1^{(l)}, \ldots, x_{s_l}^{(l)}\big]^T$ is the vector of intermediate training values of layer $l$.

Compute layer by layer until the output values of the final layer, $x_i^{(L+1)}$, $1 \le i \le K$, are obtained, and record the training result of the n-th sample as $h^{(n)} = \big[h_1^{(n)}, h_2^{(n)}, \ldots, h_K^{(n)}\big]^T$, $1 \le n \le N$.

Repeat this step S03 for each group of input preprocessed sample data;
S04: calculate the cost function. For all training results and weight parameters, compute the cost function J(Θ), where Θ is the set of all $\theta_{ij}^{(l)}$:

$$J(\Theta) = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K}\Big[y_k^{(n)}\log h_k^{(n)} + \big(1-y_k^{(n)}\big)\log\big(1-h_k^{(n)}\big)\Big] + \frac{\lambda}{2N}\sum_{l=1}^{L}\sum_{i=1}^{s_{l+1}}\sum_{j=1}^{s_l}\big(\theta_{ij}^{(l)}\big)^2,$$

where λ is the deviation penalty parameter, $0 \le \lambda \le 1$; λ is an empirical parameter, generally small, and takes the value 0.001 in this embodiment. The cost function is essentially the sum of the positive bias expectation and the negative bias expectation of the probabilities plus a bias penalty term;
In this embodiment the total number of samples is N = 300,000, of which 200,000 are used for model training and 100,000 for testing the model after training; when the test results are unsatisfactory, the parameters are adjusted and training and testing are carried out again;
S05: weight parameter optimization. Optimize over all weight parameters $\theta_{ij}^{(l)}$ so that J(Θ) is minimal, i.e.

$$\Theta = \operatorname*{arg\,min}_{\Theta} J(\Theta).$$

The optimization comprises either the automatic optimization of general machine-learning software or a traversal search over weight parameter combinations; the traversal search takes, for each $\theta_{ij}^{(l)}$, values one by one over its value range at a specified step length, repeatedly executes steps S03 and S04, and takes the $\theta_{ij}^{(l)}$ that minimize J(Θ) as the optimization result. This embodiment adopts general neural network training software (such as the TensorFlow open-source machine learning platform), whose principle is shown in figs. 3 and 4; after the number of training layers and the activation items of each layer are defined, the software automatically performs machine learning and outputs the optimized weight parameter vector Θ and the value of the cost function J(Θ).
In this embodiment the weight parameters are estimated by self-learning with the neural network training model. In the actual estimation process the sample data are divided into two parts: one part is used to train and estimate the weight parameters, and the other to test the usability of the estimation result. All sample data are known in advance; generally, when the test yield reaches 90% or more, the classification method and system are practically usable. Test data whose automatic classification results disagree with the preset results generally need manual analysis to pinpoint the reasons, and the model parameters are adjusted if necessary. An inaccurate classification result may stem, on the one hand, from an insufficient number of sampling bits, which cannot clearly represent the features of the original data, and on the other hand from raw data that do not match the intended fine classification, or at least contain uncertain interfering items in their sampled bits.
After test training is completed, this embodiment is generally not retrained, because retraining may mean that earlier classification results have to be reclassified, which requires case-by-case analysis. To ensure the availability of the classification system, this embodiment requires a test yield of 99% or more, after which all parameters and weights are fixed. When special cases fail to classify, the data can be analyzed manually to find the reasons; barring special requirements, no parameters are changed other than a temporary adjustment of the number of sampling bits, and all cases whose cause remains unknown are classified as problem data.
Example 2: government affair data labeling system
As shown in fig. 4, a government affair data labeling system comprises at least six modules, parameter estimation, data acquisition, data preprocessing, data analysis, label output and data labeling, connected in sequence, and the six modules respectively execute steps S0, S1, S2, S3, S4 and S5 in sequence. The government affair data labeling system is essentially a computer system.
Further, as shown in fig. 2, the parameter estimation module includes five modules, i.e., sample data acquisition, sample data preprocessing, sample training, cost function calculation, and weight parameter optimization, and the five modules respectively execute steps S01, S02, S03, S04, and S05 in sequence.
The working principle of the system is shown in fig. 4: sample data are used for model training, and after the training result is obtained the model is applied to the actual data.
Regarding the method and system for automatically classifying government affair data in embodiments 1 and 2: in practical applications the estimated weight parameters are generally not adjusted again. New model training to re-estimate the weight parameters is mainly needed when a new general class of original data is to be subdivided, differing in class from the original government affair data and possibly in subdivision criteria as well. In that case not only must model training be performed again, but parameters such as the number of sampling bits, the preset types and the expected total number of classes must also be reset. The method of the invention can thus be applied to the automatic subdivision of other kinds of data information.
The basic principle of the invention is as follows: determine the expected number of data labels from the total number of fine categories the government affair data may require; preset general data types according to the possible informatized representations of the original data units, such as numbers, Chinese characters, English characters, number-character mixtures, pictures, videos and texts; collect clearly classified original data as training samples, train the model and obtain reasonable weight parameters; classify and identify the original data units to be classified with the trained model, label them and archive them.
The invention is applied as follows: data are classified by the model so that labels can be established for normal classifications, ensuring data usability; data that cannot be matched are analyzed in depth and fuzzy matching is attempted; duplicate data can be cleaned; dirty or polluted data are archived, the related business system is notified, and a second attempt is made to obtain standard business data; data that still cannot be matched are finally archived for future reference. The system background can perform statistical analysis of the matched label data, change monitoring and data interface services, enabling intelligent data retrieval and supervision. Data with unsatisfactory classification results can either be classified directly as problem data or into the other category, or be analyzed manually, with the model retrained when necessary. For other kinds of data information, or data generated by other kinds of business systems, the method can be applied to subdivide the data categories further; mainly, the model parameters need corresponding design, covering elements such as the number of sampling bits, the preset types and the expected total number of classes.
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and improvements, including to the parameter selection method and the initialization method, without departing from the technical principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (4)

1. A government affair data labeling method, characterized in that original data are sampled, a preset type mark is added, and the result is input into a training model; after classification by the training model, the data category of the original data is output; the data label comprises the data category output by the training model; sampling the original data comprises intercepting one or more paragraphs of the original data; the preset types comprise numbers, Chinese characters, English characters, number-character mixtures, pictures, videos and texts; the training model comprises a neural network model, and the government affair data labeling method comprises the following steps:
S0: training model weight parameter estimation, by either empirical estimation or training model self-learning estimation, where empirical estimation means manually estimating each weight parameter value of the model from personal experience;

S1: data acquisition, i.e. collecting business system data, where the business system data comprise streaming or non-streaming data, structured or unstructured data, document data and Internet data;

S2: data preprocessing, comprising sampling the original data and adding a preset type mark, where sampling the original data comprises intercepting the first d consecutive bits of an original data unit; the original data unit comprises any one or more of data files, documents, databases, data tables and same-type fields of a data table; adding the preset type mark comprises prepending a preset type identification code of several bits to the sampled bits;
s3: data analysis, namely calculating classification information of each group of sampling data according to the weight parameters estimated by the training model parameter estimation method;
for each group of input preprocessed sampled data, the total number of data bits is m, with m = d + b, where b is the bit length of the preset type mark; for each group of inputs, $x_i$ denotes the value of the i-th bit of the type-marked input data; the neural network training model has L training layers, where the first layer is the input layer and the last layer is connected to the output layer; layer $l$ has $s_l$ activation items, $1 \le l \le L$, and the number of activation items in the last layer equals the expected total number of classes K, i.e. $s_1 = m$ and $s_L = K$; the weight parameter from the j-th item of layer $l$ to the i-th item of layer $l+1$ is $\theta_{ij}^{(l)}$, $1 \le j \le s_l$, $1 \le i \le s_{l+1}$;

the i-th input value of the first layer is $x_i^{(1)} = x_i$, the i-th intermediate training value of the second layer is $x_i^{(2)} = g\big((\Theta_i^{(1)})^T X^{(1)}\big)$, and the i-th intermediate training value of layer $l+1$ is

$$x_i^{(l+1)} = g\big((\Theta_i^{(l)})^T X^{(l)}\big),$$

where $g(\cdot)$ is the activation function of the model, $\Theta_i^{(l)} = \big[\theta_{i1}^{(l)}, \ldots, \theta_{i s_l}^{(l)}\big]^T$ is the weight parameter vector when all the intermediate training values of layer $l$ are carried to the i-th training value of layer $l+1$, the superscript T being matrix transposition, and $X^{(l)} = \big[x_1^{(l)}, \ldots, x_{s_l}^{(l)}\big]^T$ is the vector of intermediate training values of layer $l$;

computing layer by layer yields the final output values $x_i^{(L+1)}$, $1 \le i \le K$;
S4: output the classification result, i.e. data label h, when
Figure FDA0003778489230000029
When it is, take h i =0; when in use
Figure FDA00037784892300000210
When it is, take h i =1; when the temperature is higher than the set temperature
Figure FDA00037784892300000211
When the data is abnormal, the data does not belong to the trained data category or the classification model needs to be retrained; wherein K 1 Is 0 probability error range, K 2 Is 1 probability error range; classification result h = [ h = 1 ,h 2 ,…,h K ];
S5: and labeling and archiving the data, and labeling the original data unit corresponding to the sampling data by using the classification result h and then archiving and storing.
2. The government affair data labeling method according to claim 1, characterized in that the training model self-learning estimation in step S0 comprises a neural network training model parameter estimation method, comprising the following steps:
S01: acquire sample data; record the total number of samples as N; the data label of the n-th sample, $y^{(n)} = \big[y_1^{(n)}, y_2^{(n)}, \ldots, y_K^{(n)}\big]^T$, is known: when the sample belongs to the k-th class, $y_k^{(n)} = 1$, otherwise $y_k^{(n)} = 0$, where $1 \le n \le N$ and $1 \le k \le K$;
S02: sample data preprocessing, comprising sampling the sample data and adding a preset type mark, where sampling comprises intercepting the first d consecutive bits of a sample data unit; the sample data unit comprises any one or more of data files, documents, databases, data tables and same-type fields of a data table; adding the preset type mark comprises prepending a preset type identification code of several bits to the sampled bits;
S03: train the samples, calculating the training value of each layer of the neural network model:

for each group of input preprocessed sample data, record the number of data bits as m, with m = d + b, where b is the bit length of the preset type mark; $x_i$ is the value of the i-th bit of the type-marked input data; the neural network training model has L training layers, where the first layer is the input layer and the last layer is connected to the output layer; layer $l$ has $s_l$ activation items, $1 \le l \le L$, and the number of activation items in the last layer equals the expected total number of classes K, i.e. $s_1 = m$ and $s_L = K$; the weight parameter from the j-th item of layer $l$ to the i-th item of layer $l+1$ is $\theta_{ij}^{(l)}$, $1 \le j \le s_l$, $1 \le i \le s_{l+1}$;

input the sample data and the initial values of the weight parameters; the i-th input value of the first layer is $x_i^{(1)} = x_i$, and the i-th intermediate value of layer $l+1$ is

$$x_i^{(l+1)} = g\big((\Theta_i^{(l)})^T X^{(l)}\big),$$

where $\Theta_i^{(l)} = \big[\theta_{i1}^{(l)}, \ldots, \theta_{i s_l}^{(l)}\big]^T$ is the weight parameter vector when all the intermediate training values of layer $l$ are carried to the i-th training value of layer $l+1$, the superscript T being matrix transposition, and $X^{(l)} = \big[x_1^{(l)}, \ldots, x_{s_l}^{(l)}\big]^T$ is the vector of intermediate training values of layer $l$;

compute layer by layer until the output values of the final layer, $x_i^{(L+1)}$, $1 \le i \le K$, are obtained, and record the training result of the n-th sample as $h^{(n)} = \big[h_1^{(n)}, h_2^{(n)}, \ldots, h_K^{(n)}\big]^T$, $1 \le n \le N$;

repeat this step S03 for each group of input preprocessed sample data;
S04: calculate the cost function; for all training results and weight parameters, compute the cost function J(Θ), where Θ is the set of all $\theta_{ij}^{(l)}$:

$$J(\Theta) = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K}\Big[y_k^{(n)}\log h_k^{(n)} + \big(1-y_k^{(n)}\big)\log\big(1-h_k^{(n)}\big)\Big] + \frac{\lambda}{2N}\sum_{l=1}^{L}\sum_{i=1}^{s_{l+1}}\sum_{j=1}^{s_l}\big(\theta_{ij}^{(l)}\big)^2,$$

where λ is the deviation penalty parameter, $0 \le \lambda \le 1$;
S05: weight parameter optimization: optimize over all weight parameters $\theta_{ij}^{(l)}$ so that J(Θ) is minimal, i.e.

$$\Theta = \operatorname*{arg\,min}_{\Theta} J(\Theta);$$

the optimization comprises either the automatic optimization of general machine-learning software or a traversal search over weight parameter combinations; the traversal search takes, for each $\theta_{ij}^{(l)}$, values one by one over its value range at a specified step length, repeatedly executes steps S03 and S04, and takes the $\theta_{ij}^{(l)}$ that minimize J(Θ) as the optimization result.
3. A government affair data labeling system, comprising at least six modules of parameter estimation, data acquisition, data preprocessing, data analysis, label output and data labeling connected in sequence, wherein the six modules respectively execute steps S0, S1, S2, S3, S4 and S5 of claim 1 in sequence.
4. The system according to claim 3, wherein the parameter estimation module comprises five modules of sample data acquisition, sample data preprocessing, sample training, cost function calculation and weight parameter optimization, and the five modules respectively execute steps S01, S02, S03, S04 and S05 of claim 2 in sequence.
CN202210922987.7A 2022-08-02 2022-08-02 Government affair data labeling system and method Active CN115357769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210922987.7A CN115357769B (en) 2022-08-02 2022-08-02 Government affair data labeling system and method


Publications (2)

Publication Number Publication Date
CN115357769A (en) 2022-11-18
CN115357769B CN115357769B (en) 2023-07-04

Family

ID=84033026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210922987.7A Active CN115357769B (en) 2022-08-02 2022-08-02 Government affair data labeling system and method

Country Status (1)

Country Link
CN (1) CN115357769B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060195782A1 (en) * 2005-02-28 2006-08-31 Microsoft Corporation Method and system for classifying and displaying tables of information
CN110110756A (en) * 2019-04-09 2019-08-09 北京中科智营科技发展有限公司 A kind of data classification optimization method and optimization device
US20210287089A1 (en) * 2020-03-14 2021-09-16 DataRobot, Inc. Automated and adaptive design and training of neural networks
US20210306200A1 (en) * 2015-01-27 2021-09-30 Moogsoft Inc. System for decomposing events and unstructured data
CN113505222A (en) * 2021-06-21 2021-10-15 山东师范大学 Government affair text classification method and system based on text circulation neural network
CN114091472A (en) * 2022-01-20 2022-02-25 北京零点远景网络科技有限公司 Training method of multi-label classification model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡甜甜 (HU Tiantian): "Research and Application of Knowledge Extraction for the Government Affairs Domain", China Excellent Master's Theses Full-text Database, Social Sciences Series I

Also Published As

Publication number Publication date
CN115357769B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN109543084B (en) Method for establishing detection model of hidden sensitive text facing network social media
CN109165294B (en) Short text classification method based on Bayesian classification
CN111222305B (en) Information structuring method and device
Adams et al. Automating image matching, cataloging, and analysis for photo-identification research
CN108573031A (en) A kind of complaint sorting technique and system based on content
CN109492230B (en) Method for extracting insurance contract key information based on interested text field convolutional neural network
CN112163424A (en) Data labeling method, device, equipment and medium
CN112560478A (en) Chinese address RoBERTA-BilSTM-CRF coupling analysis method using semantic annotation
CN115293131B (en) Data matching method, device, equipment and storage medium
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN113515600B (en) Automatic calculation method for spatial analysis based on metadata
CN115759104B (en) Financial domain public opinion analysis method and system based on entity identification
CN114416979A (en) Text query method, text query equipment and storage medium
CN115659226A (en) Data processing system for acquiring APP label
CN114297987A (en) Document information extraction method and system based on text classification and reading understanding
CN110705384B (en) Vehicle re-identification method based on cross-domain migration enhanced representation
CN109783483A (en) A kind of method, apparatus of data preparation, computer storage medium and terminal
CN115357769B (en) Government affair data labeling system and method
CN111966640A (en) Document file identification method and system
CN116434273A (en) Multi-label prediction method and system based on single positive label
CN115599885A (en) Document full-text retrieval method and device, computer equipment, storage medium and product
CN114722183A (en) Knowledge pushing method and system for scientific research tasks
CN113254612A (en) Knowledge question-answering processing method, device, equipment and storage medium
CN112818215A (en) Product data processing method, device, equipment and storage medium
CN113297845B (en) Resume block classification method based on multi-level bidirectional circulation neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant