CN115357769A - Government affair data labeling system and method - Google Patents
- Publication number
- CN115357769A (application CN202210922987.7A)
- Authority
- CN
- China
- Legal status: Granted (assumption; not a legal conclusion)
Classifications
- G06F16/906—Information retrieval; Clustering; Classification
- G06N3/08—Neural networks; Learning methods
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a government affair data labeling system and method. With the development of the Internet of Things and big data technology, the amount of government affair data keeps growing, and the data formats and types produced by different business systems differ, so labeling can hardly be completed by manual identification alone. According to the invention, the original data are sampled, preset type marks are added, and the data are then input into a training model, which classifies them and outputs their data categories. Automatic classification and identification of mass data can thus be realized, establishing a foundation for subsequent data retrieval applications. The method of the invention is also suitable for blind classification, blind identification and data type subdivision of other kinds of data information or other kinds of business systems.
Description
Technical Field
The invention relates to the technical field of the Internet of Things and big data, and in particular to a government affair data labeling system and method.
Background
With the development of the Internet of Things and big data technology, data types and data volumes keep growing, and automatic classification of unknown data becomes increasingly important. Traditional data classification mainly standardizes data through interface integration, database integration and manual identification. These methods are inefficient and highly intrusive to the original business systems, and unified standardization across the board is difficult to achieve: each business system tends to keep its own standards, ad-hoc conversion is needed whenever systems must interoperate, and effective global data governance is impossible.
For government affair data, the number of actual data types is not large. However, because each government affair information system is developed independently, the same government affair data are labeled with different identifiers, different standards and different formats in different places, different periods and different systems. For example, for the ID-card number field, system A uses the field name sfz, system B uses user ID, and system C uses old ID number; likewise, the same date is written in different formats by systems A, B and C. Uniform classification marks are lacking. When a system D needs related information, unless the representation conventions of the other systems are known precisely, useful information can only be found by checking them one by one, and it cannot be determined whether similar information exists in other systems; a means of blind identification and automatic identification of unknown data is therefore lacking. In addition, with the development of information technology, mass data lack identification: the workload of relying entirely on manual identification is enormous, and a completely unified standard is difficult to achieve.
Disclosure of Invention
In view of the above, the present invention provides a government affair data labeling system and method that can effectively solve the above-mentioned problems in the prior art.
In the government affair data labeling system and method, the original data are sampled, preset type marks are added, and the data are then input into a training model, which classifies them and outputs the data categories of the original data. The data label comprises the data category output by the training model. Sampling the original data comprises intercepting one or more segments of the original data. The preset types comprise number, Chinese character, English character, mixed number-and-character, picture, video, text, and other, where "other" covers everything except the number, Chinese character, English character, mixed number-and-character, picture, video and text types. The training model comprises a neural network model, and the government affair data labeling method comprises the following steps:
s0: training model weight parameter estimation, by empirical estimation or by self-learning estimation of the training model; empirical estimation means manually estimating each weight parameter value of the model from personal experience. This step is completed before step S1 is executed, but it need not be executed every time before S1: once the weight parameters have been estimated, they can support multiple subsequent labeling runs, and all that is required before labeling is a suitable set of weight parameter estimates;
s1: data acquisition, namely collecting service system data to acquire various types of data, wherein the service system data comprises streaming or non-streaming data, structured or non-structured data, document data, internet data and the like;
s2: data preprocessing, comprising sampling the original data and adding a preset type mark; sampling the original data comprises intercepting the first d consecutive bits of an original data unit. The original data unit comprises any one or more of a data file, a document, a database, a data table, a homogeneous data table field (possibly of unknown category) and the like. Adding the preset type mark comprises prepending a preset type identification code of several bits to the sampled bits; for example, 8 preset types can be represented by 3 binary bits. Since a typical data file is kilobytes, megabytes or even larger, performing feature recognition on all the data of a file would yield the most accurate feature information, but the computation and time required could be astronomical. The value of d is therefore chosen by weighing computation cost and efficiency against the effectiveness of feature extraction: it should be neither too long nor too short. If d is too long, the computation becomes excessive; if d is too short, the features of the original data unit may not be fully covered. Although a smaller d can still represent the data characteristics of part of the original data unit, it may be difficult to extract valid features from data items or image data with longer format fields;
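As an illustration of the sampling and marking in step S2, the following sketch intercepts the first d bits of a raw unit and prepends the type mark. The byte-to-bit convention, the zero-padding of short units, and the example code "000" are assumptions for illustration only:

```python
def preprocess(raw: bytes, type_code: str, d: int = 800) -> list[int]:
    """Step S2 sketch: take the first d consecutive bits of a raw data unit
    and prepend the b-bit preset type identification code."""
    bits = []
    for byte in raw:
        for k in range(7, -1, -1):          # most-significant bit first (an assumption)
            bits.append((byte >> k) & 1)
        if len(bits) >= d:
            break
    bits = bits[:d] + [0] * max(0, d - len(bits))   # truncate or zero-pad to d bits
    return [int(c) for c in type_code] + bits       # total length m = b + d

x = preprocess(b"410105199001011234", "000")  # a digit string, preset code 000
print(len(x))  # 803 when b = 3 and d = 800
```

The returned vector is exactly the m = d + b bit group that steps S3 and S03 take as input.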
s3: data analysis, namely calculating classification information of each group of sampling data according to the weight parameters estimated by the training model parameter estimation method;
For each group of input preprocessed sampling data, the total number of data bits is m, with m = d + b, where b is the bit length of the preset type mark. For each input group, let x_i denote the i-th bit value of the type-marked input data, taking binary 0 or 1, with 1 ≤ i ≤ m (this i indexes only the input data). The neural network training model has L training layers; the first layer is the input layer, the last layer feeds the output layer, and the output layer is not counted in the total number of layers. Layer l has s_l activation units, 1 ≤ l ≤ L, and the number of activation units in the last layer equals the expected total number of classes K, i.e. s_1 = m and s_L = K. The weight parameter from the j-th unit of layer l to the i-th unit of layer l+1 is θ_ij^(l), with 1 ≤ j ≤ s_l and 1 ≤ i ≤ s_{l+1}.

The i-th input value of the first layer is a_i^(1) = x_i. The i-th intermediate training value of layer l+1, for 1 ≤ i ≤ s_{l+1}, is

a_i^(l+1) = g((Θ_i^(l))^T X^(l)),

where Θ_i^(l) = [θ_i1^(l), θ_i2^(l), …, θ_{i,s_l}^(l)]^T is the weight parameter vector from all intermediate training values of layer l to the i-th training value of layer l+1, the superscript T denotes matrix transposition, X^(l) = [a_1^(l), a_2^(l), …, a_{s_l}^(l)]^T is the vector of intermediate training values (intermediate quantities) of layer l, and g(·) is the activation function, for example the sigmoid g(z) = 1/(1 + e^{-z}), consistent with the probability interpretation of the outputs in step S4;
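The layer-by-layer computation of step S3 can be sketched with NumPy as follows; the sigmoid activation is an assumption (chosen because step S4 interprets the outputs as probabilities), and the weight matrices are supplied externally:

```python
import numpy as np

def forward(x, thetas):
    """Step S3 sketch: compute a^(l+1) = g(Theta^(l) X^(l)) layer by layer.
    thetas[l] has shape (s_{l+1}, s_l); g is assumed to be the sigmoid."""
    g = lambda z: 1.0 / (1.0 + np.exp(-z))
    a = np.asarray(x, dtype=float)          # first-layer values a^(1) = x
    for theta in thetas:
        a = g(theta @ a)                    # intermediate training values of next layer
    return a                                # final-layer values, length K
```

With zero weight matrices every output is g(0) = 0.5, which is a convenient sanity check.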
S4: output the classification result, i.e. the data label h. Writing a_i^(L+1) for the i-th output value: when a_i^(L+1) ≤ K_1, take h_i = 0; when a_i^(L+1) ≥ K_2, take h_i = 1; when K_1 < a_i^(L+1) < K_2, the data are abnormal: they do not belong to any trained data category, or the classification model needs to be retrained. Here K_1 is the error range for probability 0 and K_2 is the error range for probability 1. The classification result is h = [h_1, h_2, …, h_K]; relative to the preset types, the classification result is also called the subdivided data type;
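A minimal sketch of the S4 thresholding follows; the function name and the `None` marker for the abnormal band are illustrative choices, and the default K_1 = 0.1 and K_2 = 0.9 are the values used later in embodiment 1:

```python
def classify(outputs, k1=0.1, k2=0.9):
    """Step S4 sketch: map final-layer output values to the data label h.
    Values in (k1, k2) are abnormal: untrained category or retraining needed."""
    h, abnormal = [], []
    for i, p in enumerate(outputs):
        if p <= k1:
            h.append(0)
        elif p >= k2:
            h.append(1)
        else:
            h.append(None)                  # K1 < p < K2: abnormal output
            abnormal.append(i)
    return h, abnormal

print(classify([0.05, 0.97, 0.5]))  # ([0, 1, None], [2])
```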
s5: data labeling and archiving; label the original data unit corresponding to the sampled data with the classification result h, then archive and store it.
By repeating steps S1 to S5 for each group of original data units, all the original data can be labeled.
Preferably, the training model self-learning estimation in step S0 includes a neural network training model parameter estimation method, which includes the following steps:
s01: acquiring sample data. Denote the total number of samples by N; the data label y^(n) of the n-th sample is known: when the sample belongs to the k-th class, y_k^(n) = 1, otherwise y_k^(n) = 0, where 1 ≤ n ≤ N and 1 ≤ k ≤ K;
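The label convention of S01 can be written as a one-line helper; the name `one_hot` and the 1-based class index are illustrative:

```python
def one_hot(k: int, K: int) -> list[int]:
    """Step S01 label: y_k = 1 when the sample belongs to class k (1-indexed), else 0."""
    return [1 if j == k else 0 for j in range(1, K + 1)]

print(one_hot(3, 5))  # [0, 0, 1, 0, 0]
```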
s02: sample data preprocessing, comprising sampling the sample data and adding a preset type mark; sampling comprises intercepting the first d consecutive bits of a sample data unit. The sample data unit comprises any one or more of a data file, a document, a database, a homogeneous data table field (possibly of unknown category) and the like; adding the preset type mark comprises prepending a preset type identification code of several bits to the sampled bits;
s03: training the samples, calculating the training values of each layer of the neural network model.

For each group of input preprocessed sample sampling data, the number of data bits is m, with m = d + b, where b is the bit length of the preset type mark; x_i is the i-th bit value of the type-marked input data, taking binary 0 or 1, 1 ≤ i ≤ m (this i indexes only the input data). The neural network training model has L training layers; the first layer is the input layer, the last layer feeds the output layer, and the output layer is not counted in the total number of layers. Layer l has s_l activation units, 1 ≤ l ≤ L, and the number of activation units in the last layer equals the expected total number of classes K, i.e. s_1 = m and s_L = K. The weight parameter from the j-th unit of layer l to the i-th unit of layer l+1 is θ_ij^(l), with 1 ≤ j ≤ s_l and 1 ≤ i ≤ s_{l+1}.

Input the sample data and the initial values of the weight parameters, and compute, for 1 ≤ i ≤ s_{l+1},

a_i^(l+1) = g((Θ_i^(l))^T X^(l)),

where Θ_i^(l) = [θ_i1^(l), …, θ_{i,s_l}^(l)]^T is the weight parameter vector from all intermediate training values of layer l to the i-th training value of layer l+1, the superscript T denotes matrix transposition, and X^(l) = [a_1^(l), …, a_{s_l}^(l)]^T is the vector of intermediate training values (intermediate quantities) of layer l.

Calculate layer by layer until the output values of the final layer, h_i^(n) with 1 ≤ i ≤ K, are obtained; record the training result of the n-th sample as h^(n) = [h_1^(n), h_2^(n), …, h_K^(n)], 1 ≤ n ≤ N.

Repeat step S03 for each group of input preprocessed sample data;
s04: calculating the cost function. Compute the cost function J(Θ) over all training results and weight parameters, where Θ is the set of all θ_ij^(l), h_k^(n) is the k-th training result of the n-th sample and y_k^(n) its label. Here λ is a deviation penalty parameter with 0 ≤ λ ≤ 1; λ is an empirical parameter and is generally small. The cost function is essentially the sum of the positive bias expectation and the negative bias expectation of the probabilities plus a bias penalty term, for example the regularized cross-entropy form

J(Θ) = −(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} [ y_k^(n) log h_k^(n) + (1 − y_k^(n)) log(1 − h_k^(n)) ] + (λ/2N) Σ_l Σ_i Σ_j (θ_ij^(l))²;
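Under a regularized cross-entropy reading of S04 (an assumption; the step describes the cost only verbally as bias expectations plus a penalty), the cost can be computed as:

```python
import numpy as np

def cost(H, Y, thetas, lam=0.001):
    """Step S04 sketch: positive and negative bias expectations of the
    probabilities plus a bias penalty on the weights (lam = lambda)."""
    N = len(H)
    H = np.clip(np.asarray(H, dtype=float), 1e-12, 1 - 1e-12)  # avoid log(0)
    Y = np.asarray(Y, dtype=float)
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / N
    penalty = lam / (2 * N) * sum(np.sum(t ** 2) for t in thetas)
    return data_term + penalty
```

A perfect prediction drives the data term to zero, and larger weights raise the penalty term in proportion to λ.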
s05: weight parameter optimization. Optimize J(Θ) over all weight parameters θ_ij^(l) so that J(Θ) is minimal, i.e.

Θ* = argmin_Θ J(Θ).

The optimization comprises automatic optimization by general machine-learning software, or traversal optimization over weight parameter combinations. Traversal optimization means stepping each θ_ij^(l) through its value range one value at a time with a specified step length, repeating steps S03 and S04, and taking the θ_ij^(l) that minimize J(Θ) as the optimization result.
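The traversal optimization of S05 can be sketched as an exhaustive grid search. It is shown only to make the step concrete: it is tractable for toy parameter counts only, which is why the embodiment relies on machine-learning software instead; the function name and argument layout are illustrative:

```python
import itertools
import numpy as np

def traverse_optimize(shapes, values, evaluate):
    """Step S05 sketch: step every weight through the candidate `values`,
    re-evaluate the cost (the S03 + S04 pipeline), keep the minimizing combination."""
    sizes = [int(np.prod(s)) for s in shapes]
    best_j, best_thetas = float("inf"), None
    for flat in itertools.product(values, repeat=sum(sizes)):
        thetas, k = [], 0
        for shape, n in zip(shapes, sizes):
            thetas.append(np.asarray(flat[k:k + n]).reshape(shape))
            k += n
        j = evaluate(thetas)
        if j < best_j:
            best_j, best_thetas = j, thetas
    return best_thetas, best_j
```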
In a second aspect, the government affair data labeling system at least comprises six modules, namely parameter estimation, data acquisition, data preprocessing, data analysis, label output and data labeling, connected in sequence; the six modules execute steps S0, S1, S2, S3, S4 and S5 respectively and in sequence. The government affair data labeling system is essentially a computer system.
Further, the parameter estimation module comprises five modules of sample data acquisition, sample data preprocessing, sample training, cost function calculation and weight parameter optimization, and the five modules respectively execute the steps S01, S02, S03, S04 and S05 in sequence.
The advantages and beneficial effects of the invention are as follows: the government affair data labeling system and method provide automatic classification and identification for unknown-category, mass government affair data, greatly reducing the workload of manual identification and even completing identification work that could not be done manually at all. After identification, each piece of original data carries a corresponding data type label, which facilitates subsequent data retrieval and professional applications. The method of the invention is also suitable for blind classification, blind identification and data type subdivision of other kinds of data information or other kinds of business systems.
Drawings
FIG. 1 is a flow chart of a method of labeling government data;
FIG. 2 is a flow chart of a method for weight parameter optimization estimation;
FIG. 3 is a schematic block diagram of a neural network training model;
FIG. 4 is a schematic block diagram of a government affair data tagging system.
Detailed Description
The following describes embodiments of the present invention with reference to the accompanying drawings. The following examples are intended only to illustrate the technical solutions of the present invention more clearly, and do not thereby limit the protection scope of the present invention.
Example 1: government affair data labeling method
As shown in fig. 1, in the government affair data labeling method, the original data are sampled, a preset type mark is added, and the data are then input into a training model, which classifies them and outputs the data categories of the original data. The data label comprises the data category output by the training model; sampling the original data comprises intercepting one or more segments of the original data. The preset types comprise number, Chinese character, English character, mixed number-and-character, picture, video, text, and other, where "other" covers everything except the number, Chinese character, English character, mixed number-and-character, picture, video and text types. The training model comprises a neural network model, and the government affair data labeling method comprises the following steps:
s0: training model weight parameter estimation, by empirical estimation or by self-learning estimation of the training model; empirical estimation means manually estimating each weight parameter value of the model from personal experience. In this embodiment the weight parameters are estimated by self-learning of the neural network training model, and empirical values can be used as the initial values of the estimation, which accelerates convergence of the optimization;
s1: data acquisition, namely collecting service system data to acquire various types of data, wherein the service system data comprises streaming or non-streaming data, structured or non-structured data, document data, internet data and the like;
s2: data preprocessing, comprising sampling the original data and adding a preset type mark; sampling comprises intercepting the first d consecutive bits of an original data unit. The original data unit comprises any one or more of a data file, a document, a database, a data table, a homogeneous data table field (possibly of unknown category) and the like; adding the preset type mark comprises prepending a preset type identification code of several bits to the sampled bits. Since a typical data file is kilobytes, megabytes or even larger, performing feature recognition on all the data of a file would yield the most accurate feature information, but the computation and time required could be astronomical. The value of d is therefore chosen by weighing computation cost and efficiency against the effectiveness of feature extraction: it should be neither too long nor too short. If d is too long, the computation becomes excessive; if d is too short, the features of the original data unit cannot be fully covered. Although a smaller d can still represent the data characteristics of part of the original data unit, it may be difficult to extract valid features from data items or image data with longer format fields. The host computer of this embodiment is a workstation or server with strong computing capability, and d = 800 bits is generally taken. The general preset types, namely number, Chinese character, English character, mixed number-and-character, picture, video, text and other, are identified in sequence by the 3-bit binary codes 000, 001, 010, 011, 100, 101, 110 and 111;
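The embodiment's type-code table can be captured directly as a mapping; the English key names are illustrative translations of the preset type names:

```python
PRESET_TYPES = {            # 3-bit identification codes from the embodiment
    "number": "000",
    "chinese_character": "001",
    "english_character": "010",
    "number_character_mix": "011",
    "picture": "100",
    "video": "101",
    "text": "110",
    "other": "111",
}

print(PRESET_TYPES["number"])  # 000
```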
s3: data analysis, namely calculating classification information of each group of sampling data according to the weight parameters estimated by the training model parameter estimation method;
For each group of input preprocessed sampling data, the total number of data bits is m, with m = d + b, where b is the bit length of the preset type mark; in this embodiment b = 3, d = 800 and m = 803. For each input group, let x_i denote the i-th bit value of the type-marked input data, taking binary 0 or 1, 1 ≤ i ≤ m (this i indexes only the input data). The neural network training model has L training layers; the first layer is the input layer, the last layer feeds the output layer, and the output layer is not counted in the total number of layers. Layer l has s_l activation units, 1 ≤ l ≤ L, and the number of activation units in the last layer equals the expected total number of classes K, i.e. s_1 = m and s_L = K. This embodiment employs a 5-layer neural network training model (not counting the output layer), i.e. L = 5, with s_1 = s_2 = s_3 = s_4 = m = 803 and s_5 = K = 20. The weight parameter from the j-th unit of layer l to the i-th unit of layer l+1 is θ_ij^(l), with 1 ≤ j ≤ s_l and 1 ≤ i ≤ s_{l+1}.

The i-th input value of the first layer is a_i^(1) = x_i. The i-th intermediate training value of layer l+1, for 1 ≤ i ≤ s_{l+1}, is

a_i^(l+1) = g((Θ_i^(l))^T X^(l)),

where Θ_i^(l) = [θ_i1^(l), …, θ_{i,s_l}^(l)]^T is the weight parameter vector from all intermediate training values of layer l to the i-th training value of layer l+1, the superscript T denotes matrix transposition, X^(l) = [a_1^(l), …, a_{s_l}^(l)]^T is the vector of intermediate training values (intermediate quantities) of layer l, and g(·) is the activation function, for example the sigmoid g(z) = 1/(1 + e^{-z});
S4: output the classification result, i.e. the data label h. Writing a_i^(L+1) for the i-th output value: when a_i^(L+1) ≤ K_1, take h_i = 0; when a_i^(L+1) ≥ K_2, take h_i = 1; when K_1 < a_i^(L+1) < K_2, the data are abnormal: they do not belong to any trained data category, or the classification model needs to be retrained. Here K_1 is the error range for probability 0 and K_2 is the error range for probability 1. The classification result is h = [h_1, h_2, …, h_K]; relative to the preset types, the classification result is also called the subdivided data type. For example, the preset number type may be further subdivided into the three categories of identity information, position information and statistical information; the subdivided data categories mainly depend on the possible classification requirements of the actual data. This embodiment takes K_1 = 0.1 and K_2 = 0.9, and the classification result h is a 1 × 20 matrix;
s5: data labeling and archiving; label the original data units corresponding to the sampled data with the classification result h, then archive and store them.
By repeating steps S1 to S5 for each group of original data units, all the original data can be labeled.
Further, as shown in fig. 2 and fig. 3, the training model self-learning estimation in step S0 includes a neural network training model parameter estimation method, which includes the following steps:
s01: obtaining sample data. Denote the total number of samples by N; the data label y^(n) of the n-th sample is known: when the sample belongs to the k-th class, y_k^(n) = 1, otherwise y_k^(n) = 0, where 1 ≤ n ≤ N and 1 ≤ k ≤ K;
s02: sample data preprocessing, comprising sampling the sample data and adding a preset type mark; sampling comprises intercepting the first d consecutive bits of a sample data unit. The sample data unit comprises any one or more of a data file, a document, a database, a homogeneous data table field (possibly of unknown category) and the like; adding the preset type mark comprises prepending a preset type identification code of several bits to the sampled bits;
s03: training the samples, calculating the training values of each layer of the neural network model.

For each group of input preprocessed sample sampling data, the number of data bits is m, with m = d + b, where b is the bit length of the preset type mark; in this embodiment b = 3, d = 800 and m = 803. x_i is the i-th bit value of the type-marked input data, taking binary 0 or 1, 1 ≤ i ≤ m (this i indexes only the input data). The neural network training model has L training layers; the first layer is the input layer, the last layer feeds the output layer, and the output layer is not counted in the total number of layers. Layer l has s_l activation units, 1 ≤ l ≤ L, and the number of activation units in the last layer equals the expected total number of classes K, i.e. s_1 = m and s_L = K. This embodiment employs a 5-layer neural network training model (not counting the output layer), i.e. L = 5, with s_1 = s_2 = s_3 = s_4 = m = 803 and s_5 = K = 20. The weight parameter from the j-th unit of layer l to the i-th unit of layer l+1 is θ_ij^(l), with 1 ≤ j ≤ s_l and 1 ≤ i ≤ s_{l+1}.

Input the sample data and the initial values of the weight parameters. In the first test the initial values are all 0.5; in subsequent debugging, a group of weight parameter results from a previous test is used as the initial values, since such a group of parameters is generally considered more reasonable, which accelerates the test and debugging process. Then compute, for 1 ≤ i ≤ s_{l+1},

a_i^(l+1) = g((Θ_i^(l))^T X^(l)),

where Θ_i^(l) = [θ_i1^(l), …, θ_{i,s_l}^(l)]^T is the weight parameter vector from all intermediate training values of layer l to the i-th training value of layer l+1, the superscript T denotes matrix transposition, and X^(l) = [a_1^(l), …, a_{s_l}^(l)]^T is the vector of intermediate training values (intermediate quantities) of layer l.

Calculate layer by layer until the output values of the final layer, h_i^(n) with 1 ≤ i ≤ K, are obtained; record the training result of the n-th sample as h^(n) = [h_1^(n), h_2^(n), …, h_K^(n)], 1 ≤ n ≤ N.

Repeat step S03 for each group of input preprocessed sample data;
s04: calculating the cost function. Compute the cost function J(Θ) over all training results and weight parameters, where Θ is the set of all θ_ij^(l). Here λ is a deviation penalty parameter with 0 ≤ λ ≤ 1; λ is an empirical parameter and is generally small, and this embodiment takes λ = 0.001. The cost function is essentially the sum of the positive bias expectation and the negative bias expectation of the probabilities plus a bias penalty term, for example the regularized cross-entropy form

J(Θ) = −(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} [ y_k^(n) log h_k^(n) + (1 − y_k^(n)) log(1 − h_k^(n)) ] + (λ/2N) Σ_l Σ_i Σ_j (θ_ij^(l))²;
In this embodiment, the total number of samples is N = 300,000, of which 200,000 are used for model training and 100,000 for testing the trained model; when the test results are unsatisfactory, the parameters are adjusted and training and testing are repeated;
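The embodiment's 200,000 / 100,000 train-test split can be sketched as follows; the shuffle and the fixed seed are assumptions added for reproducibility:

```python
import random

def split_samples(samples, n_train=200_000, seed=0):
    """Shuffle labeled samples and split them into training and test parts,
    as in the embodiment's 300,000-sample setup."""
    pool = list(samples)
    random.Random(seed).shuffle(pool)
    return pool[:n_train], pool[n_train:]
```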
s05: weight parameter optimization. Optimize J(Θ) over all weight parameters θ_ij^(l) so that J(Θ) is minimal, i.e.

Θ* = argmin_Θ J(Θ).

The optimization comprises automatic optimization by general machine-learning software, or traversal optimization over weight parameter combinations. Traversal optimization means stepping each θ_ij^(l) through its value range one value at a time with a specified step length, repeating steps S03 and S04, and taking the θ_ij^(l) that minimize J(Θ) as the optimization result. This embodiment adopts general neural network training software (such as the TensorFlow open-source machine learning platform), whose principle is shown in FIGS. 3 and 4; after the number of training layers and the activation units of each layer are defined, the software automatically performs machine learning and outputs the optimized weight parameter vector Θ and the value of the cost function J(Θ).
In this embodiment the weight parameters are estimated by self-learning of the neural network training model. In the actual estimation process the sample data are divided into two parts: one part is used to train and estimate the weight parameters, and the other part is used to test the usability of the estimation result. All sample data are known in advance; generally, when the test pass rate exceeds 90%, the classification method and system are practically usable. Test data whose automatic classification results disagree with the preset results generally need manual analysis to pinpoint the causes, and the model parameters are adjusted if necessary. Inaccurate classification may be caused, on the one hand, by an insufficient number of sampling bits, which cannot clearly represent the original data characteristics, and on the other hand, by original data that do not match the intended fine classification, or at least by uncertain interfering terms in the sampling results of the original data.
This embodiment is generally not retrained after the test training is completed, because retraining may mean that previous classification results need to be reclassified, which requires case-by-case analysis. To ensure the availability of the classification system, this embodiment requires a test pass rate of 99% or more, after which all parameters and weights are fixed. When special cases cannot be classified, the data can be analyzed manually to find the causes; absent special requirements, no parameters are changed other than temporarily adjusting the number of sampling bits, and all cases whose causes remain unknown are classified as problem data.
Example 2: government affair data labeling system
As shown in fig. 1, a government affair data labeling system at least comprises six modules, namely parameter estimation, data acquisition, data preprocessing, data analysis, label output and data labeling, connected in sequence; the six modules execute steps S0, S1, S2, S3, S4 and S5 respectively and in sequence. The government affair data labeling system is essentially a computer system.
Further, as shown in fig. 2, the parameter estimation module includes five modules, i.e., sample data acquisition, sample data preprocessing, sample training, cost function calculation, and weight parameter optimization, and the five modules respectively execute steps S01, S02, S03, S04, and S05 in sequence.
The working principle of the system is shown in fig. 4, sample data is adopted for model training, and after a training result is obtained, the model is applied to actual data.
Regarding the methods and systems for automatic government affair data classification in embodiments 1 and 2, in practical applications the estimated weight parameters are generally no longer adjusted, especially for general government affair data. Retraining the model to estimate new weight parameters is mainly needed when a new general class of original data, differing in categories from the original government affair data, is to be subdivided, possibly under different subdivision criteria. In that case, not only must the model be trained again, but parameters such as the number of sampling bits, the preset types and the expected total number of classes must also be reset. The method of the invention can thus be applied to the automatic subdivision of other kinds of data information.
The basic principle of the invention is as follows: determine the expected number of data labels according to the total number of possible fine categories required for the government affair data; preset general data types according to the possible informatized representations of the original data units, such as number, Chinese character, English character, mixed number-and-character, picture, video and text; collect clearly classified original data as training samples to train the model and obtain reasonable weight parameters; and use the trained model to classify and identify the original data units to be classified, labeling and archiving them.
The application of the invention is as follows: data are classified by the model so that labels can be established for normally classifiable data, ensuring the usability of the data; data that cannot be matched undergo deeper analysis, and fuzzy matching is attempted; repeated data can be cleaned; dirty or polluted data are filed, the related business system is notified, and a second attempt is made to acquire standard business data; data that still cannot be matched are finally archived for future reference. The system background can perform statistical analysis, change monitoring, and data interface services on the matched label data, realizing intelligent data retrieval and supervision. Data with an unsatisfactory classification result can either be classified directly as problem data or into an "other" class, or be analyzed manually, with the model retrained when necessary. For other kinds of data information, or data information generated by other kinds of business systems, the method can be applied to subdivide the data categories further; mainly, the model parameters need corresponding design, such as the number of sampling bits, the preset types, and the expected total number of classes.
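The matching workflow in this paragraph can be sketched as a small dispatch routine. All stage names and return values below are hypothetical, introduced only to make the routing concrete:

```python
def dispatch(unit, classify, fuzzy_match, seen_hashes):
    """Route one data unit through the workflow described in the patent:
    normal labeling, fuzzy matching for unmatched data, duplicate cleaning,
    and archiving whatever still cannot be matched. Names are illustrative."""
    h = classify(unit)                     # model classification (steps S2-S4)
    if h is not None:
        key = hash(unit)
        if key in seen_hashes:             # repeated data can be cleaned
            return ("duplicate", None)
        seen_hashes.add(key)
        return ("labeled", h)
    h = fuzzy_match(unit)                  # deep analysis / fuzzy matching
    if h is not None:
        return ("fuzzy-labeled", h)
    return ("archived-for-review", None)   # filed for future reference
```

A background service could then run statistics over the `("labeled", h)` results, which corresponds to the statistical analysis and monitoring mentioned above.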
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and improvements, including to the parameter selection method and the initialization method, without departing from the technical principle of the present invention, and such modifications and improvements should also be regarded as falling within the protection scope of the present invention.
Claims (4)
1. A government affair data labeling method, characterized in that original data are sampled, a preset type mark is attached, and the result is input into a training model, which outputs the data category of the original data after classification; the data labels comprise the data categories output by the training model; the original data sampling comprises intercepting several paragraphs of the original data; the preset types comprise numbers, Chinese characters, English characters, mixed numbers and characters, pictures, videos, and texts; the training model comprises a neural network model; the government affair data labeling method comprises the following steps:
S0: training model weight parameter estimation, comprising empirical estimation or training model self-learning estimation, wherein the empirical estimation comprises manually estimating each weight parameter value of the model according to personal experience;
S1: data collection, namely collecting business system data, wherein the business system data comprise streaming or non-streaming data, structured or unstructured data, document data, and internet data;
S2: data preprocessing, comprising original data sampling and attaching a preset type mark, wherein the original data sampling comprises intercepting the first d consecutive bits of an original data unit; the original data unit comprises any one or more of data files, documents, databases, data tables, and data table fields; attaching the preset type mark comprises adding a preset type identification code of several bits in front of the sampled number;
S3: data analysis, namely calculating the classification information of each group of sampled data according to the weight parameters estimated by the training model parameter estimation method;
For each group of input preprocessed sampled data, the total data bit number is m bits, m = d + b, where b is the bit number of the preset type mark; for each group of inputs, let x_i denote the i-th bit value of the input data with the preset type mark attached. The neural network training model has L training layers, where the first layer is the input layer and the last layer is connected to the output layer; layer l has s_l activation items, 1 ≤ l ≤ L, and the number of activation items in the last layer equals the expected total number of classes K, i.e., s_1 = m and s_L = K. The weight parameter from the j-th bit of layer l of the training model to the i-th bit of layer l+1 is θ_ij^(l), 1 ≤ j ≤ s_l, 1 ≤ i ≤ s_{l+1};
The i-th input value of the first layer is a_i^(1) = x_i; the i-th intermediate training value of the second layer is a_i^(2); and, in general, the i-th intermediate training value of layer l+1 is

a_i^(l+1) = g((Θ_i^(l))^T X^(l)),

where Θ_i^(l) = [θ_i1^(l), θ_i2^(l), …, θ_i s_l^(l)]^T is the weight parameter vector carrying all intermediate training values of layer l to the i-th training value of layer l+1, the superscript T denotes matrix transposition, X^(l) = [a_1^(l), a_2^(l), …, a_{s_l}^(l)]^T is the vector of intermediate training values of layer l, and g(·) is the model's activation function;
S4: output the classification result, i.e. data label h, whenWhen it is, take h i =0; when in useWhen it is, take h i =1; when the temperature is higher than the set temperatureWhen the data is abnormal, the data does not belong to the trained data category or the classification model needs to be retrained; wherein K 1 Is 0 probability error range, K 2 Is 1 probability error range; classification result h = [ h = 1 ,h 2 ,…,h K ];
S5: data labeling and archiving, namely labeling the original data unit corresponding to the sampled data with the classification result h, then archiving and storing it.
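Steps S3 and S4 above amount to a feed-forward pass followed by thresholding against the error ranges K_1 and K_2. A minimal sketch, assuming a sigmoid activation for g (the claim does not fix the activation function) and plain Python lists for the weight matrices:

```python
import math

def sigmoid(z):
    """Assumed activation function g; the patent leaves g unspecified."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, thetas):
    """Compute training values layer by layer: a^(l+1) = g(Theta_i^(l)^T X^(l)).
    thetas[l] is an s_{l+1} x s_l weight matrix given as lists of lists."""
    a = list(x)
    for theta in thetas:
        a = [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]
    return a

def label(a_last, k1=0.1, k2=0.1):
    """Map last-layer activations to h_i using the error ranges K1 (around 0)
    and K2 (around 1); None flags abnormal data outside both ranges."""
    h = []
    for a in a_last:
        if a <= k1:
            h.append(0)
        elif a >= 1.0 - k2:
            h.append(1)
        else:
            h.append(None)   # not a trained category, or model needs retraining
    return h
```

With K_1 = K_2 = 0.1, an activation below 0.1 maps to 0, one above 0.9 maps to 1, and anything in between flags the data unit as not belonging to a trained category.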
2. The government affair data labeling method according to claim 1, wherein the training model self-learning estimation in step S0 comprises a neural network training model parameter estimation method comprising the following steps:
S01: acquire sample data; denote the total number of samples by N; the data label y^(n) of the n-th sample is known: when the sample belongs to the k-th class, y_k^(n) = 1, otherwise y_k^(n) = 0, where 1 ≤ n ≤ N and 1 ≤ k ≤ K;
S02: sample data preprocessing, comprising sample data sampling and attaching a preset type mark, wherein the sample data sampling comprises intercepting the first d consecutive bits of a sample data unit; the sample data unit comprises any one or more of data files, documents, databases, data tables, and data table fields; attaching the preset type mark comprises adding a preset type identification code of several bits in front of the sampled number;
S03: train on the samples, calculating the training value a_i^(l) of each layer of the neural network model.

For each group of input preprocessed sample data, the total data bit number is m bits, m = d + b, where b is the bit number of the preset type mark; x_i is the i-th bit value of the input data with the preset type mark attached. The neural network training model has L training layers, where the first layer is the input layer and the last layer is connected to the output layer; layer l has s_l activation items, 1 ≤ l ≤ L, and the number of activation items in the last layer equals the expected total number of classes K, i.e., s_1 = m and s_L = K. The weight parameter from the j-th bit of layer l of the training model to the i-th bit of layer l+1 is θ_ij^(l);
Input the sample data and the initial values of the weight parameters;
The training values are computed as a_i^(l+1) = g((Θ_i^(l))^T X^(l)), where Θ_i^(l) is the weight parameter vector carrying all intermediate training values of layer l to the i-th training value of layer l+1, the superscript T denotes matrix transposition, and X^(l) is the vector of intermediate training values of layer l;
Calculate layer by layer until the output values a_i^(L) of the last layer are obtained; record the training result of the n-th sample as h^(n), 1 ≤ n ≤ N;
Repeat the above calculation of step S03 for each group of input preprocessed sample data;
S04: calculate the cost function J(Θ) over all training results and weight parameters, where Θ is the set of all θ_ij^(l);
where λ is a deviation penalty parameter, 0 ≤ λ ≤ 1;
S05: weight parameter optimization: optimize J(Θ) over all weight parameters θ_ij^(l) so that J(Θ) is minimal, i.e.,

Θ = argmin J(Θ).

The optimization comprises automatic optimization by general machine-learning software or traversal optimization over combinations of the weight parameters; the traversal optimization comprises, for each θ_ij^(l), taking values one by one over its value range with a specified step size, repeating steps S03 and S04, and taking the Θ that minimizes J(Θ) as the optimization result.
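Steps S04 and S05 can be sketched as below. The claim's exact J(Θ) expression is not reproduced in this text, so a common regularized cross-entropy cost is assumed here, with λ as the deviation penalty parameter; the traversal optimizer simply steps each weight over its range and keeps the combination minimizing J(Θ):

```python
import math
from itertools import product

def cost(H, Y, thetas, lam=0.1):
    """A common regularized cross-entropy cost, assumed here because the
    patent's J(Theta) formula is not reproduced in the text; lam is the
    deviation penalty parameter, 0 <= lam <= 1."""
    n = len(H)
    data_term = -sum(
        y * math.log(h) + (1 - y) * math.log(1 - h)
        for hs, ys in zip(H, Y) for h, y in zip(hs, ys)
    ) / n
    reg_term = lam / (2 * n) * sum(
        w * w for theta in thetas for row in theta for w in row
    )
    return data_term + reg_term

def traverse(train, grid):
    """Traversal optimization (step S05): try every weight combination from
    `grid` (each weight stepped over its value range) and keep the one with
    minimal cost. `train(combo)` stands in for repeating steps S03 and S04."""
    best, best_j = None, float("inf")
    for combo in product(*grid):
        j = train(combo)
        if j < best_j:
            best, best_j = combo, j
    return best, best_j
```

Exhaustive traversal grows exponentially with the number of weights, which is why the claim also allows automatic optimization by general machine-learning software.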
3. A government affair data labeling system, comprising at least six sequentially connected modules: parameter estimation, data acquisition, data preprocessing, data analysis, label output, and data labeling, the six modules respectively executing steps S0, S1, S2, S3, S4, and S5 of claim 1 in sequence.
4. The system according to claim 3, wherein the parameter estimation module comprises five modules: sample data acquisition, sample data preprocessing, sample training, cost function calculation, and weight parameter optimization, the five modules respectively executing steps S01, S02, S03, S04, and S05 of claim 2 in sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210922987.7A CN115357769B (en) | 2022-08-02 | 2022-08-02 | Government affair data labeling system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115357769A true CN115357769A (en) | 2022-11-18 |
CN115357769B CN115357769B (en) | 2023-07-04 |
Family
ID=84033026
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210922987.7A Active CN115357769B (en) | 2022-08-02 | 2022-08-02 | Government affair data labeling system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115357769B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060195782A1 (en) * | 2005-02-28 | 2006-08-31 | Microsoft Corporation | Method and system for classifying and displaying tables of information |
CN110110756A (en) * | 2019-04-09 | 2019-08-09 | 北京中科智营科技发展有限公司 | A kind of data classification optimization method and optimization device |
US20210287089A1 (en) * | 2020-03-14 | 2021-09-16 | DataRobot, Inc. | Automated and adaptive design and training of neural networks |
US20210306200A1 (en) * | 2015-01-27 | 2021-09-30 | Moogsoft Inc. | System for decomposing events and unstructured data |
CN113505222A (en) * | 2021-06-21 | 2021-10-15 | 山东师范大学 | Government affair text classification method and system based on text circulation neural network |
CN114091472A (en) * | 2022-01-20 | 2022-02-25 | 北京零点远景网络科技有限公司 | Training method of multi-label classification model |
Non-Patent Citations (1)
Title |
---|
Hu Tiantian: "Research and Application of Knowledge Extraction for the Government Affairs Domain", China Excellent Master's Theses Full-text Database, Social Sciences Series I *
Also Published As
Publication number | Publication date |
---|---|
CN115357769B (en) | 2023-07-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543084B (en) | Method for establishing detection model of hidden sensitive text facing network social media | |
CN109165294B (en) | Short text classification method based on Bayesian classification | |
CN111222305B (en) | Information structuring method and device | |
Adams et al. | Automating image matching, cataloging, and analysis for photo-identification research | |
CN108573031A (en) | A kind of complaint sorting technique and system based on content | |
CN109492230B (en) | Method for extracting insurance contract key information based on interested text field convolutional neural network | |
CN112163424A (en) | Data labeling method, device, equipment and medium | |
CN112560478A (en) | Chinese address RoBERTA-BilSTM-CRF coupling analysis method using semantic annotation | |
CN115293131B (en) | Data matching method, device, equipment and storage medium | |
CN113051914A (en) | Enterprise hidden label extraction method and device based on multi-feature dynamic portrait | |
CN113515600B (en) | Automatic calculation method for spatial analysis based on metadata | |
CN115759104B (en) | Financial domain public opinion analysis method and system based on entity identification | |
CN114416979A (en) | Text query method, text query equipment and storage medium | |
CN115659226A (en) | Data processing system for acquiring APP label | |
CN114297987A (en) | Document information extraction method and system based on text classification and reading understanding | |
CN110705384B (en) | Vehicle re-identification method based on cross-domain migration enhanced representation | |
CN109783483A (en) | A kind of method, apparatus of data preparation, computer storage medium and terminal | |
CN115357769B (en) | Government affair data labeling system and method | |
CN111966640A (en) | Document file identification method and system | |
CN116434273A (en) | Multi-label prediction method and system based on single positive label | |
CN115599885A (en) | Document full-text retrieval method and device, computer equipment, storage medium and product | |
CN114722183A (en) | Knowledge pushing method and system for scientific research tasks | |
CN113254612A (en) | Knowledge question-answering processing method, device, equipment and storage medium | |
CN112818215A (en) | Product data processing method, device, equipment and storage medium | |
CN113297845B (en) | Resume block classification method based on multi-level bidirectional circulation neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||