CN112381143B - Automatic variable classification method and system based on machine learning

Info

Publication number: CN112381143B
Application number: CN202011272803.4A
Authority: CN
Inventors: 魏强; 孙向学; 张上亚; 王臣亮; 张学敬; 翟迪; 马静静; 郁峰
Original assignee: New Great Wall Technology Co ltd
Current assignee: New Great Wall Technology Co ltd
Priority date: 2020-11-13
Filing date: 2020-11-13
Publication date: 2023-12-05
Anticipated expiration: 2040-11-13
Also published as: CN112381143A

Abstract

The application discloses a machine-learning-based automatic variable classification method and system, and relates to the technical field of information processing. The method comprises the following steps: acquiring a report to be processed, extracting text information of the report, and storing it in a text object; extracting the text information from the text object, splitting it into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing them in a part-of-speech recognition object; extracting the variable feature words from the part-of-speech recognition object, and comparing them with the variables in a variable word stock to form classification rules for the feature words; and extracting the variable feature words into the corresponding variable blocks according to the classification rules. The automatic variable classification method provided by the application is implemented based on machine learning, is suitable for automatic variable classification of statistical reports, and can handle the complex variable identification work in the data statistics process.

Description

Automatic variable classification method and system based on machine learning

Technical Field

The application relates to the technical field of information processing, in particular to a variable automatic classification method and system based on machine learning.

Background

At present, when the data of statistical reports are integrated, the text in the main and guest columns is mostly identified manually or by a program, in order to distinguish whether the variables in the main and guest columns are indexes or grouping items. The error rate of program-based variable identification is high, so manual verification is needed; manual identification in turn places high demands on the business level of the personnel and is itself frequently subject to errors.

Disclosure of Invention

The technical problem to be solved by the application is to provide, in view of the defects of the prior art, a machine-learning-based automatic variable classification method and system.

The technical scheme for solving the technical problems is as follows:

a machine learning based automatic classification method for variables, comprising:

acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;

extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in a part-of-speech recognition object;

extracting the variable feature words from the part-of-speech recognition objects, and comparing the extracted variable feature words with variables in a variable word stock to form classification rules for extracting feature words;

and extracting the variable feature words into corresponding variable blocks according to the classification rules.

The automatic variable classification method provided by the application is implemented based on machine learning and is suitable for automatic variable classification of statistical reports. The text information of the report is extracted, part-of-speech recognition and feature extraction are performed in sequence, the extracted feature words are compared with the variables in the variable word stock to construct classification rules, and automatic classification is then carried out according to those rules. In this way the complex variable identification work in the data statistics process can be handled.

Further, the application can be modified as follows:

acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object, wherein the method specifically comprises the following steps:

and acquiring a report to be processed, identifying all filled content areas in the report, identifying data in each cell, judging the data type of the data in each cell, and storing the identified data and the data type in a text object.

The beneficial effects of adopting the further scheme are as follows: recognizing and processing the filled content areas makes it convenient to carry out the subsequent steps, such as part-of-speech recognition and feature extraction, on the data, thereby improving the classification precision.

Further, the application can be modified as follows:

extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in a part-of-speech recognition object, wherein the method specifically comprises the following steps of:

extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, determining whether each word is a noun, a verb, an adjective or a function word (virtual word), removing the word if it is a function word, taking the remaining words as variable feature words, and storing the extracted variable feature words in the part-of-speech recognition object.

The beneficial effects of adopting the further scheme are as follows: using function words as feature words would introduce a great deal of noise and directly reduce the efficiency and accuracy of the subsequent variable classification. Therefore, when the variable features are extracted, the function words, which are of little use for classification, are removed, and words with stronger expressive power for variable classification, such as content words, are used, so that the efficiency and accuracy of the subsequent variable classification can be further improved.

Further, the application can be modified as follows:

extracting the variable feature words into corresponding variable blocks according to the classification rules, wherein the method specifically comprises the following steps:

when the variable feature words are regions or codes, adding the variable feature words into code blocks;

when the variable feature words are groups, adding the variable feature words into a grouping block;

when the variable feature words are measurement units, adding the variable feature words into a measurement unit block;

and when the variable feature words are indexes, adding the variable feature words into a metering index block.

The beneficial effects of adopting the further scheme are as follows: by adding variable feature words to different variable blocks according to their types, accurate variable classification can be achieved.

Further, the application can be modified as follows:

the automatic variable classification method based on machine learning further comprises the following steps:

cleaning repeated records in each variable block of the variable word stock according to a preset cleaning rule, to construct a standard variable word stock.

The beneficial effects of adopting the further scheme are as follows: repeated records in the word stock blocks are cleaned and a version of the word stock variables is constructed, so that the word stock can be conveniently used by the subsequent automatic identification.

The other technical scheme for solving the technical problems is as follows:

a machine learning based variable automatic classification system comprising:

the acquisition unit is used for acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;

the recognition unit is used for extracting the text information from the text object, splitting the text information into words by utilizing a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in the part-of-speech recognition object;

the matching unit is used for extracting the variable feature words from the part-of-speech recognition objects, and comparing the extracted variable feature words with variables in a variable word stock to form classification rules for extracting feature words;

and the classification unit is used for extracting the variable feature words into corresponding variable blocks according to the classification rules.

The automatic variable classification system is implemented based on machine learning and is suitable for automatic variable classification of statistical reports. It extracts the text information of the report, performs part-of-speech recognition and feature extraction in sequence, compares the extracted feature words with the variables in the variable word stock to construct classification rules, and then classifies the variables automatically according to those rules, so that the complex variable identification work in the data statistics process can be handled.

Further, the application can be modified as follows:

the acquisition unit is specifically configured to acquire a report to be processed, identify all filled content areas in the report, identify data in each cell, determine a data type of the data in each cell, and store the identified data and data types in a text object.

Further, the application can be modified as follows:

the recognition unit is specifically configured to extract the text information from the text object, split the text information into words by using a preset word segmentation algorithm, determine whether each word is a noun, a verb, an adjective or a function word, reject the word if it is a function word, take the remaining words as variable feature words, and store the extracted variable feature words in the part-of-speech recognition object.

Further, the application can be modified as follows:

the classifying unit is specifically used for:

Further, the application can be modified as follows:

the automatic variable classification system based on machine learning further comprises:

and the cleaning unit is used for enabling the variable word stock to clean repeated records in each variable block according to a preset cleaning rule, and constructing a standard variable word stock.

Additional aspects of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.

Drawings

FIG. 1 is a schematic flow chart of an embodiment of an automatic variable classification method according to the present application;

FIG. 2 is a diagram of data according to an embodiment of the present application;

FIG. 3 is another schematic diagram of data in an embodiment of the present application;

FIG. 4 is a block diagram of an embodiment of an automatic variable classification system according to the present application.

Detailed Description

The principles and features of the present application are described below with reference to the drawings. The illustrated embodiments are provided for illustration only and are not intended to limit the scope of the present application.

FIG. 1 is a schematic flow chart of an embodiment of the automatic variable classification method of the present application. The method is implemented based on machine learning, is applicable to automatic variable classification of statistical reports, and includes:

S1, acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;

It should be noted that the text information of the report may be the text in the main and guest columns. For example, FIG. 2 shows an exemplary report: the left part of the report is the main column, which records the data items of the report, and the top of the report is the guest column, which records the statistical mode corresponding to those data items. In other words, in a report table the area where the main column and the guest column intersect is the filling area, which usually holds the filled-in data, and the index corresponding to each piece of data is determined by the contents of the main column and the table header.

S2, extracting text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in the part-of-speech recognition object;

It should be noted that the word segmentation algorithm is used for splitting the text into words and distinguishing nouns, verbs, adjectives and function words (virtual words), where the function words may include interjections, prepositions, conjunctions and the like. This can be implemented by existing programs and is not described here again.

Using function words as feature words would introduce a great deal of noise and directly reduce the efficiency and accuracy of the subsequent variable classification. Therefore, when the variable features are extracted, the function words, which are of no use for classification, are removed first. Among the content words, nouns and verbs have the strongest expressive power for variable classification, so preferably only nouns and verbs are extracted as the feature words of variables.

S3, extracting variable feature words from the part-of-speech recognition objects, and comparing the extracted variable feature words with variables in a variable word stock to form classification rules for extracting feature words;

It should be understood that if no corresponding variable exists in the variable word stock for a variable feature word, the feature word is placed in the object to be processed.

It should be noted that, when a single feature word is compared with the variables in the variable word stock, it can be determined whether the feature word completely matches an element in the variable word stock.

For a single feature word that cannot be matched with any element in the variable word stock, several feature words can be combined and the combination compared with the variable word stock; the feature words may be combined multiple times, each combination being matched against the elements in the variable word stock.

For example, assume that two feature words, "plan" and "total investment", have been identified, and that "plan" cannot be matched in the variable word stock. Then "plan" and "total investment" may be combined to obtain "plan total investment", and matching is performed again using the combined word.
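The single-word and combined-word matching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `match_feature_words`, the `max_combine` limit, the use of a plain set as the word stock, and joining words with a space are all hypothetical choices made for the example.

```python
def match_feature_words(feature_words, word_stock, max_combine=3):
    """Match feature words against the word stock, combining neighbours on a miss.

    Returns (matched, pending); pending holds the words that found no match
    and are left for later handling (the "object to be processed").
    """
    matched, pending = [], []
    i = 0
    while i < len(feature_words):
        hit = None
        # Try the single word first, then longer runs of adjacent words.
        for size in range(1, max_combine + 1):
            if i + size > len(feature_words):
                break
            # Assumption: combined words are joined with a single space.
            candidate = " ".join(feature_words[i:i + size])
            if candidate in word_stock:
                hit = (candidate, size)
                break
        if hit:
            matched.append(hit[0])
            i += hit[1]  # skip every word consumed by the combination
        else:
            pending.append(feature_words[i])
            i += 1
    return matched, pending


# Hypothetical word stock containing only the combined term.
stock = {"plan total investment", "region"}
print(match_feature_words(["plan", "total investment"], stock))
```

With the word stock shown, "plan" alone misses, the combination "plan total investment" hits, and nothing is left pending.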

Then, both the matched and the unmatched feature words are processed by intelligent recognition using NLP (natural language processing). For example, the feature words can be divided into variable blocks such as {codes, groups, regions, measurement units, indexes}, and the recognized feature words are extracted and classified according to these variable types.

S4, extracting the variable feature words into corresponding variable blocks according to the classification rules.

For example, after extracting the feature words, the feature words are compared with classification rules in the variable word stock to form the following rules:

When the feature words are regions or codes: the feature words are added to the code block.

When the feature words are groups: the feature words are added to the grouping block.

When the feature words are measurement units: the feature words are added to the measurement unit block.

When the feature words are indexes: the feature words are added to the metering index block.
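The four rules above amount to a lookup from variable type to target block. The sketch below shows one way to express that. It assumes, hypothetically, that the variable word stock is a dictionary mapping each known word to its variable type; the names `BLOCK_FOR_TYPE` and `classify` and the "pending" bucket for unknown words are illustrative and do not come from the source.

```python
from collections import defaultdict

# Rule table following the four rules in the text: variable type -> target block.
BLOCK_FOR_TYPE = {
    "region": "code block",
    "code": "code block",
    "group": "grouping block",
    "measurement unit": "measurement unit block",
    "index": "metering index block",
}


def classify(feature_words, word_stock):
    """Place each feature word into the block named by its type in the word stock."""
    blocks = defaultdict(list)
    for word in feature_words:
        var_type = word_stock.get(word)
        # Words with no known type are parked in a hypothetical "pending" bucket.
        block = BLOCK_FOR_TYPE.get(var_type, "pending")
        blocks[block].append(word)
    return dict(blocks)


# Hypothetical word stock entries for illustration.
stock = {
    "Beijing": "region",
    "total investment": "index",
    "ton": "measurement unit",
    "by industry": "group",
}
print(classify(["Beijing", "ton", "total investment", "by industry"], stock))
```

Keeping the rules in a table rather than in branching code means a new variable type only needs a new table entry.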

The automatic variable classification method provided by the embodiment is implemented based on machine learning and is suitable for automatic variable classification of statistical reports. The text information of the report is extracted, part-of-speech recognition and feature extraction are performed in sequence, the extracted feature words are compared with the variables in the variable word stock to construct classification rules, and automatic classification is then carried out according to those rules. In this way the complex variable identification work in the data statistics process can be handled.

Optionally, in some possible embodiments, acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object specifically includes:

acquiring a report to be processed, identifying all filled content areas in the report, identifying the data in each cell, judging the data type of the data in each cell, and storing the identified data and data types in the text object.

Specifically, all the filled content areas in the report can be identified, and the type of the data filled in each cell of the content areas is judged.

The judgment can be made according to the following rules:

for example, the content before a colon may be deleted, such as "wherein:" and "in the aggregate:";

for example, the measurement units in text brackets may be identified, such as "ten thousand yuan", "hundred million yuan" and "ton";

for example, a reporting period in text brackets may be identified through regular expression matching;

for example, known useless remark information may be deleted, such as {one, two, three, four, five, continued one, continued two, continued three};

for example, other information may be identified, such as "(100% of the year in the same period)".

Recognizing and processing the filled content areas makes it convenient to carry out the subsequent steps, such as part-of-speech recognition and feature extraction, on the data, thereby improving the classification precision.
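The cell-level rules listed above could be sketched as below. Everything concrete here is an assumption made for illustration: the unit list, the remark markers, the reporting-period pattern (a four-digit year with an optional month), the choice to look for remarks only inside brackets, and the names `detect_data_type` and `normalize_header`. The source gives only examples, not exact rules.

```python
import re

# Hypothetical placeholders; the text only gives examples of each category.
KNOWN_UNITS = {"ten thousand yuan", "hundred million yuan", "ton"}
USELESS_REMARKS = {"one", "two", "three", "four", "five",
                   "continued one", "continued two", "continued three"}
PERIOD_RE = re.compile(r"^\d{4}(-\d{1,2})?$")   # assumed form: "2020" or "2020-11"
BRACKET_RE = re.compile(r"\(([^()]*)\)")


def detect_data_type(value):
    """Judge the data type of a cell: 'empty', 'number' or 'text'."""
    text = str(value).strip()
    if not text:
        return "empty"
    try:
        float(text.replace(",", ""))  # tolerate thousands separators
        return "number"
    except ValueError:
        return "text"


def normalize_header(text):
    """Apply the cleaning rules to one header cell and return its parts."""
    result = {"text": "", "unit": None, "period": None, "other": []}
    # Delete the content before the first colon (ASCII or full-width).
    text = re.split(r"[:：]", text, maxsplit=1)[-1]
    # Inspect every bracketed fragment and sort it into a category.
    for fragment in BRACKET_RE.findall(text):
        item = fragment.strip()
        if item in KNOWN_UNITS:
            result["unit"] = item
        elif PERIOD_RE.match(item):
            result["period"] = item
        elif item in USELESS_REMARKS:
            continue  # known useless remark: dropped
        else:
            result["other"].append(item)
    # What remains outside the brackets is the text passed on to segmentation.
    result["text"] = BRACKET_RE.sub("", text).strip()
    return result


print(normalize_header("wherein: total investment (ten thousand yuan) (2020-11)"))
```

Separating the unit and the reporting period from the header text before word segmentation keeps those fragments from being mistaken for feature words.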

Optionally, in some possible embodiments, text information is extracted from the text object, the text information is split into words by using a preset word segmentation algorithm, variable feature words are extracted from the words, and the extracted variable feature words are stored in the part-of-speech recognition object, which specifically includes:

extracting text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, determining whether each word is a noun, a verb, an adjective or a function word, removing the word if it is a function word, taking the remaining words as variable feature words, and storing the extracted variable feature words in the part-of-speech recognition object.

Using function words as feature words would introduce a great deal of noise and directly reduce the efficiency and accuracy of the subsequent variable classification. Therefore, when the variable features are extracted, the function words, which are of little use for classification, are removed, and words with stronger expressive power for variable classification, such as content words, are used, so that the efficiency and accuracy of the subsequent variable classification can be further improved.

Preferably, only nouns and verbs may be extracted as feature words of variables.
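The filtering step can be illustrated with a toy example. The sketch below is not a real word segmentation algorithm: splitting on whitespace and looking words up in a small hand-written lexicon stand in for the preset segmenter and its part-of-speech tagger, and the names `POS_LEXICON`, `segment`, `tag` and `extract_feature_words` are hypothetical. Treating unknown words as nouns is likewise an assumption.

```python
# Toy part-of-speech lexicon standing in for a real segmenter's tagger.
POS_LEXICON = {
    "total": "adjective",
    "investment": "noun",
    "plan": "noun",
    "completed": "verb",
    "of": "function",
    "and": "function",
    "the": "function",
}


def segment(text):
    """Stand-in word segmentation: lower-case and split on whitespace."""
    return text.lower().split()


def tag(word):
    """Look up the part of speech; unknown words are treated as nouns (assumption)."""
    return POS_LEXICON.get(word, "noun")


def extract_feature_words(text, keep=("noun", "verb", "adjective")):
    """Drop function words and keep the remaining words as variable feature words."""
    return [word for word in segment(text) if tag(word) in keep]


# Default: remove function words only.
print(extract_feature_words("Total investment of the plan"))
# Preferred variant from the text: keep nouns and verbs only.
print(extract_feature_words("Total investment of the plan", keep=("noun", "verb")))
```

The `keep` parameter makes the preferred variant, nouns and verbs only, a matter of configuration rather than a second code path.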

Optionally, in some possible embodiments, extracting the variable feature words into the corresponding variable blocks according to the classification rule specifically includes:

when the variable feature words are regions or codes, adding the variable feature words into the code block;

when the variable feature words are groups, adding the variable feature words into the grouping block;

when the variable feature words are measurement units, adding the variable feature words into the measurement unit block;

when the variable feature words are indexes, adding the variable feature words into the metering index block.

By adding variable feature words to different variable blocks according to their types, accurate variable classification can be achieved.

Optionally, in some possible embodiments, the automatic classification method of variables based on machine learning further includes:

cleaning repeated records in each variable block of the variable word stock according to a preset cleaning rule, to construct a standard variable word stock.

Repeated records in the word stock blocks are cleaned and a version of the word stock variables is constructed, so that the word stock can be conveniently used by the subsequent automatic identification.
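The source does not spell out the preset cleaning rule. As one possible reading, the sketch below treats two records as repeats when they are equal after trimming, collapsing inner whitespace and lower-casing, and keeps the first occurrence. The function name `clean_word_stock` and that normalisation rule are assumptions.

```python
def clean_word_stock(blocks):
    """Remove repeated records in every variable block, keeping first occurrences."""
    cleaned = {}
    for block_name, records in blocks.items():
        seen = set()
        unique = []
        for record in records:
            # Assumed cleaning rule: compare records case- and whitespace-insensitively.
            key = " ".join(record.split()).lower()
            if key and key not in seen:
                seen.add(key)
                unique.append(record.strip())
        cleaned[block_name] = unique
    return cleaned


raw = {"measurement unit block": ["ton", " Ton ", "ten thousand yuan", "ton"]}
print(clean_word_stock(raw))
```

Order is preserved, so the cleaned word stock stays stable from one version to the next when new records are appended.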

Specific examples are given below in connection with fig. 2 and 3.

First, after the report is imported, all the text of the report and the fillable area of the report are identified, and the main column and the guest column are determined. Part-of-speech recognition is then performed: the text in the main and guest columns is extracted and compared with the business word stock to determine whether it is a variable, namely an index, a group, a unit, and so on.

Second, some variables follow regular patterns, which can be found from the features of the variables, for example:

"yuan" is identified as: currency.

"gas" is identified as: energy, resource.

"electric power" is identified as: infrastructure, civil, energy.

The range in which the corresponding variable lies is then found through the operation rules.
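One way to read the "operation rule" step is as a set operation over the ranges associated with each feature word. The sketch below takes that reading, which is an assumption rather than something the source states: each feature word maps to a set of candidate ranges, and the ranges shared by all known words are returned. The table contents follow the three examples above; the names `FEATURE_RANGES` and `common_range` are hypothetical.

```python
# Feature word -> candidate ranges, following the three examples in the text.
FEATURE_RANGES = {
    "yuan": {"currency"},
    "gas": {"energy", "resource"},
    "electric power": {"infrastructure", "civil", "energy"},
}


def common_range(feature_words):
    """Intersect the ranges of all known feature words; empty set if none is known."""
    known = [FEATURE_RANGES[word] for word in feature_words if word in FEATURE_RANGES]
    if not known:
        return set()
    return set.intersection(*known)


print(common_range(["gas", "electric power"]))
```

Under this reading, a header that mentions both "gas" and "electric power" is narrowed to the range the two words share.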

Next, according to how many variable feature words there are in the main and guest columns of the report, such as "power, gas and water supply", the feature words are combined and compared with the indexes in the variable library to determine which variable type (e.g., index or group) the combination most probably matches.

Finally, for the cells where the main column and the guest column intersect, it is determined from the other contents of the current report which indexes apply to the data filling area, and the corresponding variable components are marked automatically, as shown in fig. 3.

When the user fills in data, each filled-in number thus carries the attributes of its variables, such as index, group and unit.
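The final marking step, in which each cell at the intersection of the main and guest columns inherits the variables of its row and column headers, can be sketched as follows. The data shapes are assumptions chosen for the example: each header is a dictionary of already classified variables and the filled-in data is a two-dimensional list. The name `annotate_cells` and the rule that row-header variables override column-header variables on conflict are hypothetical.

```python
def annotate_cells(main_column, guest_column, values):
    """Attach the variables of the row and column headers to each filled cell.

    main_column:  list of dicts, one per row header (assumed shape).
    guest_column: list of dicts, one per column header (assumed shape).
    values:       2-D list of filled-in data, indexed as values[row][col].
    """
    annotated = []
    for r, row_vars in enumerate(main_column):
        for c, col_vars in enumerate(guest_column):
            cell = {"row": r, "col": c, "value": values[r][c]}
            # Column variables are applied first; row variables win on conflict (assumption).
            cell.update(col_vars)
            cell.update(row_vars)
            annotated.append(cell)
    return annotated


main = [{"group": "power, gas and water supply"}]
guest = [{"index": "total investment", "unit": "ten thousand yuan"}]
print(annotate_cells(main, guest, [[120.5]]))
```

Each filled-in number then carries its index, group and unit with it, which is the property the example describes.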

It is to be understood that in some possible implementations, some other examples may include all or part of any of the implementations described above, as may be practical.

FIG. 4 is a structural framework diagram of an embodiment of the automatic variable classification system of the present application. The system is implemented based on machine learning, is suitable for automatic variable classification of statistical reports, and includes:

the acquisition unit 1 is used for acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;

the recognition unit 2 is used for extracting text information from the text object, splitting the text information into words by utilizing a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in the part-of-speech recognition object;

the matching unit 3 is used for extracting variable feature words from the part-of-speech recognition objects, and comparing the extracted variable feature words with variables in a variable word stock to form classification rules for extracting feature words;

and the classification unit 4 is used for extracting the variable feature words into corresponding variable blocks according to the classification rules.

The automatic variable classification system provided by the embodiment is implemented based on machine learning and is suitable for automatic variable classification of statistical reports. It extracts the text information of the report, performs part-of-speech recognition and feature extraction in sequence, compares the extracted feature words with the variables in the variable word stock to construct classification rules, and then classifies the variables automatically according to those rules, so that the complex variable identification work in the data statistics process can be handled.
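How the four units hand data to one another can be sketched as a small pipeline. This is a simplified illustration under stated assumptions, not the system of the embodiment: the class and function names are hypothetical, whitespace splitting stands in for the word segmentation algorithm, a fixed function-word set stands in for part-of-speech recognition, and the word stock is assumed to map each known word directly to its target block.

```python
class AcquisitionUnit:
    """Collects the non-empty text cells of a report (a 2-D list of cells)."""

    def run(self, report):
        return [cell.strip() for row in report for cell in row
                if isinstance(cell, str) and cell.strip()]


class RecognitionUnit:
    """Splits text into words and drops function words."""

    def __init__(self, function_words):
        self.function_words = set(function_words)

    def run(self, texts):
        # Whitespace splitting stands in for a real word segmentation algorithm.
        return [word for text in texts for word in text.lower().split()
                if word not in self.function_words]


class MatchingUnit:
    """Compares feature words with the word stock to form classification rules."""

    def __init__(self, word_stock):
        self.word_stock = dict(word_stock)

    def run(self, feature_words):
        return {w: self.word_stock[w] for w in feature_words if w in self.word_stock}


class ClassificationUnit:
    """Extracts feature words into variable blocks according to the rules."""

    def run(self, rules):
        blocks = {}
        for word, block in rules.items():
            blocks.setdefault(block, []).append(word)
        return blocks


def classify_report(report, function_words, word_stock):
    """Run the four units in order: acquire, recognize, match, classify."""
    texts = AcquisitionUnit().run(report)
    words = RecognitionUnit(function_words).run(texts)
    rules = MatchingUnit(word_stock).run(words)
    return ClassificationUnit().run(rules)


report = [["Region", "Output of steel"], ["Beijing", 12.5]]
stock = {"region": "code block", "output": "metering index block", "beijing": "code block"}
print(classify_report(report, {"of"}, stock))
```

Each unit has a single `run` method, so a unit can be replaced, for instance with a real segmenter in the recognition unit, without touching the others.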

Optionally, in some possible embodiments, the acquisition unit 1 is specifically configured to acquire a report to be processed, identify all the filled content areas in the report, identify the data in each cell, determine the data type of the data in each cell, and store the identified data and data types in the text object.

Optionally, in some possible embodiments, the recognition unit 2 is specifically configured to extract text information from the text object, split the text information into words by using a preset word segmentation algorithm, determine whether each word is a noun, a verb, an adjective or a function word, reject the word if it is a function word, take the remaining words as variable feature words, and store the extracted variable feature words in the part-of-speech recognition object.

Optionally, in some possible embodiments, the classification unit 4 is specifically configured to:

Optionally, in some possible embodiments, the machine learning based variable automatic classification system further comprises:

and the cleaning unit is used for enabling the variable word stock to clean repeated records in each variable block according to a preset cleaning rule and constructing a standard variable word stock.

It should be understood that the above embodiments are product embodiments corresponding to the method embodiments of the present application, and the technical solutions of the two embodiments correspond to each other, so that specific descriptions of the above product embodiments may refer to the above method embodiments, and are not repeated herein.

It will be appreciated that the present application may also provide a storage medium having stored therein instructions which, when read by a computer, cause the computer to perform the machine learning based variable automatic classification method of any of the above embodiments.

It can be appreciated that the present application may also provide an electronic device including:

a memory for storing a computer program;

and a processor for executing a computer program to implement the automatic classification method of variables based on machine learning according to any of the above embodiments.

The reader will appreciate that in the description of this specification, a description of the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those embodiments or examples, provided they do not contradict one another.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the method embodiments described above are merely illustrative: the division of steps is merely a division by logical function, and other divisions are possible in actual implementation; for instance, multiple steps may be combined or integrated into another step, or some features may be omitted or not performed.

The above-described method, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

The present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the present application, and these modifications and substitutions are intended to be included in the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims

1. A machine learning based automatic classification method for variables, comprising:

extracting the variable feature words into corresponding variable blocks according to the classification rules;

acquiring a report to be processed, identifying all filled content areas in the report, identifying data in each cell, judging the data type of the data in each cell, and storing the identified data and the data type in a text object;

2. The automatic classification method of variable based on machine learning according to claim 1, wherein extracting the variable feature words into corresponding variable blocks according to the classification rule specifically comprises:

3. The machine learning based variable automatic classification method according to claim 1 or 2, further comprising:

4. A machine learning based automatic variable classification system comprising:

the classification unit is used for extracting the variable feature words into corresponding variable blocks according to the classification rules;

the acquisition unit is specifically used for acquiring a report to be processed, identifying all filled content areas in the report, identifying data in each cell, judging the data type of the data in each cell, and storing the identified data and data types in a text object;

5. The automatic classification system of machine learning based variables of claim 4, wherein the classification unit is specifically configured to:

6. The machine learning based variable automatic classification system of claim 4 or 5, further comprising: