CN112381143A

CN112381143A - Variable automatic classification method and system based on machine learning

Info

Publication number: CN112381143A
Application number: CN202011272803.4A
Authority: CN
Inventors: 魏强; 孙向学; 张上亚; 王臣亮; 张学敬; 翟迪; 马静静; 郁峰
Original assignee: Great Wall Computer Software & Systems Inc
Current assignee: Great Wall Computer Software & Systems Inc
Priority date: 2020-11-13
Filing date: 2020-11-13
Publication date: 2021-02-19
Anticipated expiration: 2040-11-13
Also published as: CN112381143B

Abstract

The invention discloses a variable automatic classification method and system based on machine learning, and relates to the technical field of information processing. The method comprises the following steps: acquiring a report to be processed, extracting text information of the report, and storing it in a text object; extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, and extracting variable feature words from the words; comparing the extracted variable feature words with the variables in a variable lexicon to form a classification rule for extracting the feature words; and extracting the variable feature words into corresponding variable blocks according to the classification rule. The automatic variable classification method provided by the invention is implemented based on machine learning, is suitable for automatic variable classification of statistical reports, and can relieve the complicated variable identification work in the data statistics process.

Description

Variable automatic classification method and system based on machine learning

Technical Field

The invention relates to the technical field of information processing, in particular to a variable automatic classification method and system based on machine learning.

Background

At present, when the data of a statistical report is organized, the usual approach is to recognize the text in the main column (row headings) and the guest column (column headings) and to distinguish whether each variable in them is an index or a grouping item. Variable identification by a program has a high error rate and requires manual checking, which places high demands on the professional competence of the personnel, and manual identification errors still occur frequently.

Disclosure of Invention

The invention aims to solve the technical problem of the prior art and provides a variable automatic classification method and system based on machine learning.

The technical scheme for solving the technical problems is as follows:

a variable automatic classification method based on machine learning comprises the following steps:

acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;

extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable characteristic words from the words, and storing the extracted variable characteristic words in a part-of-speech recognition object;

extracting the variable characteristic words from the part-of-speech recognition object, and comparing the extracted variable characteristic words with variables in a variable word bank to form a classification rule for extracting the characteristic words;

and extracting the variable feature words into corresponding variable blocks according to the classification rules.

The method for automatically classifying the variables is realized based on machine learning, is suitable for automatically classifying the variables of the statistical report, extracts the text information of the report, respectively performs part-of-speech recognition and feature extraction in sequence, compares the text information with the variables in the variable lexicon to construct classification rules, and automatically classifies according to the classification rules, so that the method for establishing the automatic classification of the variables by using machine learning is realized, and the complicated variable recognition work in the data statistics process can be solved.

Further, the invention can be improved as follows:

the step of acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object specifically comprises:

acquiring the report to be processed, identifying all filled content areas in the report, identifying the data in each cell, judging the data type of the data in each cell, and storing the identified data and data types in the text object.

The beneficial effect of adopting the further scheme is that: by identifying and processing the filled content area, the subsequent steps of part-of-speech identification, feature extraction and the like on the data can be conveniently carried out, so that the classification precision is improved.
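As an illustration of the cell identification described above, the following minimal Python sketch judges a data type for each filled cell and stores the value together with its type. The names `judge_data_type` and `build_text_object`, the type labels, and the date format are assumptions made for this example; the invention does not specify them.

```python
from datetime import datetime


def judge_data_type(value):
    """Return a coarse type label for one cell value (labels are hypothetical)."""
    text = "" if value is None else str(value).strip()
    if not text:
        return "empty"
    try:
        float(text.replace(",", ""))  # accept thousands separators such as 1,234.5
        return "number"
    except ValueError:
        pass
    try:
        datetime.strptime(text, "%Y-%m-%d")  # assumed date format
        return "date"
    except ValueError:
        return "text"


def build_text_object(cells):
    """Map {(row, col): value} to a list of records for the filled cells only."""
    records = []
    for (row, col), value in sorted(cells.items()):
        data_type = judge_data_type(value)
        if data_type != "empty":
            records.append(
                {"row": row, "col": col, "value": value, "type": data_type}
            )
    return records
```

Here the "text object" is modelled as a plain list of dictionaries; any structure that keeps each value next to its judged type would serve the same purpose.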

Further, the invention can be improved as follows:

the step of extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in a part-of-speech recognition object specifically comprises:

extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, determining whether each word is a noun, a verb, an adjective or a function word, removing the word if it is a function word, taking the remaining words as variable feature words, and storing the extracted variable feature words in the part-of-speech recognition object.

The beneficial effect of adopting the further scheme is that: using function words as feature words introduces considerable noise, which directly reduces the efficiency and accuracy of the subsequent variable classification. Therefore, when the variable features are extracted, function words, which are of little use for classification, are removed, and words that are strongly expressive for variable classification, such as content words, are used, so that the efficiency and accuracy of the subsequent variable classification can be further improved.

Further, the invention can be improved as follows:

extracting the variable feature words into corresponding variable blocks according to the classification rules, which specifically comprises:

when the variable characteristic words are regions or codes, adding the variable characteristic words into a code block;

when the variable characteristic words are groupings, adding the variable characteristic words into a grouping block;

when the variable characteristic words are measuring units, adding the variable characteristic words into a measuring unit block;

and when the variable characteristic words are indexes, adding the variable characteristic words into a metering index block.

The beneficial effect of adopting the further scheme is that: by adding the variable feature words to different variable blocks according to the types of the variable feature words, accurate variable classification can be achieved.

Further, the invention can be improved as follows:

the automatic variable classification method based on machine learning further comprises the following steps:

cleaning the repeated records in each variable block of the variable word bank according to a preset cleaning rule, so as to construct a standard variable word bank.

The beneficial effect of adopting the further scheme is that: duplicate records in the word bank blocks are cleaned and a version of the word bank that stores the variables is constructed, which makes the subsequent automatic identification method convenient to use.

Another technical solution of the present invention for solving the above technical problems is as follows:

a machine learning based variable automatic classification system comprising:

an acquisition unit, used for acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;

the recognition unit is used for extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable characteristic words from the words, and storing the extracted variable characteristic words in a part-of-speech recognition object;

the matching unit is used for extracting the variable characteristic words from the part-of-speech recognition objects, comparing the extracted variable characteristic words with variables in a variable word bank and forming a classification rule for extracting the characteristic words;

and the classification unit is used for extracting the variable feature words into corresponding variable blocks according to the classification rules.

The automatic variable classification system provided by the invention is realized based on machine learning, is suitable for automatic variable classification of statistical reports, extracts the text information of the reports, respectively performs part-of-speech recognition and feature extraction in sequence, compares the text information with the variables in the variable lexicon to construct classification rules, and automatically classifies according to the classification rules, so that the method for establishing automatic variable classification by using machine learning is realized, and the complicated variable identification work in the data statistics process can be solved.

Further, the invention can be improved as follows:

the acquisition unit is specifically used for acquiring the report to be processed, identifying all filled content areas in the report, identifying the data in each cell and judging the data type of the data in each cell, and storing the identified data and the data type in the text object.

Further, the invention can be improved as follows:

the recognition unit is specifically configured to extract the text information from the text object, split the text information into words by using a preset word segmentation algorithm, determine whether each word is a noun, a verb, an adjective or a function word, remove the word if it is a function word, use the remaining words as variable feature words, and store the extracted variable feature words in a part-of-speech recognition object.

Further, the invention can be improved as follows:

the classification unit is specifically configured to:

Further, the invention can be improved as follows:

the machine learning based variable automatic classification system further comprises:

a cleaning unit, used for cleaning the repeated records in each variable block of the variable word bank according to a preset cleaning rule, so as to construct a standard variable word bank.

Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

FIG. 1 is a schematic flow chart diagram of a method for automatic classification of variables according to an embodiment of the present invention;

FIG. 2 is a data diagram of an embodiment of the present invention;

FIG. 3 is another data diagram in accordance with an embodiment of the present invention;

FIG. 4 is a structural framework diagram provided by an embodiment of the automatic variable classification system of the present invention.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth to illustrate, but are not to be construed to limit the scope of the invention.

Fig. 1 is a schematic flow chart provided by an embodiment of the automatic variable classification method of the present invention. The method is implemented based on machine learning, is suitable for automatic variable classification of statistical reports, and includes:

s1, acquiring the report to be processed, extracting the text information of the report, and storing the identified text information in the text object;

It should be noted that the text information of the report may be the text in the main column. For example, fig. 2 shows an exemplary report: the left part of the report is the main column, which records the data items of the report, and the top of the report is the guest column, which records the statistical manner of the corresponding data items. In a report, the area where the main column and the guest column intersect is the filling area, which usually holds the filled-in data, and which indexes a piece of data corresponds to is determined by the contents of the main column and the header.

S2, extracting text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable characteristic words from the words, and storing the extracted variable characteristic words in a part-of-speech recognition object;

It should be noted that the word segmentation algorithm is used to split the text into words and phrases and to distinguish nouns, verbs, adjectives, function words, and the like, where function words may include interjections, prepositions, conjunctions, and the like. This can be implemented by existing programs and is not described in detail here.

Using function words as feature words introduces considerable noise, which directly reduces the efficiency and accuracy of the subsequent variable classification. Therefore, when variable features are extracted, the function words, which are of little use for classification, should be removed first. Among content words, nouns and verbs are the most expressive for variable classification, so preferably only nouns and verbs are extracted as the feature words of variables.

S3, extracting variable characteristic words from the part-of-speech recognition objects, and comparing the extracted variable characteristic words with variables in a variable word bank to form a classification rule for extracting the characteristic words;

It should be understood that if a feature word has no corresponding variable in the variable lexicon, the feature word is placed in a to-be-processed object.

It should be noted that, for the comparison between the single feature word and the variable in the variable lexicon, it can be determined whether the feature word is completely matched with the element in the variable lexicon.

For single feature words that cannot be matched to elements of the variable lexicon, combinations of several feature words can be compared with the variable lexicon: the feature words are combined repeatedly, and each combination is matched against the elements of the variable lexicon.

For example, suppose two feature words, "plan" and "total investment", have been identified, and "plan" cannot be matched in the variable lexicon. Then "plan" and "total investment" can be combined into "plan total investment", and the combined word is used to match again.

The matched and unmatched feature words are then processed through intelligent recognition using NLP (natural language processing). For example, they can be divided into variable blocks such as { coding, grouping, region, measurement unit, index }, and the recognized feature words are extracted and then classified according to the variable blocks.
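The matching in step S3 (exact match first, then a retry with combined feature words, with leftovers set aside for later processing) can be sketched as follows. This is a minimal illustration, not the invention's implementation: the lexicon contents, the function name `match_feature_words`, and the choice to combine a word only with the words that follow it are all assumptions.

```python
# Hypothetical variable lexicon: variable name -> variable type.
VARIABLE_LEXICON = {
    "total investment": "index",
    "plan total investment": "index",
    "ten thousand yuan": "measurement unit",
}


def match_feature_words(words, lexicon, joiner=" "):
    """Return (matched, pending) for a list of feature words.

    matched maps a word or word combination to its variable type;
    pending lists the words that could not be matched at all.
    """
    matched = {}
    pending = []
    for i, word in enumerate(words):
        if word in lexicon:  # exact match of a single feature word
            matched[word] = lexicon[word]
            continue
        combined = word
        found = False
        for following in words[i + 1:]:  # grow the combination step by step
            combined = combined + joiner + following
            if combined in lexicon:
                matched[combined] = lexicon[combined]
                found = True
                break
        if not found:
            pending.append(word)  # kept for later processing
    return matched, pending
```

With the feature words "plan" and "total investment", the sketch fails to match "plan" on its own, matches the combination "plan total investment", and also matches "total investment" directly.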

And S4, extracting the variable feature words into corresponding variable blocks according to the classification rules.

For example, after extracting the feature words, the feature words are compared with the classification rules in the variable lexicon to form the following rules:

when the feature word is a region or a code, the feature word is added to the code block;

when the feature word is a grouping, the feature word is added to the grouping block;

when the feature word is a measurement unit, the feature word is added to the measurement unit block;

when the feature word is an index, the feature word is added to the metering index block.
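The four rules above amount to a lookup from the type of a feature word to its target block. A minimal sketch, assuming the types are available as strings and using hypothetical block names (the "pending" block for unknown types is also an assumption):

```python
from collections import defaultdict

# Variable type -> target block; regions and codes share the code block.
BLOCK_FOR_TYPE = {
    "region": "code",
    "code": "code",
    "grouping": "grouping",
    "measurement unit": "measurement unit",
    "index": "metering index",
}


def classify(typed_words):
    """Distribute {feature word: variable type} into variable blocks."""
    blocks = defaultdict(list)
    for word, variable_type in typed_words.items():
        block = BLOCK_FOR_TYPE.get(variable_type, "pending")  # unknown types wait
        blocks[block].append(word)
    return dict(blocks)
```

Keeping the rules in a table rather than in branching code makes it straightforward to add a new variable type without touching the classification function.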

The method for automatically classifying variables provided by the embodiment is realized based on machine learning, is suitable for automatically classifying the variables of the statistical report, extracts the text information of the report, respectively performs part-of-speech recognition and feature extraction in sequence, compares the extracted text information with the variables in the variable lexicon to construct classification rules, and automatically classifies according to the classification rules, so that the method for establishing the automatic classification of the variables by using machine learning is realized, and the complicated variable recognition work in the data statistics process can be solved.

Optionally, in some possible embodiments, the obtaining of the report to be processed, extracting the text information of the report, and storing the identified text information in the text object specifically includes:

Specifically, all the filled content areas in the report can be identified, and the type of the filled data of the content areas is judged one by one.

Specifically, the following rules can be used for judgment:

for example, a prefix ending in a colon can be deleted, such as "wherein:", "in total:", and the like;

for example, the unit of measure in brackets in the text can be identified, such as "ten thousand yuan", "hundred million yuan", "ton", and the like;

for example, a report period in brackets in the text can be identified through regular-expression matching;

for example, known useless remark information can be deleted, such as: { one, two, three, four, five, one after another, two after another, three after another };

for example, other information can be identified: for a cell containing "(the same period of the previous year is 100%)" or "(the same period of the previous year is 100)", "%" is added as the measurement unit.

By identifying and processing the filled content area, the subsequent steps of part-of-speech identification, feature extraction and the like on the data can be conveniently carried out, so that the classification precision is improved.
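The first three rules above (dropping a colon prefix, pulling a measurement unit out of brackets, and recognizing a report period by regular-expression matching) can be sketched as below. The unit list, the period format `YYYY` or `YYYY-MM`, and the name `normalize_label` are assumptions for illustration; only ASCII colons and brackets are handled, and the remark and "%" rules are not covered.

```python
import re

KNOWN_UNITS = {"ten thousand yuan", "hundred million yuan", "ton"}
PERIOD_PATTERN = re.compile(r"\d{4}(?:-\d{1,2})?")  # assumed period format
BRACKET_PATTERN = re.compile(r"\(([^()]*)\)")


def normalize_label(text):
    """Split a header label into cleaned text, measurement unit and period."""
    result = {"text": "", "unit": None, "period": None}

    # Rule 1: delete everything up to and including the first colon.
    if ":" in text:
        text = text.split(":", 1)[1]

    # Rules 2 and 3: inspect each bracketed fragment.
    def handle_bracket(match):
        inner = match.group(1).strip()
        if inner in KNOWN_UNITS:
            result["unit"] = inner
            return ""
        if PERIOD_PATTERN.fullmatch(inner):
            result["period"] = inner
            return ""
        return match.group(0)  # leave unrecognized brackets untouched

    text = BRACKET_PATTERN.sub(handle_bracket, text)
    result["text"] = " ".join(text.split())  # collapse leftover whitespace
    return result
```

For the label "wherein: total investment (ten thousand yuan)", the sketch returns the text "total investment" with "ten thousand yuan" as the unit.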

Optionally, in some possible embodiments, extracting text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in the part-of-speech recognition object specifically includes:

extracting text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, determining whether each word is a noun, a verb, an adjective or a function word, removing the word if it is a function word, using the remaining words as variable feature words, and storing the extracted variable feature words in a part-of-speech recognition object.

Using function words as feature words introduces considerable noise, which directly reduces the efficiency and accuracy of the subsequent variable classification. Therefore, when the variable features are extracted, function words, which are of little use for classification, are removed, and words that are strongly expressive for variable classification, such as content words, are used, so that the efficiency and accuracy of the subsequent variable classification can be further improved.

Preferably, only nouns and verbs may be extracted as feature words of variables.
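The segmentation and part-of-speech filtering can be sketched with a toy example. The text only says that a preset word segmentation algorithm is used; forward maximum matching is chosen here purely for illustration, and the mini-lexicon with its part-of-speech tags is hypothetical.

```python
# Hypothetical mini-lexicon: word -> part of speech.
LEXICON = {
    "本年": "noun",      # current year
    "计划": "noun",      # plan
    "总投资": "noun",    # total investment
    "完成": "verb",      # complete
    "的": "function",    # structural particle
    "和": "function",    # conjunction "and"
}
MAX_LEN = max(len(word) for word in LEXICON)


def segment(text):
    """Forward maximum matching: take the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan keeps advancing.
            if piece in LEXICON or size == 1:
                words.append(piece)
                i += size
                break
    return words


def feature_words(text, keep=("noun", "verb")):
    """Keep only words whose part of speech is in `keep`.

    Function words and words missing from the lexicon are dropped.
    """
    return [word for word in segment(text) if LEXICON.get(word) in keep]
```

In practice an existing segmenter with part-of-speech tagging would replace `segment` and `LEXICON`; the filtering step stays the same.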

Optionally, in some possible embodiments, extracting the variable feature words into corresponding variable blocks according to a classification rule specifically includes:

when the variable characteristic words are regions or codes, adding the variable characteristic words into the code blocks;

when the variable characteristic words are measuring units, adding the variable characteristic words into the measuring unit blocks;

and when the variable characteristic words are indexes, adding the variable characteristic words into the metering index block.

By adding the variable feature words to different variable blocks according to the types of the variable feature words, accurate variable classification can be achieved.

Optionally, in some possible embodiments, the method for automatically classifying variables based on machine learning further includes:

cleaning the repeated records in each variable block of the variable word bank according to a preset cleaning rule, so as to construct a standard variable word bank.

Duplicate records in the word bank blocks are cleaned and a version of the word bank that stores the variables is constructed, which makes the subsequent automatic identification method convenient to use.
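A minimal sketch of such a cleaning pass is shown below. The cleaning rule is not specified in the text, so this example assumes one: two records count as duplicates when they are equal after collapsing whitespace and ignoring case. The name `clean_blocks` is hypothetical.

```python
def clean_blocks(blocks):
    """Remove duplicate records from every variable block, keeping first occurrences."""
    cleaned = {}
    for block_name, records in blocks.items():
        seen = set()
        unique = []
        for record in records:
            normalized = " ".join(record.split())  # collapse runs of whitespace
            key = normalized.casefold()            # assumed rule: ignore case
            if key and key not in seen:
                seen.add(key)
                unique.append(normalized)
        cleaned[block_name] = unique
    return cleaned
```

The order of the surviving records is preserved, so the cleaned word bank stays stable between runs.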

A specific example description is given below in conjunction with fig. 2 and 3.

First, after a report is imported, all the text of the report and its fillable area are identified, and the main column and the guest column are determined. Part-of-speech recognition can then be performed: the text in the main column and the guest column is extracted and compared with the business lexicon to determine whether it is a variable, namely an index, a grouping, a unit, etc.

Second, by examining the characteristics of the variables, some variables are found to follow rules, such as:

"yuan" is recognized as: [measurement unit] currency.

"gas" is recognized as: energy variable; resource.

"electric power" is recognized as: energy variable; infrastructure, civilian.

The range of the corresponding variable is then found through the operation rules.

Third, from the number of variable feature words in main-column text such as "power, gas and water supply industry", the most probable variables (e.g., indexes, groups) are determined by combining the feature words with the indexes in the variable library.

Finally, according to the other contents of the current report, it is determined which indexes can be marked on the cells where the main column and the guest column intersect, that is, the data filling area, and the corresponding variable components are automatically marked in the data filling cells, as shown in fig. 3.

When the user fills in the data, the filled-in numbers have the attributes of variables, such as indexes, groups, units.
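The marking of the filling area can be pictured as each intersecting cell inheriting the variables recognized for its row heading and its column heading. The sketch below assumes those variables are already available as dictionaries keyed by row and column number; the name `tag_fill_cells` and the rule that column variables win on a clash are assumptions.

```python
def tag_fill_cells(row_variables, col_variables):
    """Give every cell at a row/column intersection the variables of both headings."""
    tags = {}
    for row, row_vars in row_variables.items():
        for col, col_vars in col_variables.items():
            # Column variables are applied last, so they win on a key clash.
            tags[(row, col)] = {**row_vars, **col_vars}
    return tags
```

A number typed into cell (1, 1) can then be stored together with `tags[(1, 1)]`, which is how the filled-in data acquires its index, grouping and unit.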

It is to be understood that in some possible implementations, some other embodiments may include all or part of any of the above-described implementations, as long as they are implemented.

Fig. 4 is a structural framework diagram provided by an embodiment of the automatic variable classification system of the present invention. The system is implemented based on machine learning, is suitable for automatic variable classification of statistical reports, and includes:

an acquisition unit 1, used for acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;

the recognition unit 2 is used for extracting text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable characteristic words from the words, and storing the extracted variable characteristic words in a part-of-speech recognition object;

the matching unit 3 is used for extracting variable characteristic words from the part-of-speech recognition objects, comparing the extracted variable characteristic words with variables in a variable word bank and forming a classification rule for extracting the characteristic words;

and the classification unit 4 is used for extracting the variable feature words into corresponding variable blocks according to the classification rules.

The automatic variable classification system provided by this embodiment is implemented based on machine learning, and is suitable for automatic variable classification of statistical reports, and by extracting text information of reports, respectively performing part-of-speech recognition and feature extraction in sequence, then comparing with variables in a variable lexicon, constructing classification rules, and then performing automatic classification according to the classification rules, a method for creating automatic variable classification by using machine learning is implemented, and complicated variable identification work in a data statistics process can be solved.

Optionally, in some possible embodiments, the obtaining unit 1 is specifically configured to obtain a report to be processed, identify all filled content areas in the report, identify data in each cell and determine a data type of the data in each cell, and store the identified data and the data type in the text object.

Optionally, in some possible embodiments, the recognition unit 2 is specifically configured to extract text information from the text object, split the text information into words by using a preset word segmentation algorithm, determine whether each word is a noun, a verb, an adjective, or a function word, remove the word if it is a function word, use the remaining words as variable feature words, and store the extracted variable feature words in the part-of-speech recognition object.

Optionally, in some possible embodiments, the classification unit 4 is specifically configured to:

Optionally, in some possible embodiments, the machine learning based variable automatic classification system further includes:

a cleaning unit, used for cleaning the repeated records in each variable block of the variable word bank according to a preset cleaning rule, so as to construct a standard variable word bank.

It should be understood that the above embodiments are product embodiments corresponding to the method embodiments of the present invention, and the technical solutions of the two embodiments correspond, so that the detailed description of the product embodiments may refer to the above method embodiments, and will not be described herein again.

It is to be understood that the present invention may also provide a storage medium having stored therein instructions that, when read by a computer, cause the computer to execute the method for automatic classification of variables based on machine learning according to any of the embodiments described above.

It is to be understood that the present invention may also provide an electronic device comprising:

a memory for storing a computer program;

a processor for executing a computer program to implement the method for automatic classification of variables based on machine learning as described in any of the above embodiments.

The reader should understand that in the description of this specification, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples described in this specification, and the features of different embodiments or examples, can be combined by one skilled in the art provided that they do not contradict each other.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the method embodiments described above are merely illustrative: the division into steps is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple steps may be combined or integrated into another step, or some features may be omitted or not performed.

The above method, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A variable automatic classification method based on machine learning is characterized by comprising the following steps:

2. The machine learning-based variable automatic classification method according to claim 1, wherein the method comprises the steps of obtaining a report to be processed, extracting text information of the report, and storing the identified text information in a text object, and specifically comprises the steps of:

3. The method according to claim 1, wherein the method for automatically classifying variables based on machine learning is characterized in that the text information is extracted from the text object, the text information is divided into words by using a preset word segmentation algorithm, variable feature words are extracted from the words, and the extracted variable feature words are stored in a part-of-speech recognition object, and specifically comprises:

4. The machine learning-based variable automatic classification method according to claim 1, wherein the extracting the variable feature words into corresponding variable blocks according to the classification rules specifically includes:

5. The machine-learning based variable automatic classification method according to any one of claims 1 to 4, further comprising:

6. A machine learning based variable automatic classification system, comprising:

7. The machine-learning-based variable automatic classification system according to claim 6, wherein the obtaining unit is specifically configured to obtain a report to be processed, identify all filled content areas in the report, identify data in each cell and determine a data type of the data in each cell, and store the identified data and the data type in a text object.

8. The machine-learning-based variable automatic classification system according to claim 6, wherein the recognition unit is specifically configured to extract the text information from the text object, split the text information into words by using a preset word segmentation algorithm, determine that each word is a noun, a verb, an adjective or an imaginary word, if the word is an imaginary word, remove the corresponding word, use the remaining words as variable feature words, and store the extracted variable feature words in a part-of-speech recognition object.

9. The machine-learning based variable automatic classification system according to claim 6, characterized in that the classification unit is specifically configured to:

10. The machine-learning based variable automatic classification system according to any of claims 6 to 9, further comprising: