CN112381143B - Automatic variable classification method and system based on machine learning - Google Patents

Info

Publication number
CN112381143B
Authority
CN
China
Prior art keywords
variable
feature words
words
extracting
text information
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011272803.4A
Other languages
Chinese (zh)
Other versions
CN112381143A (en
Inventor
魏强
孙向学
张上亚
王臣亮
张学敬
翟迪
马静静
郁峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Great Wall Technology Co ltd
Original Assignee
New Great Wall Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The application discloses an automatic variable classification method and system based on machine learning, and relates to the technical field of information processing. The method comprises the following steps: acquiring a report to be processed, and extracting text information of the report; extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, and extracting variable feature words from the words; extracting the variable feature words from the part-of-speech recognition object, and comparing the extracted variable feature words with variables in a variable word stock to form classification rules for extracting feature words; and extracting the variable feature words into corresponding variable blocks according to the classification rules. The automatic variable classification method provided by the application is implemented based on machine learning, is suitable for automatic variable classification of statistical reports, and can handle the complex variable identification work in the data statistics process.

Description

Automatic variable classification method and system based on machine learning
Technical Field
The application relates to the technical field of information processing, in particular to a variable automatic classification method and system based on machine learning.
Background
At present, when the data of statistical reports are integrated, the text in the main and guest columns is mostly identified manually or by a program in order to distinguish whether the variables in those columns are indexes or grouping items. Program-based identification of variables has a high error rate and requires manual verification, which places high demands on the business skills of personnel, and manual identification errors also occur frequently.
Disclosure of Invention
The application aims to solve the technical problem of providing a variable automatic classification method and system based on machine learning aiming at the defects of the prior art.
The technical scheme for solving the technical problems is as follows:
a machine learning based automatic classification method for variables, comprising:
acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;
extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in a part-of-speech recognition object;
extracting the variable feature words from the part-of-speech recognition objects, and comparing the extracted variable feature words with variables in a variable word stock to form classification rules for extracting feature words;
and extracting the variable feature words into corresponding variable blocks according to the classification rules.
The automatic variable classification method provided by the application is realized based on machine learning, is suitable for automatic variable classification of statistical reports, and can solve complex variable identification work in the data statistics process by extracting text information of the reports, sequentially carrying out part-of-speech identification and feature extraction respectively, then comparing the extracted text information with variables in a variable word stock, constructing classification rules, and then carrying out automatic classification according to the classification rules.
Further, the application can be modified as follows:
acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object, wherein the method specifically comprises the following steps:
and acquiring a report to be processed, identifying all filled content areas in the report, identifying data in each cell, judging the data type of the data in each cell, and storing the identified data and the data type in a text object.
The beneficial effects of adopting the further scheme are as follows: by carrying out recognition processing on the filled content area, the following steps of part-of-speech recognition, feature extraction and the like of the data can be conveniently carried out, so that the classification precision is improved.
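The cell-level step described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function names `detect_cell_type` and `build_text_object`, the three type labels, and the representation of a report as a list of rows are all assumptions made for the example.

```python
from typing import Any, Dict, List


def detect_cell_type(value: Any) -> str:
    """Return a coarse data type for one cell: 'empty', 'number' or 'text'."""
    if value is None or str(value).strip() == "":
        return "empty"
    # Thousands separators are dropped before the numeric test (an assumed convention).
    text = str(value).strip().replace(",", "")
    try:
        float(text)
        return "number"
    except ValueError:
        return "text"


def build_text_object(report: List[List[Any]]) -> List[Dict[str, Any]]:
    """Collect every filled cell together with its position and detected type."""
    text_object = []
    for r, row in enumerate(report):
        for c, value in enumerate(row):
            cell_type = detect_cell_type(value)
            if cell_type != "empty":
                text_object.append(
                    {"row": r, "col": c, "value": value, "type": cell_type}
                )
    return text_object
```

Keeping the position with each value is a design choice of this sketch; it lets later steps relate a header cell to the data cells it describes.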
Further, the application can be modified as follows:
extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in a part-of-speech recognition object, wherein the method specifically comprises the following steps of:
extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, determining whether each word is a noun, a verb, an adjective or a function word, removing the word if it is a function word, taking the remaining words as variable feature words, and storing the extracted variable feature words in the part-of-speech recognition object.
The beneficial effects of adopting the further scheme are as follows: using function words as feature words introduces considerable noise, which directly reduces the efficiency and accuracy of the subsequent variable classification. Therefore, when the variable features are extracted, function words, which are of little use for classification, are removed, and words that are more expressive for variable classification, such as content words, are used, so that the efficiency and accuracy of the subsequent variable classification can be further improved.
Further, the application can be modified as follows:
extracting the variable feature words into corresponding variable blocks according to the classification rules, wherein the method specifically comprises the following steps:
when the variable feature words are regions or codes, adding the variable feature words into code blocks;
when the variable feature words are groupings, adding the variable feature words into a grouping block;
when the variable feature words are measurement units, adding the variable feature words into a measurement unit block;
and when the variable feature words are indexes, adding the variable feature words into a metering index block.
The beneficial effects of adopting the further scheme are as follows: by adding variable feature words to different variable blocks according to their types, accurate variable classification can be achieved.
Further, the application can be modified as follows:
the automatic variable classification method based on machine learning further comprises the following steps:
and cleaning the duplicate records in each variable block of the variable word stock according to a preset cleaning rule, so as to construct a standard variable word stock.
The beneficial effects of adopting the further scheme are as follows: the duplicate records in the word stock blocks are cleaned and a standard version of the variable word stock is constructed, which makes the word stock convenient to use in the subsequent automatic identification.
The other technical scheme for solving the technical problems is as follows:
a machine learning based variable automatic classification system comprising:
the acquisition unit is used for acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;
the recognition unit is used for extracting the text information from the text object, splitting the text information into words by utilizing a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in the part-of-speech recognition object;
the matching unit is used for extracting the variable feature words from the part-of-speech recognition objects, and comparing the extracted variable feature words with variables in a variable word stock to form classification rules for extracting feature words;
and the classification unit is used for extracting the variable feature words into corresponding variable blocks according to the classification rules.
The automatic variable classification system is realized based on machine learning, is suitable for automatic variable classification of statistical reports, extracts text information of the reports, sequentially performs part-of-speech recognition and feature extraction respectively, compares the extracted text information with variables in a variable word stock, constructs classification rules, and automatically classifies the variable according to the classification rules, so that a method for creating automatic variable classification by machine learning is realized, and complex variable recognition work in a data statistics process can be solved.
Further, the application can be modified as follows:
the acquisition unit is specifically configured to acquire a report to be processed, identify all filled content areas in the report, identify data in each cell, determine a data type of the data in each cell, and store the identified data and data types in a text object.
The beneficial effects of adopting the further scheme are as follows: by carrying out recognition processing on the filled content area, the following steps of part-of-speech recognition, feature extraction and the like of the data can be conveniently carried out, so that the classification precision is improved.
Further, the application can be modified as follows:
the recognition unit is specifically configured to extract the text information from the text object, split the text information into words by using a preset word segmentation algorithm, determine whether each word is a noun, a verb, an adjective or a function word, remove the word if it is a function word, take the remaining words as variable feature words, and store the extracted variable feature words in the part-of-speech recognition object.
The beneficial effects of adopting the further scheme are as follows: using function words as feature words introduces considerable noise, which directly reduces the efficiency and accuracy of the subsequent variable classification. Therefore, when the variable features are extracted, function words, which are of little use for classification, are removed, and words that are more expressive for variable classification, such as content words, are used, so that the efficiency and accuracy of the subsequent variable classification can be further improved.
Further, the application can be modified as follows:
the classifying unit is specifically used for:
when the variable feature words are regions or codes, adding the variable feature words into code blocks;
when the variable feature words are groupings, adding the variable feature words into a grouping block;
when the variable feature words are measurement units, adding the variable feature words into a measurement unit block;
and when the variable feature words are indexes, adding the variable feature words into a metering index block.
The beneficial effects of adopting the further scheme are as follows: by adding variable feature words to different variable blocks according to their types, accurate variable classification can be achieved.
Further, the application can be modified as follows:
the automatic variable classification system based on machine learning further comprises:
and the cleaning unit is used for cleaning the duplicate records in each variable block of the variable word stock according to a preset cleaning rule, so as to construct a standard variable word stock.
The beneficial effects of adopting the further scheme are as follows: the duplicate records in the word stock blocks are cleaned and a standard version of the variable word stock is constructed, which makes the word stock convenient to use in the subsequent automatic identification.
Additional aspects of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of an automatic variable classification method according to the present application;
FIG. 2 is a diagram of data according to an embodiment of the present application;
FIG. 3 is another schematic diagram of data in an embodiment of the present application;
FIG. 4 is a block diagram of an embodiment of an automatic variable classification system according to the present application.
Detailed Description
The principles and features of the present application are described below with reference to the drawings, the illustrated embodiments are provided for illustration only and are not intended to limit the scope of the present application.
As shown in fig. 1, a flow chart is provided for an embodiment of the automatic variable classification method of the present application, where the automatic variable classification method is implemented based on machine learning and is applicable to automatic variable classification of statistical type report forms, and the automatic variable classification method includes:
S1, acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;
it should be noted that the text information of the report may be the text in the main column and the guest column. For example, fig. 2 gives an exemplary report: the left part of the report is the main column, which records the data items of the report, and the top of the report is the guest column, which records the statistical mode corresponding to those data items. In a report table, the area where the main column and the guest column intersect is the filling area, which usually holds the filled data, and the index corresponding to each datum is determined by the contents of the main column and the table header.
S2, extracting text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in the part-of-speech recognition object;
it should be noted that the word segmentation algorithm is used to split the text into words and to distinguish nouns, verbs, adjectives and function words, where function words may include interjections, prepositions, conjunctions and the like. This can be implemented with existing programs and is not described in detail here.
Using function words as feature words introduces considerable noise and directly reduces the efficiency and accuracy of the subsequent variable classification. Therefore, when the variable features are extracted, function words, which are of no use for classification, are removed first. Among content words, nouns and verbs are the most expressive for variable classification, so preferably only nouns and verbs are extracted as the feature words of the variables.
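The part-of-speech filtering described above can be sketched as follows. The sketch assumes the word segmentation step has already produced (word, tag) pairs; the tag names and the function `extract_feature_words` are hypothetical, since the description does not name a specific segmenter or tag set.

```python
from typing import List, Tuple

# Hypothetical coarse tags; a real segmenter would supply its own tag set.
CONTENT_TAGS = {"noun", "verb", "adjective"}
PREFERRED_TAGS = {"noun", "verb"}


def extract_feature_words(
    tagged: List[Tuple[str, str]], preferred_only: bool = False
) -> List[str]:
    """Keep content words and drop function words such as prepositions.

    With preferred_only=True only nouns and verbs are kept, which mirrors
    the preferred variant mentioned in the description.
    """
    allowed = PREFERRED_TAGS if preferred_only else CONTENT_TAGS
    return [word for word, tag in tagged if tag in allowed]


tagged_words = [
    ("total", "adjective"),
    ("investment", "noun"),
    ("of", "preposition"),
    ("plan", "noun"),
]
feature_words = extract_feature_words(tagged_words)
```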
S3, extracting variable feature words from the part-of-speech recognition objects, and comparing the extracted variable feature words with variables in a variable word stock to form classification rules for extracting feature words;
it should be understood that if the corresponding variable feature word does not exist in the variable word stock, the feature word is placed in the object to be processed.
It should be noted that, for the variable comparison of the single feature word and the variable word stock, it can be determined whether the feature word and the element in the variable word stock are completely matched.
When a single feature word cannot be matched to an element in the variable word stock, several feature words can be combined and the combination compared with the variable word stock; the feature words may be combined several times, and each combination is matched against the elements in the variable word stock.
For example, assume that two feature words, "plan" and "total investment", have been identified, and that "plan" on its own has no match in the variable word stock. Then "plan" and "total investment" may be combined to obtain "plan total investment", and the match is attempted again with the combined word.
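The exact-match-then-combine strategy can be sketched as follows. This is a minimal sketch under several assumptions: the variable word stock is modelled as a dictionary from variable name to variable type, only adjacent feature words are combined, and the names `match_feature_words`, `max_combine` and `joiner` are hypothetical. For Chinese text the joiner would presumably be an empty string.

```python
from typing import Dict, List, Tuple


def match_feature_words(
    words: List[str],
    lexicon: Dict[str, str],
    max_combine: int = 3,
    joiner: str = " ",
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Match feature words against the lexicon, combining neighbours on a miss.

    Returns (matched, pending): matched holds (variable, type) pairs, and
    pending holds the words that found no match even after combination.
    """
    matched: List[Tuple[str, str]] = []
    pending: List[str] = []
    i = 0
    while i < len(words):
        # First try an exact match on the single feature word.
        if words[i] in lexicon:
            matched.append((words[i], lexicon[words[i]]))
            i += 1
            continue
        # Otherwise combine it with the following words, two at a time and up.
        longest = min(max_combine, len(words) - i)
        for n in range(2, longest + 1):
            candidate = joiner.join(words[i:i + n])
            if candidate in lexicon:
                matched.append((candidate, lexicon[candidate]))
                i += n
                break
        else:
            # No combination matched: keep the word for later processing.
            pending.append(words[i])
            i += 1
    return matched, pending
```

With a lexicon containing "plan total investment", the words "plan" and "total investment" are consumed together as one matched variable, while a word with no match in any combination ends up in the pending list.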
Then, the matched and unmatched feature words are processed by intelligent recognition based on NLP (natural language processing); for example, the feature words can be divided into variable blocks such as {code, group, region, measurement unit, index}, the recognized feature words are extracted, and they are then classified according to these variable types.
And S4, extracting the variable feature words into corresponding variable blocks according to the classification rules.
For example, after the feature words are extracted, they are compared with the variable word stock to form the following classification rules:
when the feature word is a region or a code, the feature word is added to the code block;
when the feature word is a grouping, the feature word is added to the grouping block;
when the feature word is a measurement unit, the feature word is added to the measurement unit block;
when the feature word is an index, the feature word is added to the metering index block.
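The four rules above amount to a dispatch from variable type to variable block, which can be sketched as follows. The type labels, the block names and the extra "pending" block for unrecognised types are assumptions of this sketch rather than terms fixed by the description.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed mapping from variable type to variable block; regions and codes
# share the code block, as in the rules above.
BLOCK_FOR_TYPE = {
    "region": "code",
    "code": "code",
    "group": "group",
    "unit": "unit",
    "index": "index",
}


def assign_to_blocks(classified: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Place each (feature word, variable type) pair into its variable block."""
    blocks: Dict[str, List[str]] = defaultdict(list)
    for word, var_type in classified:
        # Unknown types go to a 'pending' block (an assumption of this sketch).
        blocks[BLOCK_FOR_TYPE.get(var_type, "pending")].append(word)
    return dict(blocks)
```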
The automatic variable classification method provided by the embodiment is realized based on machine learning, is suitable for automatic variable classification of statistical reports, and can solve the complex variable identification work in the data statistics process by extracting text information of the reports, sequentially carrying out part-of-speech identification and feature extraction respectively, then comparing the extracted text information with variables in a variable word stock, constructing classification rules, and then carrying out automatic classification according to the classification rules.
Optionally, in some possible embodiments, the method includes obtaining a report to be processed, extracting text information of the report, and storing the identified text information in a text object, including:
the method comprises the steps of obtaining a report to be processed, identifying all filled content areas in the report, identifying data in each cell, judging data types of the data in each cell, and storing the identified data and the data types in a text object.
Specifically, all the filled content areas in the report can be identified, and the type judgment of filling data of each unit cell is carried out on the content areas.
The method can be specifically judged according to the following rules:
for example, the content before a colon may be deleted, such as "wherein:", "in the aggregate:", etc.;
for example, the measurement units in text brackets may be identified, such as "ten thousand yuan", "hundred million yuan", "ton", etc.;
for example, the reporting period in text brackets may be identified through regular-expression matching;
for example, known useless remark information may be deleted, such as {one, two, three, four, five, continued one, continued two, continued three};
for example, other information may be identified, such as "(100% of the year in the same period)" and the like.
By carrying out recognition processing on the filled content area, the following steps of part-of-speech recognition, feature extraction and the like of the data can be conveniently carried out, so that the classification precision is improved.
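The judging rules listed above can be sketched as a small preprocessing function. This is an illustrative sketch only: the unit list, the remark list, the reporting-period pattern and the function name `preprocess_cell` are assumptions, and a production rule set would be considerably larger.

```python
import re
from typing import Dict, Optional

# Hypothetical rule tables; the concrete entries are assumptions for illustration.
KNOWN_UNITS = {"ten thousand yuan", "hundred million yuan", "ton"}
REMARK_TOKENS = {"one", "two", "three", "continued one", "continued two"}
PERIOD_PATTERN = re.compile(r"^\d{4}(-\d{2})?$")  # e.g. "2020" or "2020-11"
BRACKETED = re.compile(r"\(([^()]*)\)")


def preprocess_cell(text: str) -> Dict[str, Optional[str]]:
    """Split a header cell into its label, measurement unit and reporting period."""
    result: Dict[str, Optional[str]] = {"label": "", "unit": None, "period": None}
    # Rule: drop a leading prefix that ends with a colon, such as "wherein:".
    text = re.split(r"[:：]", text, maxsplit=1)[-1]
    # Rules: recognise a unit or a reporting period inside each pair of brackets.
    for fragment in BRACKETED.findall(text):
        fragment = fragment.strip()
        if fragment in KNOWN_UNITS:
            result["unit"] = fragment
        elif PERIOD_PATTERN.match(fragment):
            result["period"] = fragment
    # Bracketed fragments are not part of the label itself.
    label = BRACKETED.sub("", text).strip()
    # Rule: a cell holding only a known remark token carries no variable.
    if label.lower() in REMARK_TOKENS:
        label = ""
    result["label"] = label
    return result
```

For instance, "wherein: total investment (ten thousand yuan)" yields the label "total investment" with the unit "ten thousand yuan", which leaves a clean string for the later word segmentation step.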
Optionally, in some possible embodiments, text information is extracted from the text object, the text information is split into words by using a preset word segmentation algorithm, variable feature words are extracted from the words, and the extracted variable feature words are stored in the part-of-speech recognition object, which specifically includes:
extracting text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, determining whether each word is a noun, a verb, an adjective or a function word, removing the word if it is a function word, taking the remaining words as variable feature words, and storing the extracted variable feature words in the part-of-speech recognition object.
Using function words as feature words introduces considerable noise, which directly reduces the efficiency and accuracy of the subsequent variable classification. Therefore, when the variable features are extracted, function words, which are of little use for classification, are removed, and words that are more expressive for variable classification, such as content words, are used, so that the efficiency and accuracy of the subsequent variable classification can be further improved.
Preferably, only nouns and verbs may be extracted as feature words of variables.
Optionally, in some possible embodiments, extracting the variable feature words into the corresponding variable blocks according to the classification rule specifically includes:
when the variable feature words are regions or codes, adding the variable feature words into the code blocks;
when the variable feature words are groupings, adding the variable feature words into a grouping block;
when the variable feature words are measurement units, adding the variable feature words into a measurement unit block;
when the variable feature words are indexes, the variable feature words are added into the metering index block.
By adding variable feature words to different variable blocks according to their types, accurate variable classification can be achieved.
Optionally, in some possible embodiments, the automatic classification method of variables based on machine learning further includes:
and cleaning the duplicate records in each variable block of the variable word stock according to a preset cleaning rule, so as to construct a standard variable word stock.
The duplicate records in the word stock blocks are cleaned and a standard version of the variable word stock is constructed, which makes the word stock convenient to use in the subsequent automatic identification.
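The cleaning step can be sketched as follows. The description does not specify the preset cleaning rule, so this sketch assumes one: two records are duplicates if they are equal after collapsing whitespace and ignoring case. The function name `clean_lexicon` is hypothetical.

```python
from typing import Dict, List


def clean_lexicon(blocks: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Remove duplicate records inside each variable block, keeping first-seen order."""
    cleaned: Dict[str, List[str]] = {}
    for name, words in blocks.items():
        seen = set()
        unique: List[str] = []
        for word in words:
            normalized = " ".join(word.split())  # collapse runs of whitespace
            key = normalized.lower()  # assumed rule: comparison ignores case
            if key and key not in seen:
                seen.add(key)
                unique.append(normalized)
        cleaned[name] = unique
    return cleaned
```

Deduplication is done per block, so the same word may still appear in two different blocks; whether that should be allowed is a policy decision that this sketch leaves open.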
Specific examples are given below in connection with fig. 2 and 3.
First, after the report is imported, all the text of the report and its fillable area are identified, and the main column and the guest column are determined. Part-of-speech recognition is then performed: the text in the main and guest columns is extracted and compared with the business word stock to determine whether it is a variable, namely an index, a grouping, a unit, etc.
Second, by examining the features of the variables, it is found that some variables follow regular patterns, such as:
"yuan" is identified as: currency.
"gas" is identified as: energy variable, resource.
"electric power" is identified as: infrastructure, civilian, energy variable.
Then find out which range the corresponding variable is in through the operation rule.
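One possible reading of this range lookup is a keyword table that maps a feature word to candidate ranges. The sketch below is built on that assumption; the table entries are taken loosely from the examples above, and the names `RANGE_RULES` and `lookup_ranges` are hypothetical.

```python
from typing import Dict, List

# Hypothetical operation rules mapping a keyword to the ranges it suggests.
RANGE_RULES: Dict[str, List[str]] = {
    "gas": ["energy", "resource"],
    "electric power": ["infrastructure", "civilian", "energy"],
}


def lookup_ranges(feature_word: str) -> List[str]:
    """Return the ranges suggested by every rule keyword found in the word."""
    lowered = feature_word.lower()
    ranges: List[str] = []
    for keyword, categories in RANGE_RULES.items():
        # Plain substring matching is used here, which is an assumption.
        if keyword in lowered:
            for category in categories:
                if category not in ranges:
                    ranges.append(category)
    return ranges
```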
Third, by counting how many variable feature words the main and guest columns of the report contain, such as "power, gas and water supply", and by combining the feature words and comparing the combinations with the indexes in the variable library, it is determined which variable type (e.g., index, group) a combination most probably matches.
Finally, the cells where the main column and the guest column intersect are determined, that is, which indexes the data filling area refers to is determined from the other contents of the current report, and the corresponding variable components are marked automatically, as shown in fig. 3.
When the user is filling in data, the filled-in number has the properties of the variables, such as index, group, unit.
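The way a filled-in number inherits variable properties from its row and column headers can be sketched as follows. The data layout (one dictionary of variable properties per main-column entry and per guest-column entry) and the function name `label_cells` are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

Properties = Dict[str, str]


def label_cells(
    main_column: List[Properties], guest_column: List[Properties]
) -> Dict[Tuple[int, int], Properties]:
    """Give each intersection cell the variable properties of its row and column."""
    labels: Dict[Tuple[int, int], Properties] = {}
    for r, row_props in enumerate(main_column):
        for c, col_props in enumerate(guest_column):
            merged = dict(col_props)
            # If both headers define the same property, the main column wins;
            # this precedence is an arbitrary choice made for the sketch.
            merged.update(row_props)
            labels[(r, c)] = merged
    return labels


main = [{"group": "power, gas and water supply"}]
guest = [{"index": "total investment", "unit": "ten thousand yuan"}]
cell_labels = label_cells(main, guest)
```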
It is to be understood that in some possible implementations, some other examples may include all or part of any of the implementations described above, as may be practical.
As shown in fig. 4, a structural framework diagram is provided for an embodiment of the automatic variable classification system of the present application, which is implemented based on machine learning and is suitable for automatic classification of variables of statistical type report forms, and the automatic variable classification system includes:
the acquisition unit 1 is used for acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;
the recognition unit 2 is used for extracting text information from the text object, splitting the text information into words by utilizing a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in the part-of-speech recognition object;
the matching unit 3 is used for extracting variable feature words from the part-of-speech recognition objects, and comparing the extracted variable feature words with variables in a variable word stock to form classification rules for extracting feature words;
and the classification unit 4 is used for extracting the variable feature words into corresponding variable blocks according to the classification rules.
The variable automatic classification system provided by the embodiment is realized based on machine learning, is suitable for automatic classification of variables of statistical reports, and can solve complex variable identification work in the data statistics process by extracting text information of the reports, sequentially carrying out part-of-speech identification and feature extraction respectively, then comparing the extracted text information with variables in a variable word stock, constructing classification rules, and then carrying out automatic classification according to the classification rules.
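The four units can be pictured as stages of a pipeline in which each unit consumes the output of the previous one. The sketch below shows only that composition; the class name `VariableClassificationSystem` and the toy stand-in functions are hypothetical and far simpler than the units described above.

```python
from typing import Callable, Dict, List, Tuple

Report = List[List[str]]
Classified = List[Tuple[str, str]]
Blocks = Dict[str, List[str]]


class VariableClassificationSystem:
    """Chains the acquisition, recognition, matching and classification units."""

    def __init__(
        self,
        acquire: Callable[[Report], List[str]],
        recognize: Callable[[List[str]], List[str]],
        match: Callable[[List[str]], Classified],
        classify: Callable[[Classified], Blocks],
    ) -> None:
        self.acquire = acquire
        self.recognize = recognize
        self.match = match
        self.classify = classify

    def run(self, report: Report) -> Blocks:
        text_object = self.acquire(report)           # acquisition unit 1
        feature_words = self.recognize(text_object)  # recognition unit 2
        classified = self.match(feature_words)       # matching unit 3
        return self.classify(classified)             # classification unit 4


# Toy stand-ins for the four units, for illustration only.
def acquire(report: Report) -> List[str]:
    return [cell for row in report for cell in row if cell.strip()]


def recognize(texts: List[str]) -> List[str]:
    # Assumes each cell is already a single content word or phrase.
    return [text.strip() for text in texts]


def make_matcher(lexicon: Dict[str, str]) -> Callable[[List[str]], Classified]:
    return lambda words: [(w, lexicon.get(w, "pending")) for w in words]


def classify(pairs: Classified) -> Blocks:
    blocks: Blocks = {}
    for word, var_type in pairs:
        blocks.setdefault(var_type, []).append(word)
    return blocks


system = VariableClassificationSystem(
    acquire, recognize, make_matcher({"ton": "unit", "total investment": "index"}), classify
)
result = system.run([["total investment", ""], ["ton", "foo"]])
```

Passing the units in as callables is one way to keep them independently replaceable, for example to swap in a different word segmentation algorithm.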
Optionally, in some possible embodiments, the obtaining unit 1 is specifically configured to obtain a report to be processed, identify all the filled content areas in the report, identify data in each cell, determine a data type of the data in each cell, and store the identified data and the data type in the text object.
By carrying out recognition processing on the filled content area, the following steps of part-of-speech recognition, feature extraction and the like of the data can be conveniently carried out, so that the classification precision is improved.
Optionally, in some possible embodiments, the recognition unit 2 is specifically configured to extract text information from the text object, split the text information into words by using a preset word segmentation algorithm, determine whether each word is a noun, a verb, an adjective or a function word, remove the word if it is a function word, use the remaining words as variable feature words, and store the extracted variable feature words in the part-of-speech recognition object.
Using function words as feature words introduces considerable noise, which directly reduces the efficiency and accuracy of the subsequent variable classification. Therefore, when the variable features are extracted, function words, which are of little use for classification, are removed, and words that are more expressive for variable classification, such as content words, are used, so that the efficiency and accuracy of the subsequent variable classification can be further improved.
Optionally, in some possible embodiments, the classification unit 4 is specifically configured to:
when the variable feature words are regions or codes, adding the variable feature words into the code blocks;
when the variable feature words are grouping, adding the variable feature words into a grouping block;
when the variable feature words are measurement units, adding the variable feature words into a measurement unit block;
when the variable feature words are indexes, the variable feature words are added into the metering index block.
By adding variable feature words to different variable blocks according to their types, accurate variable classification can be achieved.
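The four routing rules above can be sketched as a lookup from variable type to variable block. This is an illustration under stated assumptions: the variable word stock is modeled as a plain dictionary from word to type, the type and block names are hypothetical, and words that are not found in the word stock are simply collected separately, a case the text does not specify.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed mapping from variable type to variable block, following the four rules.
BLOCK_BY_TYPE: Dict[str, str] = {
    "region": "code_block",
    "code": "code_block",
    "grouping": "grouping_block",
    "measurement_unit": "measurement_unit_block",
    "index": "metering_index_block",
}


def classify_feature_words(
    feature_words: Iterable[str],
    word_stock: Dict[str, str],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Route each feature word into the block that matches its type.

    `word_stock` maps a known variable to its type (hypothetical layout).
    Returns the filled blocks and the list of words that matched no rule.
    """
    blocks: Dict[str, List[str]] = {
        name: [] for name in set(BLOCK_BY_TYPE.values())
    }
    unmatched: List[str] = []
    for word in feature_words:
        variable_type = word_stock.get(word)       # compare with the word stock
        block = BLOCK_BY_TYPE.get(variable_type)   # apply the classification rule
        if block is None:
            unmatched.append(word)
        else:
            blocks[block].append(word)
    return blocks, unmatched
```

Returning the unmatched words separately is a design choice of this sketch; it makes it easy to review words that the word stock does not yet cover.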
Optionally, in some possible embodiments, the machine learning based variable automatic classification system further comprises:
a cleaning unit, configured to cause the variable word stock to clean repeated records in each variable block according to a preset cleaning rule, so as to construct a standard variable word stock.
Cleaning the repeated records in the word stock blocks and constructing a standard version of the word stock variables makes the word stock convenient for the subsequent automatic identification method to use.
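The cleaning step can be sketched as per-block deduplication. The text does not state what the preset cleaning rule is, so the rule used here (trim whitespace, compare case-insensitively, keep the first occurrence) is an assumption, and `clean_word_stock` is a hypothetical name.

```python
from typing import Dict, List


def clean_word_stock(blocks: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Remove repeated records inside each variable block.

    Assumed cleaning rule: two records are duplicates when they are equal
    after stripping whitespace and case folding; the first one is kept.
    """
    cleaned: Dict[str, List[str]] = {}
    for block_name, words in blocks.items():
        seen = set()
        unique: List[str] = []
        for word in words:
            key = word.strip().casefold()
            if not key or key in seen:
                continue  # skip empty records and repeats
            seen.add(key)
            unique.append(word.strip())
        cleaned[block_name] = unique
    return cleaned
```

The function returns a new dictionary instead of modifying its input, so the uncleaned word stock remains available for comparison.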
It is to be understood that, in some possible implementations, other examples may include all or part of any of the implementations described above, as far as is practical.
It should be understood that the above embodiments are product embodiments corresponding to the method embodiments of the present application, and the technical solutions of the two embodiments correspond to each other, so that specific descriptions of the above product embodiments may refer to the above method embodiments, and are not repeated herein.
It will be appreciated that the present application may also provide a storage medium having stored therein instructions which, when read by a computer, cause the computer to perform the machine learning based variable automatic classification method of any of the above embodiments.
It can be appreciated that the present application may also provide an electronic device including:
a memory for storing a computer program;
and a processor for executing a computer program to implement the automatic classification method of variables based on machine learning according to any of the above embodiments.
The reader will appreciate that, in this specification, a reference to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those embodiments or examples, provided that they do not contradict one another.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the method embodiments described above are merely illustrative: the division into steps is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple steps may be combined or integrated into another step, or some features may be omitted or not performed.
The above-described method, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the present application, and these modifications and substitutions are intended to be included in the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (6)

1. A machine learning based automatic classification method for variables, comprising:
acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;
extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in a part-of-speech recognition object;
extracting the variable feature words from the part-of-speech recognition objects, and comparing the extracted variable feature words with variables in a variable word stock to form classification rules for extracting feature words;
extracting the variable feature words into corresponding variable blocks according to the classification rules;
wherein acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object specifically comprises:
acquiring a report to be processed, identifying all filled content areas in the report, identifying data in each cell, judging the data type of the data in each cell, and storing the identified data and the data type in a text object;
wherein extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in a part-of-speech recognition object specifically comprises:
extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, determining whether each word is a noun, a verb, an adjective or a function word, removing the word if it is a function word, taking the remaining words as variable feature words, and storing the extracted variable feature words in the part-of-speech recognition object.
2. The machine learning based automatic classification method for variables according to claim 1, wherein extracting the variable feature words into corresponding variable blocks according to the classification rules specifically comprises:
when the variable feature words are regions or codes, adding the variable feature words into code blocks;
when the variable feature words are groupings, adding the variable feature words into a grouping block;
when the variable feature words are measurement units, adding the variable feature words into a measurement unit block;
and when the variable feature words are indexes, adding the variable feature words into a metering index block.
3. The machine learning based automatic classification method for variables according to claim 1 or 2, further comprising:
cleaning, by the variable word stock, repeated records in each variable block according to a preset cleaning rule, to construct a standard variable word stock.
4. A machine learning based automatic variable classification system comprising:
the acquisition unit is used for acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;
the recognition unit is used for extracting the text information from the text object, splitting the text information into words by utilizing a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in the part-of-speech recognition object;
the matching unit is used for extracting the variable feature words from the part-of-speech recognition objects, and comparing the extracted variable feature words with variables in a variable word stock to form classification rules for extracting feature words;
the classification unit is used for extracting the variable feature words into corresponding variable blocks according to the classification rules;
wherein the acquisition unit is specifically configured to acquire a report to be processed, identify all filled content areas in the report, identify the data in each cell, judge the data type of the data in each cell, and store the identified data and data types in a text object;
and the recognition unit is specifically configured to extract the text information from the text object, split the text information into words by using a preset word segmentation algorithm, determine whether each word is a noun, a verb, an adjective or a function word, reject the word if it is a function word, take the remaining words as variable feature words, and store the extracted variable feature words in the part-of-speech recognition object.
5. The machine learning based automatic variable classification system of claim 4, wherein the classification unit is specifically configured to:
when the variable feature words are regions or codes, adding the variable feature words into code blocks;
when the variable feature words are groupings, adding the variable feature words into a grouping block;
when the variable feature words are measurement units, adding the variable feature words into a measurement unit block;
and when the variable feature words are indexes, adding the variable feature words into a metering index block.
6. The machine learning based automatic variable classification system of claim 4 or 5, further comprising:
a cleaning unit, configured to cause the variable word stock to clean repeated records in each variable block according to a preset cleaning rule, so as to construct a standard variable word stock.
CN202011272803.4A 2020-11-13 2020-11-13 Automatic variable classification method and system based on machine learning Active CN112381143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011272803.4A CN112381143B (en) 2020-11-13 2020-11-13 Automatic variable classification method and system based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011272803.4A CN112381143B (en) 2020-11-13 2020-11-13 Automatic variable classification method and system based on machine learning

Publications (2)

Publication Number Publication Date
CN112381143A CN112381143A (en) 2021-02-19
CN112381143B true CN112381143B (en) 2023-12-05

Family

ID=74583933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011272803.4A Active CN112381143B (en) 2020-11-13 2020-11-13 Automatic variable classification method and system based on machine learning

Country Status (1)

Country Link
CN (1) CN112381143B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006031143A (en) * 2004-07-13 2006-02-02 Fuji Xerox Co Ltd Document analysis device, document analysis method, and computer program
CN106709032A (en) * 2016-12-29 2017-05-24 深圳市华傲数据技术有限公司 Method and device for extracting structured information from spreadsheet document
CN109710725A (en) * 2018-12-13 2019-05-03 中国科学院信息工程研究所 A kind of Chinese table column label restoration methods and system based on text classification
CN110728240A (en) * 2019-10-14 2020-01-24 北京华宇信息技术有限公司 Method and device for automatically identifying title of electronic file
CN110866217A (en) * 2019-10-24 2020-03-06 长城计算机软件与系统有限公司 Cross report recognition method and device, storage medium and electronic equipment
CN110929520A (en) * 2019-11-25 2020-03-27 北京明略软件系统有限公司 Non-named entity object extraction method and device, electronic equipment and storage medium
CN111291562A (en) * 2020-01-17 2020-06-16 中国石油集团安全环保技术研究院有限公司 Intelligent semantic recognition method based on HSE
KR102128852B1 (en) * 2020-03-30 2020-07-01 (주)위세아이텍 Device and method for visualizing key words of features extracted by applying principal component analysis to word vectors from text data

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160103823A1 (en) * 2014-10-10 2016-04-14 The Trustees Of Columbia University In The City Of New York Machine Learning Extraction of Free-Form Textual Rules and Provisions From Legal Documents
US11176364B2 (en) * 2019-03-19 2021-11-16 Hyland Software, Inc. Computing system for extraction of textual elements from a document

Also Published As

Publication number Publication date
CN112381143A (en) 2021-02-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100190 17-19 / F, building a 1, 66 Zhongguancun East Road, Haidian District, Beijing

Applicant after: New Great Wall Technology Co.,Ltd.

Address before: 100190 17-19 / F, building a 1, 66 Zhongguancun East Road, Haidian District, Beijing

Applicant before: GREAT WALL COMPUTER SOFTWARE & SYSTEMS Inc.

GR01 Patent grant