CN112381143A - Variable automatic classification method and system based on machine learning - Google Patents

Publication number
CN112381143A
Authority
CN
China
Prior art keywords
variable
words
extracting
characteristic words
word
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011272803.4A
Other languages
Chinese (zh)
Other versions
CN112381143B (en)
Inventor
魏强
孙向学
张上亚
王臣亮
张学敬
翟迪
马静静
郁峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Great Wall Computer Software & Systems Inc
Original Assignee
Great Wall Computer Software & Systems Inc
Application filed by Great Wall Computer Software & Systems Inc
Priority to CN202011272803.4A
Publication of CN112381143A
Application granted
Publication of CN112381143B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a machine-learning-based method and system for automatically classifying variables, and relates to the technical field of information processing. The method comprises the following steps: acquiring a report to be processed, extracting the text information of the report, and storing it in a text object; extracting the text information from the text object, splitting it into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing them in a part-of-speech recognition object; extracting the variable feature words from the part-of-speech recognition object and comparing them with the variables in a variable lexicon to form classification rules for the feature words; and extracting the variable feature words into the corresponding variable blocks according to the classification rules. The method is suited to the automatic classification of variables in statistical reports and reduces the tedious variable identification work in the data statistics process.

Description

Variable automatic classification method and system based on machine learning
Technical Field
The invention relates to the technical field of information processing, in particular to a variable automatic classification method and system based on machine learning.
Background
At present, when the data of a statistical report is organized, the usual approach is to recognize the text in the main column and the guest column and to decide whether each variable there is an index (indicator) or a grouping item. Programs that identify variables in this way have a high error rate, so manual checking is required. That checking demands a high level of professional proficiency from the staff, and manual identification errors often occur.
Disclosure of Invention
The invention aims to solve the technical problem of the prior art and provides a variable automatic classification method and system based on machine learning.
The technical scheme for solving the technical problems is as follows:
a variable automatic classification method based on machine learning comprises the following steps:
acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;
extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable characteristic words from the words, and storing the extracted variable characteristic words in a part-of-speech recognition object;
extracting the variable characteristic words from the part-of-speech recognition object, and comparing the extracted variable characteristic words with variables in a variable word bank to form a classification rule for extracting the characteristic words;
and extracting the variable feature words into corresponding variable blocks according to the classification rules.
The automatic variable classification method provided by the invention is implemented based on machine learning and is suited to the automatic classification of variables in statistical reports. The text information of the report is extracted, part-of-speech recognition and feature extraction are performed in sequence, the results are compared with the variables in the variable lexicon to construct classification rules, and the variables are then classified automatically according to those rules. This reduces the tedious variable identification work in the data statistics process.
Further, the invention can be improved as follows:
acquiring the report to be processed, extracting the text information of the report, and storing the identified text information in a text object specifically comprises:
acquiring the report to be processed, identifying all filled content areas in the report, identifying the data in each cell, judging the data type of the data in each cell, and storing the identified data and data types in the text object.
The beneficial effect of adopting the further scheme is that: by identifying and processing the filled content area, the subsequent steps of part-of-speech identification, feature extraction and the like on the data can be conveniently carried out, so that the classification precision is improved.
Further, the invention can be improved as follows:
extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in a part-of-speech recognition object specifically comprises:
extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, determining whether each word is a noun, a verb, an adjective or a function word, removing the word if it is a function word, taking the remaining words as variable feature words, and storing the extracted variable feature words in the part-of-speech recognition object.
The beneficial effect of adopting the further scheme is that: using function words as feature words introduces considerable noise, which directly reduces the efficiency and accuracy of the subsequent variable classification. Therefore, when variable features are extracted, function words, which contribute little to classification, are removed, and words that are highly expressive for variable classification, such as content words, are used, which further improves the efficiency and accuracy of the subsequent variable classification.
Further, the invention can be improved as follows:
extracting the variable feature words into corresponding variable blocks according to the classification rules, which specifically comprises:
when the variable characteristic words are regions or codes, adding the variable characteristic words into a code block;
when the variable characteristic words are groupings, adding the variable characteristic words into a grouping block;
when the variable characteristic words are measuring units, adding the variable characteristic words into a measuring unit block;
and when the variable characteristic words are indexes, adding the variable characteristic words into a metering index block.
The beneficial effect of adopting the further scheme is that: by adding the variable feature words to different variable blocks according to the types of the variable feature words, accurate variable classification can be achieved.
Further, the invention can be improved as follows:
the automatic variable classification method based on machine learning further comprises the following steps:
cleaning the repeated records in each variable block of the variable word bank according to a preset cleaning rule to construct a standard variable word bank.
The beneficial effect of adopting the further scheme is that: cleaning the repeated records in the word bank blocks and building a versioned word bank for storing variables makes the word bank convenient for the subsequent automatic identification method to use.
Another technical solution of the present invention for solving the above technical problems is as follows:
a machine learning based variable automatic classification system comprising:
the acquisition unit is used for acquiring a report to be processed, extracting text information of the report and storing the identified text information in a text object;
the recognition unit is used for extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable characteristic words from the words, and storing the extracted variable characteristic words in a part-of-speech recognition object;
the matching unit is used for extracting the variable characteristic words from the part-of-speech recognition objects, comparing the extracted variable characteristic words with variables in a variable word bank and forming a classification rule for extracting the characteristic words;
and the classification unit is used for extracting the variable feature words into corresponding variable blocks according to the classification rules.
The automatic variable classification system provided by the invention is implemented based on machine learning and is suited to the automatic classification of variables in statistical reports. The text information of the report is extracted, part-of-speech recognition and feature extraction are performed in sequence, the results are compared with the variables in the variable lexicon to construct classification rules, and the variables are then classified automatically according to those rules. This reduces the tedious variable identification work in the data statistics process.
Further, the invention can be improved as follows:
the acquisition unit is specifically used for acquiring the report to be processed, identifying all filled content areas in the report, identifying the data in each cell and judging the data type of the data in each cell, and storing the identified data and the data type in the text object.
The beneficial effect of adopting the further scheme is that: by identifying and processing the filled content area, the subsequent steps of part-of-speech identification, feature extraction and the like on the data can be conveniently carried out, so that the classification precision is improved.
Further, the invention can be improved as follows:
the recognition unit is specifically configured to extract the text information from the text object, split the text information into words by using a preset word segmentation algorithm, determine whether each word is a noun, a verb, an adjective or a function word, remove the word if it is a function word, use the remaining words as variable feature words, and store the extracted variable feature words in a part-of-speech recognition object.
The beneficial effect of adopting the further scheme is that: using function words as feature words introduces considerable noise, which directly reduces the efficiency and accuracy of the subsequent variable classification. Therefore, when variable features are extracted, function words, which contribute little to classification, are removed, and words that are highly expressive for variable classification, such as content words, are used, which further improves the efficiency and accuracy of the subsequent variable classification.
Further, the invention can be improved as follows:
the classification unit is specifically configured to:
when the variable characteristic words are regions or codes, adding the variable characteristic words into a code block;
when the variable characteristic words are groupings, adding the variable characteristic words into a grouping block;
when the variable characteristic words are measuring units, adding the variable characteristic words into a measuring unit block;
and when the variable characteristic words are indexes, adding the variable characteristic words into a metering index block.
The beneficial effect of adopting the further scheme is that: by adding the variable feature words to different variable blocks according to the types of the variable feature words, accurate variable classification can be achieved.
Further, the invention can be improved as follows:
the machine learning based variable automatic classification system further comprises:
and the cleaning unit is used for cleaning the repeated records in each variable block of the variable word bank according to a preset cleaning rule to construct a standard variable word bank.
The beneficial effect of adopting the further scheme is that: cleaning the repeated records in the word bank blocks and building a versioned word bank for storing variables makes the word bank convenient for the subsequent automatic identification method to use.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a schematic flow chart diagram of a method for automatic classification of variables according to an embodiment of the present invention;
FIG. 2 is a data diagram of an embodiment of the present invention;
FIG. 3 is another data diagram in accordance with an embodiment of the present invention;
FIG. 4 is a structural framework diagram provided by an embodiment of the automatic variable classification system of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth to illustrate, but are not to be construed to limit the scope of the invention.
Fig. 1 is a schematic flow chart of an embodiment of the automatic variable classification method of the present invention. The method is implemented based on machine learning, is suitable for the automatic classification of variables in statistical reports, and includes:
S1, acquiring the report to be processed, extracting the text information of the report, and storing the identified text information in the text object;
It should be noted that the text information of the report may be the text in the main column and the guest column. Fig. 2 shows an exemplary report: the left part of the report is the main column, which records the data items of the report, and the top of the report is the guest column, which records the statistical manner of the corresponding data. The area where the main column and the guest column intersect is the filling area, which usually holds the filled-in data; which indexes a data item corresponds to is determined by the contents of the main column and the header.
S2, extracting text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable characteristic words from the words, and storing the extracted variable characteristic words in a part-of-speech recognition object;
It should be noted that the word segmentation algorithm splits the text into words and phrases and distinguishes nouns, verbs, adjectives, function words and so on, where function words may include interjections, prepositions, conjunctions and the like. This can be done with existing programs and is not described further here.
Using function words as feature words introduces considerable noise and so directly reduces the efficiency and accuracy of the subsequent variable classification. Therefore, when variable features are extracted, function words, which are of little use for classification, should be removed first. Among content words, nouns and verbs are the most expressive for variable classification, so preferably only nouns and verbs are extracted as the feature words of variables.
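The filtering step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the segmenter has already produced (word, tag) pairs, and the tag names, the `FUNCTION_TAGS` and `PREFERRED_TAGS` sets, and the function `extract_feature_words` are all hypothetical.

```python
# Hypothetical part-of-speech tags; a real segmenter supplies its own tag set.
FUNCTION_TAGS = {"preposition", "conjunction", "interjection", "particle"}
PREFERRED_TAGS = {"noun", "verb"}


def extract_feature_words(tagged_words, preferred_only=False):
    """Drop prepositions, conjunctions and similar words from (word, tag) pairs.

    With preferred_only=True, only nouns and verbs are kept, which mirrors
    the preferred variant described in the text.
    """
    features = []
    for word, tag in tagged_words:
        if tag in FUNCTION_TAGS:
            continue  # contributes noise rather than classification signal
        if preferred_only and tag not in PREFERRED_TAGS:
            continue
        features.append(word)
    return features
```

Keeping the tag sets as plain data makes it easy to swap in the tag names of whichever segmentation program is actually used.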
S3, extracting variable characteristic words from the part-of-speech recognition objects, and comparing the extracted variable characteristic words with variables in a variable word bank to form a classification rule for extracting the characteristic words;
It should be understood that if a feature word has no corresponding entry in the variable lexicon, the feature word is placed in the object to be processed.
It should be noted that, when a single feature word is compared with the variables in the variable lexicon, the check is whether the feature word exactly matches an element of the variable lexicon.
For a single feature word that is not identified in this way, several feature words can be combined and compared with the variable lexicon; the feature words are combined repeatedly, and each combination is matched against the elements of the variable lexicon.
For example, suppose two feature words are identified, "plan" and "total investment", and "plan" is not matched in the variable lexicon. Then "plan" and "total investment" can be combined into "plan total investment", and the combined word is matched again.
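The exact-match-then-combine procedure can be sketched as below. This is a minimal illustration under stated assumptions: the lexicon is modelled as a plain set of strings, combinations are formed only from consecutive feature words joined by a space, and the name `match_feature_words` is hypothetical. The source does not specify how combinations are enumerated.

```python
def match_feature_words(features, lexicon):
    """Return (matched, pending) after exact and combined matching.

    features: feature words in their original order.
    lexicon:  set of known variable names (assumed representation).
    """
    matched, pending = [], []
    i = 0
    while i < len(features):
        word = features[i]
        if word in lexicon:              # exact match on the single word
            matched.append(word)
            i += 1
            continue
        combined = word
        for j in range(i + 1, len(features)):
            combined = f"{combined} {features[j]}"
            if combined in lexicon:      # a longer combination matched
                matched.append(combined)
                i = j + 1
                break
        else:
            pending.append(word)         # goes to the object to be processed
            i += 1
    return matched, pending
```

With the lexicon `{"plan total investment"}`, the input `["plan", "total investment"]` yields the single matched entry "plan total investment" and nothing pending.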
Then, the matched and unmatched feature words are processed by intelligent recognition using natural language processing (NLP); for example, they can be divided into variable blocks such as { coding, grouping, region, measurement unit, index }, and the recognized feature words are extracted and classified according to the variable blocks.
And S4, extracting the variable feature words into corresponding variable blocks according to the classification rules.
For example, after the feature words are extracted, they are compared with the classification rules in the variable lexicon to form the following rules:
when the feature word is a region or a code, the feature word is added to the code block;
when the feature word is a grouping, the feature word is added to the grouping block;
when the feature word is a measurement unit, the feature word is added to the measurement unit block;
when the feature word is an index, the feature word is added to the metering index block.
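The four rules above amount to a lookup from category to variable block, sketched below. The layout of the lexicon (a dict from feature word to category), the category strings, and the fallback bucket name "to be processed" are assumptions made for illustration only.

```python
from collections import defaultdict

# Category -> variable block, following the four rules in the text.
BLOCK_FOR_CATEGORY = {
    "region": "code block",
    "code": "code block",
    "grouping": "grouping block",
    "measurement unit": "measurement unit block",
    "index": "metering index block",
}


def classify_into_blocks(feature_words, lexicon):
    """Place each feature word in its variable block.

    lexicon maps a feature word to its category (hypothetical layout);
    words with no known category land in a "to be processed" bucket.
    """
    blocks = defaultdict(list)
    for word in feature_words:
        category = lexicon.get(word)
        block = BLOCK_FOR_CATEGORY.get(category, "to be processed")
        blocks[block].append(word)
    return dict(blocks)
```

Regions and codes deliberately share one block, as the first rule states.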
The automatic variable classification method provided by this embodiment is implemented based on machine learning and is suited to the automatic classification of variables in statistical reports. The text information of the report is extracted, part-of-speech recognition and feature extraction are performed in sequence, the results are compared with the variables in the variable lexicon to construct classification rules, and the variables are then classified automatically according to those rules. This reduces the tedious variable identification work in the data statistics process.
Optionally, in some possible embodiments, the obtaining of the report to be processed, extracting the text information of the report, and storing the identified text information in the text object specifically includes:
the method comprises the steps of obtaining a report to be processed, identifying all filled content areas in the report, identifying data in each cell, judging the data type of the data in each cell, and storing the identified data and the data type in a text object.
Specifically, all filled content areas in the report can be identified, and the type of the data filled in each content area is judged one by one.
The following rules can be used for the judgment:
for example, the content before a colon can be deleted, such as "wherein:" and "in total:";
for example, the measurement unit in bracketed text can be identified, such as "ten thousand yuan", "hundred million yuan" and "ton";
for example, a report period in bracketed text can be identified by regular-expression matching;
for example, known unwanted remark information can be deleted, such as { one, two, three, four, five, one after another, two after another, three after another };
for example, other information can be identified: for a cell containing "(same period of the previous year = 100%)" or "(same period of the previous year = 100)", "%" is added as the measurement unit.
By identifying and processing the filled content area, the subsequent steps of part-of-speech identification, feature extraction and the like on the data can be conveniently carried out, so that the classification precision is improved.
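The preprocessing rules listed above can be sketched with the standard `re` module. Everything specific here is an assumption for illustration: the unit list, the ordinal-marker pattern, the report-period format ("2020" or "2020-06"), the "= 100" wording of the base-period note, and the function name `clean_cell`. Real reports would also need full-width colons and brackets handled.

```python
import re

# Hypothetical rule tables: the text gives only examples of each rule.
KNOWN_UNITS = {"ten thousand yuan", "hundred million yuan", "ton"}
ORDINAL_PREFIX = re.compile(r"^\s*(?:\(\d{1,2}\)|\d{1,2}[.)])\s*")
BRACKETED = re.compile(r"\(([^()]*)\)")
REPORT_PERIOD = re.compile(r"\d{4}(?:-\d{1,2})?")  # assumed: "2020" or "2020-06"
BASE_100 = re.compile(r"=\s*100\s*%?$")            # assumed: "... = 100"


def clean_cell(text: str) -> dict:
    """Apply the preprocessing rules to the text of one header cell."""
    found = {"unit": None, "period": None}

    def absorb(match):
        content = match.group(1).strip()
        if content in KNOWN_UNITS:
            found["unit"] = content
        elif REPORT_PERIOD.fullmatch(content):
            found["period"] = content
        elif BASE_100.search(content):
            found["unit"] = "%"
        else:
            return match.group(0)  # unrecognised brackets stay in the text
        return ""

    if ":" in text:                      # delete the content before the colon
        text = text.split(":", 1)[1]
    text = ORDINAL_PREFIX.sub("", text)  # delete a leading ordinal marker
    text = BRACKETED.sub(absorb, text)   # pull out unit, period or "%"
    return {"text": " ".join(text.split()), **found}
```

For instance, `clean_cell("wherein: Total investment (ten thousand yuan)")` leaves the text "Total investment" and records "ten thousand yuan" as the unit.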
Optionally, in some possible embodiments, extracting text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in the part-of-speech recognition object specifically includes:
extracting text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, determining whether each word is a noun, a verb, an adjective or a function word, removing the word if it is a function word, using the remaining words as variable feature words, and storing the extracted variable feature words in a part-of-speech recognition object.
Using function words as feature words introduces considerable noise, which directly reduces the efficiency and accuracy of the subsequent variable classification. Therefore, when variable features are extracted, function words, which contribute little to classification, are removed, and words that are highly expressive for variable classification, such as content words, are used, which further improves the efficiency and accuracy of the subsequent variable classification.
Preferably, only nouns and verbs are extracted as the feature words of variables.
Optionally, in some possible embodiments, extracting the variable feature words into corresponding variable blocks according to a classification rule specifically includes:
when the variable characteristic words are regions or codes, adding the variable characteristic words into the code blocks;
when the variable characteristic words are groupings, adding the variable characteristic words into the grouping block;
when the variable characteristic words are measuring units, adding the variable characteristic words into the measuring unit blocks;
and when the variable characteristic words are indexes, adding the variable characteristic words into the metering index block.
By adding the variable feature words to different variable blocks according to the types of the variable feature words, accurate variable classification can be achieved.
Optionally, in some possible embodiments, the method for automatically classifying variables based on machine learning further includes:
cleaning the repeated records in each variable block of the variable lexicon according to a preset cleaning rule to construct a standard variable lexicon.
Cleaning the repeated records in the lexicon blocks and building a versioned lexicon for storing variables makes the lexicon convenient for the subsequent automatic identification method to use.
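A cleaning pass of this kind can be sketched as follows. The source does not state the preset cleaning rule, so the rule used here (two records are duplicates if they are equal after collapsing whitespace and ignoring case) is purely an assumption, as are the dict-of-lists layout and the name `build_standard_lexicon`.

```python
def build_standard_lexicon(blocks):
    """Remove repeated records from each variable block.

    blocks maps a block name to a list of records (assumed layout).
    Assumed cleaning rule: records are duplicates when they are equal
    after whitespace normalisation and case folding. The first
    occurrence is kept, and the original order is preserved.
    """
    standard = {}
    for block_name, records in blocks.items():
        seen = set()
        cleaned = []
        for record in records:
            normalised = " ".join(record.split())
            key = normalised.casefold()
            if key and key not in seen:
                seen.add(key)
                cleaned.append(normalised)
        standard[block_name] = cleaned
    return standard
```

Because each block is cleaned separately, the same word may still appear in two different blocks, which matches the per-block wording of the text.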
A specific example description is given below in conjunction with fig. 2 and 3.
First, after a report is imported, all the text of the report and its fillable area are identified, and the main column and the guest column are determined. Parts of speech are recognized, and the text in the main column and the guest column is extracted and compared with the business lexicon to determine whether it is a variable, that is, an index, a grouping, a unit, and so on.
Second, the characteristics of the variables are examined to find the rules that some variables follow, such as:
"yuan" is recognized as: [measurement unit] currency;
"gas" is recognized as: energy variable, resources;
"electric power" is recognized as: energy variable, infrastructure, civilian.
The range of the corresponding variable is then found through the operation rule.
Third, according to how many variable feature words occur in a main-column entry such as "power, gas and water supply industry", the feature words are combined and matched against the indexes in the variable library to determine which variables (e.g., indexes, groups) the entry most probably corresponds to.
Finally, according to the other contents of the current report, the cells where the main column and the guest column intersect, that is, the data filling area, are examined to determine which indexes can be marked there, and the corresponding variable components are marked automatically in the data filling cells, as shown in fig. 3.
When the user fills in data, the filled-in numbers then carry the attributes of variables, such as indexes, groups and units.
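The marking of data cells can be pictured with a short sketch. It assumes that the variables recognized for each main-column entry and each guest-column entry are already available as dicts keyed by variable type; that representation, the precedence given to the main column when both supply the same type, and the name `annotate_cells` are all hypothetical.

```python
def annotate_cells(main_column, guest_column):
    """Attach variable attributes to every main/guest intersection cell.

    main_column and guest_column are lists of dicts that map a variable
    type ("index", "grouping", "unit") to the value recognised in that
    header entry. The result maps (row, column) to the merged attributes.
    """
    annotations = {}
    for r, row_vars in enumerate(main_column):
        for c, col_vars in enumerate(guest_column):
            # Assumption: on a conflict, the main-column value wins.
            annotations[(r, c)] = {**col_vars, **row_vars}
    return annotations
```

A number typed into cell (0, 0) would then inherit the grouping from its main-column entry and the index and unit from its guest-column entry.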
It is to be understood that some possible embodiments may include all or part of any of the embodiments described above.
Fig. 4 is a structural framework diagram of an embodiment of the automatic variable classification system of the present invention. The system is implemented based on machine learning, is suitable for the automatic classification of variables in statistical reports, and includes:
the acquisition unit 1 is used for acquiring a report to be processed, extracting text information of the report and storing the identified text information in a text object;
the recognition unit 2 is used for extracting text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable characteristic words from the words, and storing the extracted variable characteristic words in a part-of-speech recognition object;
the matching unit 3 is used for extracting variable characteristic words from the part-of-speech recognition objects, comparing the extracted variable characteristic words with variables in a variable word bank and forming a classification rule for extracting the characteristic words;
and the classification unit 4 is used for extracting the variable feature words into corresponding variable blocks according to the classification rules.
The automatic variable classification system provided by this embodiment is implemented based on machine learning and is suited to the automatic classification of variables in statistical reports. The text information of the report is extracted, part-of-speech recognition and feature extraction are performed in sequence, the results are compared with the variables in the variable lexicon to construct classification rules, and the variables are then classified automatically according to those rules. This reduces the tedious variable identification work in the data statistics process.
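How the four units chain together can be sketched as a small pipeline. This is only a structural illustration: the class name `VariableClassificationSystem`, the field names, and the choice to model each unit as a callable are assumptions, and the intermediate data types are placeholders for the text object, the part-of-speech recognition object and the classification rules.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class VariableClassificationSystem:
    """Chains the four units; each unit is supplied as a callable."""

    acquire: Callable[[Any], Any]    # acquisition unit 1: report -> text object
    recognize: Callable[[Any], Any]  # recognition unit 2: text object -> feature words
    match: Callable[[Any], Any]      # matching unit 3: feature words -> rules
    classify: Callable[[Any], Any]   # classification unit 4: rules -> variable blocks

    def run(self, report: Any) -> Any:
        text_object = self.acquire(report)
        pos_object = self.recognize(text_object)
        rules = self.match(pos_object)
        return self.classify(rules)
```

Passing the units in as callables keeps each stage replaceable, for example when a different word segmentation program is used in the recognition unit.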
Optionally, in some possible embodiments, the obtaining unit 1 is specifically configured to obtain a report to be processed, identify all filled content areas in the report, identify data in each cell and determine a data type of the data in each cell, and store the identified data and the data type in the text object.
By identifying and processing the filled content area, the subsequent steps of part-of-speech identification, feature extraction and the like on the data can be conveniently carried out, so that the classification precision is improved.
Optionally, in some possible embodiments, the recognition unit 2 is specifically configured to extract text information from the text object, split the text information into words by using a preset word segmentation algorithm, determine whether each word is a noun, a verb, an adjective, or a function word, remove the word if it is a function word, use the remaining words as variable feature words, and store the extracted variable feature words in the part-of-speech recognition object.
If function words were used as feature words, they would introduce considerable noise and directly reduce the efficiency and accuracy of the subsequent variable classification. Therefore, when the variable features are extracted, function words, which contribute little to classification, are removed, and content words, which are strongly indicative of the variable class, are retained, so that the efficiency and accuracy of the subsequent variable classification can be further improved.
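The filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the segmentation or tagging algorithm of the embodiment: whitespace splitting stands in for the preset word segmentation algorithm, a small hand-written lexicon (`POS_LEXICON`) stands in for a part-of-speech tagger, and words missing from the lexicon are assumed to be nouns. All names in the block are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy part-of-speech lexicon; a real system would use a tagger.
POS_LEXICON: Dict[str, str] = {
    "total": "adjective",
    "output": "noun",
    "of": "function",
    "the": "function",
    "region": "noun",
    "in": "function",
    "tons": "noun",
}


def segment(text: str) -> List[str]:
    """Stand-in for the preset word segmentation algorithm."""
    return text.lower().split()


def tag(words: List[str]) -> List[Tuple[str, str]]:
    """Attach a part of speech to each word; unknown words default to noun."""
    return [(w, POS_LEXICON.get(w, "noun")) for w in words]


def extract_feature_words(text: str) -> List[str]:
    """Drop function words and keep the remaining words as feature words."""
    return [w for w, pos in tag(segment(text)) if pos != "function"]


feature_words = extract_feature_words("Total output of the region in tons")
# feature_words == ["total", "output", "region", "tons"]
```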
Optionally, in some possible embodiments, the classification unit 4 is specifically configured to:
when the variable characteristic words are regions or codes, adding the variable characteristic words into the code blocks;
when the variable characteristic words are groups, adding the variable characteristic words into the grouping blocks;
when the variable characteristic words are measuring units, adding the variable characteristic words into the measuring unit blocks;
and when the variable characteristic words are indexes, adding the variable characteristic words into the metering index block.
By adding the variable feature words to different variable blocks according to the types of the variable feature words, accurate variable classification can be achieved.
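The four cases listed above amount to a lookup from the type of a feature word to a variable block, which can be sketched as below. The sketch assumes that each feature word already carries a type label; the labels, the block identifiers, and the function name `classify` are hypothetical, and words with an unrecognized type are simply left unclassified here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed mapping from feature word type to variable block.
BLOCK_FOR_TYPE: Dict[str, str] = {
    "region": "code_block",
    "code": "code_block",
    "group": "grouping_block",
    "unit": "measuring_unit_block",
    "index": "metering_index_block",
}


def classify(typed_words: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Add each (word, type) pair to the block that matches its type."""
    blocks: Dict[str, List[str]] = defaultdict(list)
    for word, word_type in typed_words:
        block = BLOCK_FOR_TYPE.get(word_type)
        if block is None:
            continue  # unknown type: leave the word unclassified
        blocks[block].append(word)
    return dict(blocks)


blocks = classify([
    ("Beijing", "region"),
    ("110000", "code"),
    ("industry", "group"),
    ("ton", "unit"),
    ("output", "index"),
    ("misc", "unknown"),
])
```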
Optionally, in some possible embodiments, the machine learning based variable automatic classification system further includes:
a cleaning unit, configured to clean the repeated records in each variable block of the variable word bank according to a preset cleaning rule, so as to construct a standard variable word bank.
Cleaning the repeated records in each block of the word bank and building a standard version of the stored variables makes the word bank convenient to use in subsequent automatic identification.
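A minimal sketch of such a cleaning step is given below. The embodiment does not disclose the preset cleaning rule; this example assumes that two records are duplicates when they are equal after trimming whitespace and ignoring case, and that the first occurrence is kept. The function name `clean_blocks` and the sample data are hypothetical.

```python
from typing import Dict, List


def clean_blocks(blocks: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Remove repeated and empty records from every variable block."""
    cleaned: Dict[str, List[str]] = {}
    for name, records in blocks.items():
        seen = set()
        unique: List[str] = []
        for record in records:
            key = record.strip().casefold()  # assumed duplicate criterion
            if not key or key in seen:
                continue
            seen.add(key)
            unique.append(record.strip())  # keep the first occurrence
        cleaned[name] = unique
    return cleaned


standard_word_bank = clean_blocks({
    "code_block": ["Beijing", " beijing", "Shanghai"],
    "measuring_unit_block": ["ton", "ton", ""],
})
```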
It is to be understood that some embodiments may include all or part of any of the above-described implementations, provided that they can be implemented.
It should be understood that the above embodiments are product embodiments corresponding to the method embodiments of the present invention, and the technical solutions of the two embodiments correspond, so that the detailed description of the product embodiments may refer to the above method embodiments, and will not be described herein again.
It is to be understood that the present invention may also provide a storage medium having stored therein instructions that, when read by a computer, cause the computer to execute the method for automatic classification of variables based on machine learning according to any of the embodiments described above.
It is to be understood that the present invention may also provide an electronic device comprising:
a memory for storing a computer program;
a processor for executing a computer program to implement the method for automatic classification of variables based on machine learning as described in any of the above embodiments.
The reader should understand that in the description of this specification, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples, and features of different embodiments or examples, described in this specification can be combined by one skilled in the art without contradiction.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described method embodiments are merely illustrative: the division into steps is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple steps may be combined or integrated into another step, or some features may be omitted or not performed.
If the above method is implemented in the form of software functional units and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A variable automatic classification method based on machine learning is characterized by comprising the following steps:
acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object;
extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable characteristic words from the words, and storing the extracted variable characteristic words in a part-of-speech recognition object;
extracting the variable characteristic words from the part-of-speech recognition object, and comparing the extracted variable characteristic words with variables in a variable word bank to form a classification rule for extracting the characteristic words;
and extracting the variable feature words into corresponding variable blocks according to the classification rules.
2. The machine learning-based variable automatic classification method according to claim 1, wherein acquiring a report to be processed, extracting text information of the report, and storing the identified text information in a text object specifically comprises:
acquiring a report to be processed, identifying all filled content areas in the report, identifying data in each cell, judging the data type of the data in each cell, and storing the identified data and the data type in a text object.
3. The machine learning-based variable automatic classification method according to claim 1, wherein extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, extracting variable feature words from the words, and storing the extracted variable feature words in a part-of-speech recognition object specifically comprises:
extracting the text information from the text object, splitting the text information into words by using a preset word segmentation algorithm, determining whether each word is a noun, a verb, an adjective or a function word, removing the word if it is a function word, taking the remaining words as variable characteristic words, and storing the extracted variable characteristic words in a part-of-speech recognition object.
4. The machine learning-based variable automatic classification method according to claim 1, wherein the extracting the variable feature words into corresponding variable blocks according to the classification rules specifically includes:
when the variable characteristic words are regions or codes, adding the variable characteristic words into a code block;
when the variable characteristic words are groups, adding the variable characteristic words into the grouping blocks;
when the variable characteristic words are measuring units, adding the variable characteristic words into a measuring unit block;
and when the variable characteristic words are indexes, adding the variable characteristic words into a metering index block.
5. The machine-learning based variable automatic classification method according to any one of claims 1 to 4, further comprising:
cleaning the repeated records in each variable block of the variable word bank according to a preset cleaning rule, so as to construct a standard variable word bank.
6. A machine learning based variable automatic classification system, comprising:
an acquisition unit, configured to acquire a report to be processed, extract text information of the report, and store the identified text information in a text object;
a recognition unit, configured to extract the text information from the text object, split the text information into words by using a preset word segmentation algorithm, extract variable characteristic words from the words, and store the extracted variable characteristic words in a part-of-speech recognition object;
a matching unit, configured to extract the variable characteristic words from the part-of-speech recognition object, and compare the extracted variable characteristic words with variables in a variable word bank to form a classification rule for extracting the characteristic words;
and a classification unit, configured to extract the variable feature words into corresponding variable blocks according to the classification rules.
7. The machine-learning-based variable automatic classification system according to claim 6, wherein the obtaining unit is specifically configured to obtain a report to be processed, identify all filled content areas in the report, identify data in each cell and determine a data type of the data in each cell, and store the identified data and the data type in a text object.
8. The machine-learning-based variable automatic classification system according to claim 6, wherein the recognition unit is specifically configured to extract the text information from the text object, split the text information into words by using a preset word segmentation algorithm, determine whether each word is a noun, a verb, an adjective or a function word, remove the word if it is a function word, use the remaining words as variable feature words, and store the extracted variable feature words in a part-of-speech recognition object.
9. The machine-learning based variable automatic classification system according to claim 6, characterized in that the classification unit is specifically configured to:
when the variable characteristic words are regions or codes, adding the variable characteristic words into a code block;
when the variable characteristic words are groups, adding the variable characteristic words into the grouping blocks;
when the variable characteristic words are measuring units, adding the variable characteristic words into a measuring unit block;
and when the variable characteristic words are indexes, adding the variable characteristic words into a metering index block.
10. The machine-learning based variable automatic classification system according to any of claims 6 to 9, further comprising:
a cleaning unit, configured to clean the repeated records in each variable block of the variable word bank according to a preset cleaning rule, so as to construct a standard variable word bank.
CN202011272803.4A 2020-11-13 2020-11-13 Automatic variable classification method and system based on machine learning Active CN112381143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011272803.4A CN112381143B (en) 2020-11-13 2020-11-13 Automatic variable classification method and system based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011272803.4A CN112381143B (en) 2020-11-13 2020-11-13 Automatic variable classification method and system based on machine learning

Publications (2)

Publication Number Publication Date
CN112381143A (en) 2021-02-19
CN112381143B (en) 2023-12-05

Family

ID=74583933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011272803.4A Active CN112381143B (en) 2020-11-13 2020-11-13 Automatic variable classification method and system based on machine learning

Country Status (1)

Country Link
CN (1) CN112381143B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006031143A (en) * 2004-07-13 2006-02-02 Fuji Xerox Co Ltd Document analysis device, document analysis method, and computer program
US20160103823A1 (en) * 2014-10-10 2016-04-14 The Trustees Of Columbia University In The City Of New York Machine Learning Extraction of Free-Form Textual Rules and Provisions From Legal Documents
CN106709032A (en) * 2016-12-29 2017-05-24 深圳市华傲数据技术有限公司 Method and device for extracting structured information from spreadsheet document
CN109710725A (en) * 2018-12-13 2019-05-03 中国科学院信息工程研究所 A kind of Chinese table column label restoration methods and system based on text classification
CN110728240A (en) * 2019-10-14 2020-01-24 北京华宇信息技术有限公司 Method and device for automatically identifying title of electronic file
CN110866217A (en) * 2019-10-24 2020-03-06 长城计算机软件与系统有限公司 Cross report recognition method and device, storage medium and electronic equipment
CN110929520A (en) * 2019-11-25 2020-03-27 北京明略软件系统有限公司 Non-named entity object extraction method and device, electronic equipment and storage medium
CN111291562A (en) * 2020-01-17 2020-06-16 中国石油集团安全环保技术研究院有限公司 Intelligent semantic recognition method based on HSE
KR102128852B1 (en) * 2020-03-30 2020-07-01 (주)위세아이텍 Device and method for visualizing key words of features extracted by applying principal component analysis to word vectors from text data
US20200302166A1 (en) * 2019-03-19 2020-09-24 Hyland Software, Inc. Computing system for extraction of textual elements from a document

Also Published As

Publication number Publication date
CN112381143B (en) 2023-12-05

Similar Documents

Publication Publication Date Title
CN107844559A (en) A kind of file classifying method, device and electronic equipment
CN106776574B (en) User comment text mining method and device
Vivaldi et al. Improving term extraction by system combination using boosting
CN112883734B (en) Block chain security event public opinion monitoring method and system
CN112395395B (en) Text keyword extraction method, device, equipment and storage medium
CN110555206A (en) named entity identification method, device, equipment and storage medium
CN111563384A (en) Evaluation object identification method and device for E-commerce products and storage medium
KR20150037924A (en) Information classification based on product recognition
CN110321466A (en) A kind of security information duplicate checking method and system based on semantic analysis
CN111191051B (en) Method and system for constructing emergency knowledge map based on Chinese word segmentation technology
CN110134777A (en) Problem De-weight method, device, electronic equipment and computer readable storage medium
CN114880635A (en) User security level identification method, system, electronic device and medium of model integrated with lifting tree construction
CN114610838A (en) Text emotion analysis method, device and equipment and storage medium
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN111274390A (en) Emotional reason determining method and device based on dialogue data
CN113065329A (en) Data processing method and device
CN116629258B (en) Structured analysis method and system for judicial document based on complex information item data
CN112381143B (en) Automatic variable classification method and system based on machine learning
CN113761137A (en) Method and device for extracting address information
JPH06282587A (en) Automatic classifying method and device for document and dictionary preparing method and device for classification
Prieto et al. Text content based layout analysis
CN117743558A (en) Knowledge processing and knowledge question-answering method, device and medium based on large model
CN117291192A (en) Government affair text semantic understanding analysis method and system
CN116956930A (en) Short text information extraction method and system integrating rules and learning models
CN114385894B (en) Dictionary-based public opinion monitoring method and device

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 100190 17-19 / F, building a 1, 66 Zhongguancun East Road, Haidian District, Beijing
Applicant after: New Great Wall Technology Co.,Ltd.
Address before: 100190 17-19 / F, building a 1, 66 Zhongguancun East Road, Haidian District, Beijing
Applicant before: GREAT WALL COMPUTER SOFTWARE & SYSTEMS Inc.
GR01 Patent grant