CN114359567A - Feature data extraction method and device - Google Patents

Feature data extraction method and device

Info

Publication number
CN114359567A
Authority
CN
China
Prior art keywords
feature
feature data
data
characteristic
definition
Prior art date
Legal status
Pending
Application number
CN202111674913.8A
Other languages
Chinese (zh)
Inventor
王飞
蔡伊林
Current Assignee
Guizhou Aixinnuo Aerospace Information Co ltd
Original Assignee
Guizhou Aixinnuo Aerospace Information Co ltd
Priority date
Filing date
Publication date
Application filed by Guizhou Aixinnuo Aerospace Information Co ltd filed Critical Guizhou Aixinnuo Aerospace Information Co ltd
Priority to CN202111674913.8A priority Critical patent/CN114359567A/en
Publication of CN114359567A publication Critical patent/CN114359567A/en
Pending legal-status Critical Current

Abstract

A method of feature data extraction, comprising: determining source information; determining the defined features of the feature data; extracting the feature data; verifying the validity of the actual features of the extracted feature data; judging whether there is an error between the actual features and the defined features; if an error exists, cleaning the feature data; locating the step that produced the error and optimizing the defined features; and determining the subsequent flow according to the settings. The method and device improve the extraction accuracy of feature data with high sensitivity requirements.

Description

Feature data extraction method and device
Technical Field
The invention belongs to the field of data processing and identification, and particularly relates to a feature data extraction method and device.
Background
When enterprise information is managed centrally, information such as enterprise names and organization codes must be accurate, yet when large volumes of information are entered there are problems such as inconsistent formats and error-prone manual entry. Although many character recognition schemes already exist, complex data acquired automatically still contains errors after processing. Methods are therefore needed that, each time an error occurs, sort and classify its causes so that the information can be reprocessed in complex scenarios and the same error is prevented from recurring.
There are many solutions that recognize copied text or pictures, recognize the text within them, and then identify feature data in the result, such as the patent application with application number 201710318767.2. However, for information managed in industries such as tax and finance, the recognized information is sensitive, the numeric codes are long, errors occur easily, and manual verification is difficult, so a mechanism is needed to improve recognition accuracy.
Disclosure of Invention
The invention aims to provide a feature data extraction method and device that improve the extraction accuracy of feature data with high sensitivity requirements by defining, confirming and continuously optimizing the defined features of the feature data in the source information.
In order to solve the above technical problem, the present invention provides a feature data extraction method comprising the following steps:
determining source information, wherein the source information refers to a large block of text from which feature data needs to be extracted;
determining the defined features of the feature data, wherein the defined features are a summary of the characteristics of the feature data content expected to be extracted, made before extraction from the source information;
extracting the feature data from the source information according to the defined features of the feature data, the features exhibited by the extracted feature data being actual features;
verifying the validity of the actual features of the extracted feature data, and judging whether there is an error between the actual features and the defined features;
if an error exists, cleaning the feature data, comparing the actual features of the extracted feature data with the defined features, locating the step that produced the error, and optimizing the defined features;
and after the defined features are optimized, determining a subsequent process according to the settings, wherein the subsequent process comprises outputting the feature data, re-determining the defined features, or re-extracting the feature data.
On the other hand, the invention also provides a feature data extraction device, which comprises a source identification unit for generating the source information from which feature data is to be extracted, the source identification unit including a module that recognizes pictures and converts them into text;
a defined feature unit for analyzing the defined features of the feature data, the defined feature unit comprising an intelligent defined feature module for summarizing defined features such as length and included content from existing feature data;
an information extraction unit for extracting feature data from the source information in combination with the defined feature unit;
an information verification unit that takes the feature data as input and outputs its accuracy after processing;
and an information cleaning unit for processing erroneous feature data, including deletion, storage and analysis.
Further, the feature data extraction device also comprises a manual access unit and a process definition unit, wherein the manual access unit comprises a manual feature definition module, a manual feature data correction module and a manual definition correction module, the manual feature definition module being applied in the defined feature unit;
and the process definition unit is used for determining the different subsequent processes of defined feature optimization in different scenarios.
The manual access unit can also be used to determine the purpose of a defined feature, for example whether a given defined feature is used to extract feature data from the source information or to verify the feature data.
With the feature data extraction method and device, the feature data is defined, extracted and corrected, the defined features and the overall process are continuously optimized, and data on scenarios and errors is accumulated and analyzed, which improves the extraction accuracy of sensitive feature data and the working efficiency.
Drawings
Fig. 1 is a schematic flow chart of a feature data extraction method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a specific process for determining defined features in a feature data extraction method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating another process of defining features in a feature data extraction method according to an embodiment of the present invention;
FIG. 4 is a flow chart of defined feature optimization in a feature data extraction method according to an embodiment of the present invention;
fig. 5 is a configuration diagram of a feature data extraction device according to an embodiment of the present invention.
Detailed Description
For better understanding of the objects, structure and functions of the present invention, a method and apparatus for extracting feature data according to the present invention will be described in detail with reference to the accompanying drawings.
The invention provides a method for extracting feature data from source information, as shown in fig. 1, the flow includes the following steps:
S100: determining source information, wherein the source information refers to a large block of text from which feature data needs to be extracted, such as a company profile from which a company name is to be extracted. In practical application scenarios the source may be text or a picture, so determining the source information includes converting it into a valid form: for example, after a picture is recognized as text, noise such as irregular characters and spaces is removed, producing source information whose format is unrestricted but whose content is valid.
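As an illustration of this denoising step, the following is a minimal sketch (not part of the patent disclosure; the function name and the exact noise characters removed are assumptions) that keeps line breaks, since they may later serve as separation marks:

```python
import re

def normalize_source(raw_text: str) -> str:
    """Collapse irregular whitespace and strip control characters, keeping line breaks."""
    cleaned = re.sub(r"[^\S\n]+", " ", raw_text)                 # collapse runs of spaces/tabs
    cleaned = re.sub(r"[\x00-\x09\x0b-\x1f\x7f]", "", cleaned)   # drop control chars except \n
    return cleaned.strip()
```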
The following examples are used in this embodiment:
Source information 1 is a whole block of text copied from a system:
Taxpayer identification number: 99887766554433221X, taxpayer name: Health Development Co., Ltd., address: No. 99, Floor 20, Building 7, Health Garden No. 1, Nanming District, Guiyang City, telephone 085166778899, account-opening bank: Bank of China, Guiyang Health Garden Branch, account number: 556677889
Source information 2 is the whole text recognized after scanning a picture from a financial information file provided by a member of staff:
billing information:
name of taxpayer: health development Co Ltd
Taxpayer identification number: 99887766554433221X
Address: no. 1 No. 20 layer No. 7 health garden X of Nanming district in Guiyang City
Telephone 0851-66778899
Account-opening bank: Bank of China, Guiyang Health Garden Branch
Account number: 556677889
The two sections of source information include the following extractable feature data: taxpayer identification number, taxpayer name, address, telephone, and account-opening bank.
S110: determining the defined features of the feature data, wherein the defined features are a summary of the characteristics of the feature data content expected to be extracted, made before extraction from the source information.
The content of the defined features includes:
the relative position of key characters of the feature data in the source information, such as the relative position, counted from the initial character of the source information, of a character or character string that serves to identify the feature data, or the relative position from a given marker in the source information;
the length of the key characters: for example, if the feature data is a telephone number, the key characters are the area code, whose length may be 3 or 4 digits;
the position of the key characters within the feature data, such as at the head, within the string, or at the tail of the feature data;
the length of the feature data, such as a telephone number being 11 digits long;
whether special key characters are excluded, such as the taxpayer identification number excluding I, O, Z, S and V.
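One possible way to record the defined-feature content listed above is sketched below; this is an assumption for illustration, not the patent's data model, and the field names are invented:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DefinedFeature:
    name: str                                # e.g. "taxpayer identification number"
    key_char_offset: Optional[int] = None    # relative position of the key characters in the source
    key_char_length: Optional[int] = None    # e.g. 3 or 4 digits for a telephone area code
    key_char_position: str = "head"          # head / middle / tail of the feature data
    data_length: Optional[int] = None        # e.g. 18 for a taxpayer identification number
    excluded_chars: frozenset = frozenset()  # e.g. I, O, Z, S, V
    usage: str = "extract"                   # "extract" or "verify" (the usage attribute)

# The taxpayer identification number of the embodiment.
TAXPAYER_ID = DefinedFeature(
    name="taxpayer identification number",
    data_length=18,
    excluded_chars=frozenset("IOZSV"),
    usage="verify",
)
```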
Before determining the defined features of the feature data, the characteristics of the source information are first analyzed and summarized. The source information contains a number of identical or different separation marks, such as full-width or half-width punctuation: colons (:), commas (,), spaces ( ), periods (.), line breaks, and so on. The feature data is not limited to a single group; there may be several groups of feature data whose names correspond to their contents. In this embodiment, the source information includes key information names, such as taxpayer identification number and taxpayer name, and feature data contents, such as 99887766554433221X and Health Development Co., Ltd.
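The analysis of the source information's separation marks might look like the following minimal sketch (an assumption; the candidate set of marks would be tuned per source):

```python
from collections import Counter

CANDIDATE_SEPARATORS = "：:，,。. \n"   # full-width and half-width colon, comma, period, space, line break

def detect_separators(source: str) -> Counter:
    """Count which candidate separation marks actually occur in the source information."""
    return Counter(ch for ch in source if ch in CANDIDATE_SEPARATORS)
```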
Taking "taxpayer identification number" as an example, when the order of the feature data in source information 1 is stable, the name and the content each need to be determined by defined features; the determination flow is shown in fig. 2:
S210: start position of the feature data name: the sixth character of the source information, i.e.:
feature_data.name.start_position = source_info.position(6)
S220: end position of the feature data name: in this embodiment, the content of the first delimiter after the start position is determined first; from that content, the first position of the delimiter after the start position of the feature data name in the source information is found, and the end position of the feature data name is then determined from that position and the length of the delimiter. The determination formulas are as follows:
str(feature_data.name.start_position, first_delimiter.start_position) = the character length from the start position of the feature data name to the first delimiter in the source information;
feature_data.name.end_position = first_delimiter.position;
Before this step is performed, the source information needs to be analyzed to determine the delimiters.
S230: starting position of feature data content: the formula is as follows:
content, start position, first delimiter, position + first delimiter, length + 1;
S240: the length of the feature data content; the content length may be determined by means of a separator, or, in special scenarios, it is fixed: for example ID-card information, or the taxpayer identification number in this embodiment, whose length is determined to be 18.
The defined features of the feature data also include whether particular characters are included or excluded; in this example the taxpayer identification number field does not contain I, O, Z, S or V. Here this feature is not used for feature data extraction, but it may be used for accuracy verification, i.e. step S250.
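Steps S210 to S240 for the fixed-order case might be realized as in the sketch below; this is an assumption for illustration rather than the patent's reference implementation, and the zero-based indexing here absorbs the "+ 1" of the one-based formulas above:

```python
from typing import Optional

DELIMITERS = ["：", ":", "，", ",", "。", ".", "\n"]  # assumed delimiter set

def extract_fixed_order(source: str, name_start: int,
                        content_length: Optional[int] = None) -> str:
    # S220: find the first delimiter after the known start position of the name.
    candidates = [(source.find(d, name_start), d)
                  for d in DELIMITERS if source.find(d, name_start) != -1]
    delim_pos, delim = min(candidates)
    # S230: the content starts immediately after that delimiter.
    content_start = delim_pos + len(delim)
    # S240: the content length is either fixed (e.g. 18) or bounded by the next delimiter.
    if content_length is not None:
        return source[content_start:content_start + content_length]
    next_positions = [source.find(d, content_start) for d in DELIMITERS]
    next_pos = min((p for p in next_positions if p != -1), default=len(source))
    return source[content_start:next_pos]
```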
The invention also provides a method of defining the features of the feature data when the order of the feature data within the source information is not fixed; the features can be determined by locating the feature data name together with the feature data content length or a separator. The flow is shown in fig. 3:
S310: determining the content of the defined feature, comprising:
the feature data name, such as "taxpayer identification number";
the feature data content length: 18;
the accuracy check mode: verification and review can be performed using data characteristics, such as the taxpayer identification number being 18 characters long and the field not containing I, O, Z, S or V.
S320: extracting the start position Loc_key_name of the feature data name, for example by searching for the position of "taxpayer identification number" in source information 2 and outputting the search result: 21;
S330: extracting the feature data name length Loc_key_length; the length of "taxpayer identification number" is 6;
S340: determining the content start position Loc_key_info corresponding to the feature data name: 21 + 6 + 1;
S341: determining the separation mark; the character extracted at Loc_key_info in source information 2 is ":";
S350: judging whether that character is a separation mark; if it is, Loc_key_info is corrected to Loc_key_info + 1; otherwise the next separation mark is taken, and the information between separation marks at the next position is extracted as the key information;
S360: locating the position Loc_next of the next separator: the character of the source information at Loc_key_info is taken (in source information 2 this character is "9"), and the position of the first separation mark after Loc_key_info is fetched; in source information 2 the next separation mark is a line break at position 47.
S370: extracting the feature data according to Loc_key_info and Loc_next; the information from position 21 + 6 + 1 + 1 to 47 is the key information content.
S390: outputting the extracted feature data in preparation for the next operation, such as an accuracy check.
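A minimal sketch of the Fig. 3 flow (S320 to S390) is given below, assuming single-character separation marks; it is illustrative only and not the patent's code:

```python
from typing import Optional

SEPARATION_MARKS = "：:，,。. \n"  # assumed set of separation marks

def extract_by_name(source: str, key_name: str,
                    content_length: Optional[int] = None) -> Optional[str]:
    loc_key_name = source.find(key_name)            # S320: start position of the name
    if loc_key_name == -1:
        return None
    loc_key_info = loc_key_name + len(key_name)     # S330/S340: tentative content start
    # S341/S350: if a separation mark follows the name, step over it.
    while loc_key_info < len(source) and source[loc_key_info] in SEPARATION_MARKS:
        loc_key_info += 1
    if content_length is not None:                  # fixed length, e.g. 18
        return source[loc_key_info:loc_key_info + content_length]
    # S360: position of the next separation mark (a line break in the embodiment).
    next_positions = [source.find(s, loc_key_info) for s in SEPARATION_MARKS]
    loc_next = min((p for p in next_positions if p != -1), default=len(source))
    return source[loc_key_info:loc_next]            # S370
```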
S120: extracting the feature data from the source information according to the defined features. Not all defined features need to be used during extraction, so the method also provides manual settings that rank the priority of the defined features used by the extraction algorithm or define their attributes, the attributes including purpose and the like. For example, in this embodiment the feature that the taxpayer identification number does not contain I, O, Z, S or V has the usage attribute of feature data checking, that is, it is not used during extraction but is used for the check in step S250 of fig. 2.
The features exhibited by the extracted feature data are its actual features.
S130: verifying the validity of the actual features of the extracted feature data and judging whether there is an error between the actual features and the defined features. The verification means include a validity check using the defined features whose usage attribute is verification, for example checking whether the extracted taxpayer identification number contains I, O, Z, S or V.
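The defined-feature check described here might be written as the following minimal sketch (an assumption for illustration):

```python
def check_taxpayer_id(value: str) -> bool:
    """Length must be 18 and none of I, O, Z, S, V may appear."""
    return len(value) == 18 and not set(value) & set("IOZSV")

# check_taxpayer_id("99887766554433221X") -> True
```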
The verification means further include verification through a third-party platform:
1. Interface calling and verification:
calling related information interfaces, for example the invoice information platform interface, to verify the taxpayer name and identification number;
acquiring the organization code: verifying the organization code, address and enterprise name through an enterprise information platform interface;
for example, in this embodiment the key information is used to obtain the taxpayer name from the invoice information platform; the taxpayer name is then searched for in the source information, and if it is present the verification is complete.
2. Manual verification platform: information that fails interface-platform verification, or is not covered by it, can be routed to the manual verification platform as required, in order to verify whether the key information was extracted accurately.
3. Internal information base: verified source information and key information are stored in a database, which is convenient for reuse and result comparison.
S140: if an error exists, the extracted feature data is inaccurate or several candidate pieces of feature data have been produced, and feature data cleaning can be carried out. The actual features of the extracted feature data are compared with the defined features, an error library is established from the errors between the defined features and the actual features, the correspondence between each error and the scenario in which it arose is stored, and the step that produced the error is located. When the defined features are accurate and the error is caused by insurmountable factors, the preparation flow adjusts the feature data, for example by screening the error library: the source information and key information that produced errors are imported into the error library, and the key information is screened according to the scenario environment.
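One possible shape for such an error library is sketched below; the table layout and column names are assumptions, not the patent's schema:

```python
import sqlite3

def init_error_library(path: str = "error_library.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS errors (
                        id INTEGER PRIMARY KEY AUTOINCREMENT,
                        source_info TEXT,
                        extracted_value TEXT,
                        defined_feature TEXT,
                        scene TEXT,
                        located_step TEXT)""")
    return conn

def record_error(conn, source_info, extracted_value, defined_feature, scene, located_step):
    """Store an error together with the scenario it arose in, for later screening."""
    conn.execute("INSERT INTO errors (source_info, extracted_value, defined_feature, scene, located_step) "
                 "VALUES (?, ?, ?, ?, ?)",
                 (source_info, extracted_value, defined_feature, scene, located_step))
    conn.commit()
```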
Data cleaning processes the feature data that has been generated while also optimizing the process that generates it, i.e. S150: optimizing the defined features, adjusting and optimizing the defined features determined in step S110 on the basis of the feature data cleaning.
The embodiment shown in fig. 3 searches for the feature data content by locating the feature data name plus a separation mark. This embodiment further provides a method of optimizing the defined features after the step that produced the error has been located, i.e. the erroneous step is traced back from the extracted feature data. As shown in fig. 4, once the error is confirmed the key information features are adjusted:
S410 and S420: first determine the corrected feature data info_key and the source information info_source;
S430 and S440: find the number num_key of occurrences of the feature data in the source information; in the next step, first determine the first position of the feature data in the source information and back-check that first occurrence;
S440 to S490: in this example the separation mark has length 1, so check whether the content of length (1 + feature name length) immediately before the position matches the expected key information name plus separation mark; if it does not match, the occurrence is judged abnormal;
the same backward judgment is then performed on the next position, decrementing num_key by 1.
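The back-check of Fig. 4 might be sketched as follows; this is an assumption for illustration, with a separation-mark length of 1 as in the example:

```python
SEPARATION_MARKS = "：:，,。. \n"  # assumed set of separation marks

def back_check(info_source: str, info_key: str, key_name: str) -> list:
    """Return positions of info_key occurrences not preceded by key_name + one separation mark."""
    anomalies = []
    pos = info_source.find(info_key)                       # first occurrence (S440)
    while pos != -1:                                       # one pass per occurrence (num_key)
        start = pos - len(key_name) - 1                    # 1 = separation-mark length in this example
        preceding = info_source[max(start, 0):pos]
        ok = (len(preceding) == len(key_name) + 1
              and preceding.startswith(key_name)
              and preceding[-1] in SEPARATION_MARKS)
        if not ok:                                         # name + separation mark not matched: abnormal
            anomalies.append(pos)
        pos = info_source.find(info_key, pos + 1)
    return anomalies
```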
After the defined features of each piece of key information have been optimized, the processing of the steps shown in fig. 1 can be carried out again, or the verification step can be redefined according to the rules.
S160: determining the subsequent process, i.e. defining what is done next according to the scenario and requirements. For example, in the current scenario the extracted feature data may contain errors, yet the result can still be output after manual cleaning and adjustment, and this round of feature data extraction is complete; if the defined features are correct and the defect lies in the source information, step S120 can be carried out again after the source information is adjusted; or the process may return to step S110 to re-determine the defined features.
The feature data extraction method provided by the invention defines, extracts and revises the feature data, continuously optimizes the defined features and the overall process, and accumulates and analyzes data on scenarios and errors, which improves the extraction accuracy of sensitive feature data and the working efficiency.
The invention also provides a feature data extraction device matched with the above feature data extraction method. The device mainly processes the source information, the defined features, the feature data and the process; its structure is shown in fig. 5 and mainly comprises the following unit modules:
The source identification unit comprises a picture recognition and conversion module that recognizes pictures as text, and a text pasting module into which large blocks of text from other sources can be pasted; after the text information is processed, source information that meets certain specifications and can be used for feature data extraction is generated.
the system comprises a definition characteristic unit, a definition characteristic module and a characteristic analysis module, wherein the definition characteristic unit is used for analyzing definition characteristics of characteristic data and comprises an intelligent definition characteristic module, the intelligent definition module is used for summarizing definition characteristics such as length, content and the like of the existing characteristic data according to the existing characteristic data, and when the number of the characteristics used for analysis reaches a certain number, an artificial learning algorithm can be adopted;
the feature defining unit also comprises a manual feature defining module which provides an interface for manually defining the defined features of the feature data and keeps the operation records.
The manual feature definition module is one part of the manual access unit, which further includes a manual definition correction module and a manual feature data correction module.
The manual definition correction module is used for defined feature optimization; compared with the manual feature definition module, its interface additionally records the reason for each correction, for example information about the factors behind the error between the defined features and the actual features and the back-traced steps. The manual definition correction module is therefore an extended version of the manual feature definition module.
The manual feature data correction module is used for manual intervention: it adjusts the extracted feature data and keeps a record of the operations.
The manual access unit may also add attributes to the defined features, such as a usage attribute and a scenario attribute, which are used to decide which defined features are needed when extracting feature data from the source information and when verifying the feature data. In this embodiment, for example, the length of the feature data may be used for extraction, while the characters it must not contain may be used for verification.
The above units prepare the data in the feature data extraction device. Once data preparation is complete, the device combines the source identification unit and the defined feature unit to process the feature data; the related units include:
an information extraction unit for extracting feature data from the source information in combination with the defined feature unit;
an information verification unit that takes the feature data as input and outputs its accuracy after processing;
an information cleaning unit for processing erroneous feature data, including deletion, storage and analysis, and for optimizing the defined features;
and a process definition unit for determining the different subsequent processes of defined feature optimization in different scenarios.
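How these units could be composed into one pipeline is sketched below as an assumption; the class and parameter names are invented and the units are treated as simple callables:

```python
class FeatureExtractionDevice:
    """Illustrative wiring of the units described above (not the patent's architecture)."""
    def __init__(self, source_unit, definition_unit, extraction_unit,
                 verification_unit, cleaning_unit, flow_unit):
        self.source_unit = source_unit            # picture recognition / text pasting
        self.definition_unit = definition_unit    # intelligent or manual defined features
        self.extraction_unit = extraction_unit    # information extraction
        self.verification_unit = verification_unit
        self.cleaning_unit = cleaning_unit
        self.flow_unit = flow_unit                # process definition

    def run(self, raw_input):
        source = self.source_unit(raw_input)
        features = self.definition_unit(source)
        data = self.extraction_unit(source, features)
        accurate = self.verification_unit(data)
        if not accurate:
            data, features = self.cleaning_unit(data, features)  # delete / store / analyse, optimize features
        return self.flow_unit(data, accurate)                    # decide the subsequent process
```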
The feature data extraction device provided by the invention deploys separate units for the source information, for the analysis and definition of the feature data to be extracted, and for the extracted feature data. It provides an interface for manual intervention and a process for repeatedly optimizing the algorithm when the feature data extraction algorithm is imperfect, continuously refining the scenarios, data and environments that produce errors, which improves the extraction accuracy of sensitive feature data and the working efficiency.
It is to be understood that the present invention has been described with reference to certain embodiments, and that various changes in the features and embodiments, or equivalent substitutions may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (10)

1. A method of feature data extraction, comprising:
determining source information, wherein the source information refers to a large block of text from which feature data needs to be extracted;
determining defined features of the feature data, wherein the defined features are a summary of the characteristics of the feature data content expected to be extracted, made before extraction from the source information;
extracting the feature data from the source information according to the defined features of the feature data, the features exhibited by the extracted feature data being actual features;
verifying the validity of the actual features of the extracted feature data and judging whether there is an error between the actual features and the defined features;
if an error exists, cleaning the feature data, comparing the actual features of the extracted feature data with the defined features, locating the step that produced the error, and optimizing the defined features;
and after the defined features are optimized, determining a subsequent process according to the settings, the subsequent process comprising outputting the feature data, re-determining the defined features, or re-extracting the feature data.
2. The feature data extraction method of claim 1, wherein a piece of source information includes one or more groups of feature data whose names correspond to their contents, and the feature data are separated by a plurality of different or identical separation marks;
the separation marks include punctuation marks, distinguishing full-width and half-width forms.
3. The feature data extraction method of claim 1, wherein the determination of the defined features comprises: the relative position of key characters of the feature data in the source information; the length of the key characters and the position of the key characters within the feature data; the length of the feature data; and whether special key characters are excluded.
4. The feature data extraction method of claim 1, wherein the feature data verification comprises: defined feature verification, manual verification, third-party platform interface calling verification, and internal information base verification.
5. The feature data extraction method of claim 4, wherein the feature data cleaning comprises: establishing an error library from the errors between the defined features and the actual features, storing the correspondence between each error and the scenario in which it arose, locating the step that produced the error, and optimizing the defined features.
6. The feature data extraction method of claim 5, wherein the defined feature optimization is a re-determination of the defined features based on the feature data cleaning;
the defined feature optimization includes modifying the feature data in a manual management platform.
7. The feature data extraction method of claim 6, wherein the determination of the subsequent process includes providing settings in the manual management platform.
8. An apparatus for feature data extraction, comprising:
a source identification unit for generating the source information from which feature data is to be extracted, comprising a picture recognition and conversion module that converts recognized pictures into text, and a text pasting module;
a defined feature unit for analyzing the defined features of the feature data, comprising an intelligent defined feature module for summarizing defined features such as length and included content from existing feature data;
an information extraction unit for extracting feature data from the source information in combination with the defined feature unit;
an information verification unit that takes the feature data as input and outputs its accuracy after processing;
and an information cleaning unit for processing erroneous feature data, including deletion, storage and analysis.
9. The feature data extraction device of claim 8, further comprising a manual access unit and a process definition unit,
wherein the manual access unit comprises a manual feature definition module, a manual feature data correction module and a manual definition correction module, the manual feature definition module being applied in the defined feature unit;
and the process definition unit is used for determining the different subsequent processes of defined feature optimization in different scenarios.
10. The feature data extraction device of claim 8, wherein the manual access unit is configured to determine the purpose of a defined feature, the purpose including extracting feature data from the source information and verifying the feature data.
CN202111674913.8A 2021-12-31 2021-12-31 Feature data extraction method and device Pending CN114359567A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111674913.8A CN114359567A (en) 2021-12-31 2021-12-31 Feature data extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111674913.8A CN114359567A (en) 2021-12-31 2021-12-31 Feature data extraction method and device

Publications (1)

Publication Number Publication Date
CN114359567A true CN114359567A (en) 2022-04-15

Family

ID=81105044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111674913.8A Pending CN114359567A (en) 2021-12-31 2021-12-31 Feature data extraction method and device

Country Status (1)

Country Link
CN (1) CN114359567A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination