CN110515999A - General record processing method, device, electronic equipment and storage medium - Google Patents

Publication number
CN110515999A
Authority
CN
China
Prior art keywords
field
record
original
matched
table header
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910799571.9A
Other languages
Chinese (zh)
Inventor
张亦鹏
安思宇
刘明浩
姚荣洁
郭江亮
李旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910799571.9A
Publication of CN110515999A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468 Fuzzy queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database

Abstract

This application discloses a general record processing method, a device, electronic equipment and a storage medium, and relates to the field of cloud computing technology. A specific implementation is as follows: identify the table header pattern of an original record; extract record rows from the original record based on the header pattern; match the original fields in the original record against preset standard fields; and, in the extracted record rows, replace each original field with the standard field it matched, generating general record text. Embodiments of the application can parse records automatically and standardize record fields, providing record data in a unified standard format, which can substantially improve processing efficiency and save human resources.

Description

General record processing method, device, electronic equipment and storage medium
Technical field
This application relates to the field of computer technology, and in particular to the technical field of information processing.
Background
The data on current application platforms usually comes from different data sources, so data compatibility is poor and the data is hard to unify. For example, current bulk commodity trading platforms commonly provide standard forms for the supply data of different categories of goods and strictly require users to fill in the forms according to specification, in order to obtain standardized supply data. The platform wants to obtain suppliers' supply data in a relatively standard way, but every supplier has its own data template, which makes unification difficult. For users of the platform, the original supply data is stored in a particular format, and converting or standardizing it again is costly, which raises the threshold for using the platform. If the trading platform takes on the work instead, and assigns staff to convert data formats manually or to write a format conversion tool for each new supply data template, it introduces a large amount of repetitive work and wastes manpower. In summary, the traditional approach of building a data template lookup table for each data source imposes a huge workload on business personnel and suffers from repetitive work and low efficiency.
Summary of the invention
Embodiments of the application propose a general record processing method, a device, electronic equipment and a storage medium, to solve at least the above technical problems in the prior art.
In a first aspect, an embodiment of the application provides a general record processing method, comprising:
Identifying the table header pattern of an original record;
Extracting record rows from the original record based on the header pattern;
Matching the original fields in the original record against preset standard fields;
In the extracted record rows, replacing each corresponding original field with the successfully matched standard field, to generate general record text.
In embodiments of the application, records can be parsed automatically and record fields standardized, providing record data in a unified standard format, which can substantially improve processing efficiency and save human resources.
In one embodiment, identifying the header pattern of the original record comprises:
Determining the header row range of the original record;
Within the header row range, matching the target fields in each record row against preset header keywords;
When all target fields in a record row match their corresponding header keywords, determining that the record row is an exact match;
Using the exactly matched record row as the header row.
In embodiments of the application, identifying the header pattern by exact matching is an important link in batch positioning of table records, and provides the positioning basis for the subsequent step of extracting records.
In one embodiment, the method further comprises:
When no record row in the header row range is an exact match, calculating a first mixed matching-degree index, which is the mixed matching-degree index between the target fields in each record row and the preset header keywords;
When the first mixed matching-degree index between every target field in a record row and its corresponding header keyword is greater than a first preset threshold, determining that the record row is a successful fuzzy match;
Using the fuzzily matched record row as the header row.
In embodiments of the application, identifying the header pattern by fuzzy matching improves fault tolerance, so that good recognition results can be achieved even for poorly standardized data.
In one embodiment, extracting record rows from the original record based on the header pattern comprises:
Using the distribution of the column numbers of the valid column data in the original record as a record rule;
Extracting record rows from the original record according to the record rule and the header pattern.
In embodiments of the application, table records are positioned in batches based on the header pattern and the record rule. Performing the subsequent text standardization on this basis ensures the validity and standardization of the data and improves processing efficiency.
In one embodiment, matching the original fields in the original record against preset standard fields comprises:
Writing the history of successful matches between original fields and standard fields into a cache;
If the original field currently to be matched matches an original field in the history records, determining that the original field currently to be matched and the standard field are matched successfully through the cache.
In embodiments of the application, matching original fields to standard fields through cached data increases processing speed and improves system performance.
In one embodiment, the method further comprises:
When the original field currently to be matched and the standard fields are not matched through the cache, matching the original field currently to be matched against the standard fields in a preset field value set;
If the original field currently to be matched matches a standard field in the preset field value set, determining that the original field currently to be matched and the standard field are matched successfully through the field value set.
In embodiments of the application, matching original fields to standard fields through the field value set ensures matching accuracy, which in turn improves the accuracy of the generated data.
In one embodiment, the method further comprises:
When the original field currently to be matched and the standard fields are not matched through the field value set, matching the original field currently to be matched against the aliases of the standard fields in a preset rule base, where the rule base stores the mapping relations between standard fields and their aliases;
If the original field currently to be matched matches an alias of a standard field in the preset rule base, determining that the original field currently to be matched and the standard field are matched successfully through the rule base.
In embodiments of the application, original fields are matched to standard fields through the rule base. Because the rule base stores the mapping relations between standard fields and their aliases, the data becomes compatible and the processing capability of the system improves.
In one embodiment, the method further comprises:
When the original field currently to be matched and the standard fields are not matched through the rule base, calculating a second mixed matching-degree index, which is the mixed matching-degree index between the original field currently to be matched and the standard fields in the field value set;
When the second mixed matching-degree index is greater than or equal to a second preset threshold, determining that the original field currently to be matched and the standard field are a successful fuzzy match.
In embodiments of the application, fuzzy matching with the mixed matching-degree index improves fault tolerance, so that a good data standardization result can be achieved even for poorly standardized data.
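The four matching tiers described above (cache, field value set, rule base, fuzzy match) can be sketched as a single fallback chain. This is a minimal sketch under stated assumptions, not the implementation of the application: the function name `match_field`, the dictionary and set layouts, the default threshold of 0.8, and the use of `difflib.SequenceMatcher` as a stand-in for the mixed matching-degree index are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def _default_score(a, b):
    # Stand-in similarity (assumption); not the source's mixed matching-degree index.
    return SequenceMatcher(None, a, b).ratio()


def match_field(raw, cache, value_set, alias_rules, score=_default_score, threshold=0.8):
    """Return (standard_field, tier) for a raw field name, or (None, None)."""
    # Tier 1: history of successful matches kept in a cache.
    if raw in cache:
        return cache[raw], "cache"

    if raw in value_set:
        # Tier 2: the raw name equals a standard field in the value set.
        result = (raw, "value_set")
    elif raw in alias_rules:
        # Tier 3: the raw name is a known alias of a standard field.
        result = (alias_rules[raw], "rule_base")
    else:
        # Tier 4: fuzzy match against the value set, accepted at >= threshold.
        best = max(value_set, key=lambda std: score(raw, std), default=None)
        if best is None or score(raw, best) < threshold:
            return None, None
        result = (best, "fuzzy")

    cache[raw] = result[0]  # record the successful match for later lookups
    return result


cache = {}
value_set = {"price", "product name"}
alias_rules = {"unit price": "price"}
print(match_field("unit price", cache, value_set, alias_rules))
print(match_field("unit price", cache, value_set, alias_rules))
```

Ordering the tiers from cheapest to most permissive means that the fuzzy comparison, which scans the whole value set, only runs for names that none of the exact lookups recognize.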
In a second aspect, an embodiment of the application provides a general record processing device, comprising:
A recognition unit, configured to identify the table header pattern of an original record;
An extraction unit, configured to extract record rows from the original record based on the header pattern;
A matching unit, configured to match the original fields in the original record against preset standard fields;
A generation unit, configured to replace, in the extracted record rows, each corresponding original field with the successfully matched standard field, to generate general record text.
In one embodiment, the recognition unit includes a first recognition subunit, which is configured to:
Determine the header row range of the original record;
Within the header row range, match the target fields in each record row against preset header keywords;
When all target fields in a record row match their corresponding header keywords, determine that the record row is an exact match;
Use the exactly matched record row as the header row.
In one embodiment, the recognition unit further includes a second recognition subunit, which is configured to:
When no record row in the header row range is an exact match, calculate a first mixed matching-degree index, which is the mixed matching-degree index between the target fields in each record row and the preset header keywords;
When the first mixed matching-degree index between every target field in a record row and its corresponding header keyword is greater than a first preset threshold, determine that the record row is a successful fuzzy match;
Use the fuzzily matched record row as the header row.
In one embodiment, the extraction unit is configured to:
Use the distribution of the column numbers of the valid column data in the original record as a record rule;
Extract record rows from the original record according to the record rule and the header pattern.
In one embodiment, the matching unit includes a first matching subunit, which is configured to:
Write the history of successful matches between original fields and standard fields into a cache;
If the original field currently to be matched matches an original field in the history records, determine that the original field currently to be matched and the standard field are matched successfully through the cache.
In one embodiment, the matching unit further includes a second matching subunit, which is configured to:
When the original field currently to be matched and the standard fields are not matched through the cache, match the original field currently to be matched against the standard fields in a preset field value set;
If the original field currently to be matched matches a standard field in the preset field value set, determine that the original field currently to be matched and the standard field are matched successfully through the field value set.
In one embodiment, the matching unit further includes a third matching subunit, which is configured to:
When the original field currently to be matched and the standard fields are not matched through the field value set, match the original field currently to be matched against the aliases of the standard fields in a preset rule base, where the rule base stores the mapping relations between standard fields and their aliases;
If the original field currently to be matched matches an alias of a standard field in the preset rule base, determine that the original field currently to be matched and the standard field are matched successfully through the rule base.
In one embodiment, the matching unit further includes a fourth matching subunit, which is configured to:
When the original field currently to be matched and the standard fields are not matched through the rule base, calculate a second mixed matching-degree index, which is the mixed matching-degree index between the original field currently to be matched and the standard fields in the field value set;
When the second mixed matching-degree index is greater than or equal to a second preset threshold, determine that the original field currently to be matched and the standard field are a successful fuzzy match.
In a third aspect, an embodiment of the application provides electronic equipment, comprising:
At least one processor; and
A memory communicatively connected to the at least one processor; wherein
The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method provided by any embodiment of the application.
In a fourth aspect, an embodiment of the application provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to perform the method provided by any embodiment of the application.
An embodiment of the above application has the following advantages or beneficial effects: records can be parsed automatically and record fields standardized, providing record data in a unified standard format, which can substantially improve processing efficiency and save human resources.
Other effects of the above optional implementations are described below in connection with specific embodiments.
Brief description of the drawings
The drawings are provided for a better understanding of the solution and do not limit the application. In the drawings:
Fig. 1 is a flow chart of the general record processing method according to an embodiment of the application;
Fig. 2 is a flow chart of identifying the header pattern in the general record processing method according to an embodiment of the application;
Fig. 3 is a flow chart of identifying the header pattern in the general record processing method according to an embodiment of the application;
Fig. 4 is a flow chart of extracting record rows in the general record processing method according to an embodiment of the application;
Fig. 5 is a flow chart of matching in the general record processing method according to an embodiment of the application;
Fig. 6 is a flow chart of matching in the general record processing method according to an embodiment of the application;
Fig. 7 is a flow chart of matching in the general record processing method according to an embodiment of the application;
Fig. 8 is a flow chart of matching in the general record processing method according to an embodiment of the application;
Fig. 9A is a schematic diagram of a knowledge graph for the general record processing method according to an embodiment of the application;
Fig. 9B is a schematic diagram of a knowledge graph for the general record processing method according to an embodiment of the application;
Fig. 10 is a diagram of the module design and data flow of the general record processing method according to an embodiment of the application;
Fig. 11 is a structural schematic diagram of the general record processing device according to an embodiment of the application;
Fig. 12 is a structural schematic diagram of the general record processing device according to an embodiment of the application;
Fig. 13 is a structural schematic diagram of the general record processing device according to an embodiment of the application;
Fig. 14 is a block diagram of the electronic equipment used to implement the general record processing method of an embodiment of the application.
Detailed description
Exemplary embodiments of the application are explained below with reference to the drawings, including various details of the embodiments to aid understanding; these should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described here without departing from the scope and spirit of the application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Fig. 1 is a schematic diagram of the general record processing method according to an embodiment of the application. The general record processing method includes:
Step S110: identify the table header pattern of the original record;
Step S120: extract record rows from the original record based on the header pattern;
Step S130: match the original fields in the original record against preset standard fields;
Step S140: in the extracted record rows, replace each corresponding original field with the successfully matched standard field, to generate general record text.
In general, original records may come from different data sources. For example, on a bulk commodity trading platform the data may come from different suppliers, each of which has its own data template, so data formats and forms of expression are hard to unify. Taking field names as an example, the field that stores a commodity's price may be called "price" or may be called "unit price". Data from different sources is therefore poorly compatible. These irregularly formatted, non-standard data need to be standardized and normalized to generate general record text that is convenient to process and exchange. The record processing method of the embodiments is suitable for scenarios in which target records are extracted from spreadsheets of unrestricted layout, and for the subsequent scenario in which non-standard record field values are normalized to items of a standard value set (Value Set).
Before a non-standard original record is normalized, its header pattern must first be identified. In step S110, the header row in the original record is identified. Identifying the header pattern is the basis for the later record extraction. In step S120, record rows are extracted from the original record based on the identified header pattern. The search can start at the identified header of the original record to be processed and proceed downward until the search ends or another header row is reached. The data of the record rows found in this way is extracted.
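The downward search of step S120 can be sketched as follows. This is an illustrative sketch only: the function name `extract_record_rows`, the list-of-lists sheet layout, and the `is_header_row` callback are assumptions made for the example and do not come from the application.

```python
def extract_record_rows(rows, header_index, is_header_row):
    """Collect the rows below rows[header_index] until the sheet ends
    or another header row is reached."""
    records = []
    for row in rows[header_index + 1:]:
        if is_header_row(row):
            break  # a second header starts a different block of records
        records.append(row)
    return records


# Hypothetical sheet containing two stacked tables.
sheet = [
    ["Product name", "Price"],
    ["coil A", "180"],
    ["coil B", "160"],
    ["Product name", "Unit price"],
    ["coil C", "150"],
]
print(extract_record_rows(sheet, 0, lambda row: row[0] == "Product name"))
```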
To standardize and normalize records, a standard field value set database can be preset. The field value set stores the names of the standard fields. The name of a standard field may also be called a "standard field value" or a "value set item". For example, the field that stores a commodity's price in an original record may be called "price" or may be called "unit price". The field value set stores the standard field names that result from normalizing such fields. For instance, original fields such as "price" and "unit price" are unified into the standard field "price" after normalization.
In step S130, the original fields in the original record are matched against the preset standard fields; that is, the names of the original fields are matched against the names of the standard fields. For example, the original field name "unit price" is compared one by one with the standard field names stored in the field value set, to identify the standard field that corresponds to the original field. In Tables 1 to 4 below, Table 1 and Table 2 are examples of original record tables, Table 3 is the general record table obtained by normalizing Table 1, and Table 4 is the general record table obtained by normalizing Table 2. The original field "stainless steel composite cold-rolled coil" in Table 1 corresponds to the standard field name "cold-rolled strip coil" in the field value set; the original field "aluminum-zinc coated anti-fingerprint coil" in Table 2 also corresponds to the standard field name "cold-rolled strip coil"; and the original field "unit price" in Table 2 corresponds to the standard field name "price".
Table 1: Original record table, example 1
Record row 1 | Product name | Price
Record row 2 | Stainless steel composite cold-rolled coil | 180
Table 2: Original record table, example 2
Record row 1 | Product name | Unit price
Record row 2 | Aluminum-zinc coated anti-fingerprint coil | 160
Table 3: General record table, example 1
Record row 1 | Product name | Price
Record row 2 | Cold-rolled strip coil | 180
Table 4: General record table, example 2
Record row 1 | Product name | Price
Record row 2 | Cold-rolled strip coil | 160
In step S140, in the extracted record rows, each corresponding original field is replaced with the successfully matched standard field. As shown in Tables 1 to 4, "stainless steel composite cold-rolled coil" and "aluminum-zinc coated anti-fingerprint coil" are replaced with "cold-rolled strip coil", and "unit price" is replaced with "price", generating the general record text.
In embodiments of the application, records can be parsed automatically and record fields standardized, providing record data in a unified standard format, which can substantially improve processing efficiency and save human resources.
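The replacement in step S140 can be illustrated with the data of Table 2. This is a sketch under assumptions: the mapping `STANDARD_NAMES` stands for the result of a matching step that has already succeeded, the product names are illustrative English renderings of the table entries, and the function name `to_generic` is hypothetical.

```python
# Assumed outcome of step S130: original name -> matched standard name.
STANDARD_NAMES = {
    "Unit price": "Price",
    "Stainless steel composite cold-rolled coil": "Cold-rolled strip coil",
    "Aluminum-zinc coated anti-fingerprint coil": "Cold-rolled strip coil",
}


def to_generic(rows, mapping):
    """Replace every cell that has a matched standard name; keep other cells as-is."""
    return [[mapping.get(cell, cell) for cell in row] for row in rows]


table_2 = [
    ["Product name", "Unit price"],
    ["Aluminum-zinc coated anti-fingerprint coil", "160"],
]
for row in to_generic(table_2, STANDARD_NAMES):
    print(" | ".join(row))
```

Cells without a matched standard name, such as the numeric value, pass through unchanged, so only fields that were actually matched are rewritten.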
Fig. 2 is a flow chart of identifying the header pattern in the general record processing method according to an embodiment of the application. As shown in Fig. 2, in one embodiment, step S110 in Fig. 1, identifying the header pattern of the original record, includes:
Step S210: determine the header row range of the original record;
Step S220: within the header row range, match the target fields in each record row against preset header keywords;
Step S230: when all target fields in a record row match their corresponding header keywords, determine that the record row is an exact match;
Step S240: use the exactly matched record row as the header row.
Before the original record is standardized and normalized, its header pattern must first be identified; that is, the position of the header row must be located. In step S210, the header row range of the content of the original record sheet (table) is determined first. In one example, the header row range can be calculated with the following expression: max(40, 20% * total number of rows in the sheet). Here max is the function that selects the maximum; "20% * total number of rows in the sheet" denotes 20% of the table's total rows, and the position of these rows within the table can be specified, usually the top 20% of the table. The expression evaluates to the larger of "20% * total number of rows in the sheet" and 40. For example, if the sheet has 100 rows in total, then 20% of the total is 20 rows; the larger of 20 and 40 is 40, so the value of max(40, 20% * total number of rows in the sheet) is 40.
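The expression max(40, 20% * total number of rows in the sheet) can be written directly as a small function. The function name `header_row_range` and the choice to round the 20% figure down to a whole number of rows are assumptions for illustration; the text does not say how fractional row counts are handled.

```python
def header_row_range(total_rows):
    """Number of leading rows to scan for a header: max(40, 20% of all rows)."""
    twenty_percent = total_rows * 20 // 100  # rounded down (assumption)
    return max(40, twenty_percent)


print(header_row_range(100))  # 20% of 100 is 20, so the floor of 40 applies
print(header_row_range(500))  # 20% of 500 is 100, which exceeds 40
```

For sheets with fewer than 40 rows the result is larger than the sheet itself, so a caller would in practice scan only the rows that exist.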
In step S220, within the header row range of the sheet content determined above, the names of the column fields in each record row of the original record are extracted. The names of the column fields within the header row range are called target fields. In addition, a standard header keyword set database can be preset; it stores the header keywords that result from normalizing the target fields. For example, a target field of record row 1 in the original record of Table 1 above is "price", and the header keyword set also contains the header keyword "price", so the target field and the header keyword are determined to be an exact match.
In step S230, if all target fields in a record row match their corresponding header keywords, the record row is determined to be an exact match. For example, in Table 1 and Table 3 above, the other target field of record row 1 in the original record of Table 1 is "product name", and the header keyword set also contains the header keyword "product name", so this target field and header keyword are also an exact match. Since all target fields in record row 1 of the original record of Table 1 match their corresponding header keywords, record row 1 in Table 1 is determined to be an exact match. In step S240, the exactly matched record row, for example record row 1 in Table 1, is used as the header row.
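Steps S220 to S240 can be sketched as a scan over the header row range that accepts the first row whose cells are all header keywords. The function name `find_header_row_exact`, the list-of-lists layout, and the decision to ignore empty cells are assumptions made for this example.

```python
def find_header_row_exact(rows, header_keywords, limit):
    """Return the index of the first row, within the first `limit` rows,
    whose non-empty cells are all header keywords; None if there is none."""
    for index, row in enumerate(rows[:limit]):
        cells = [cell.strip() for cell in row if cell and cell.strip()]
        if cells and all(cell in header_keywords for cell in cells):
            return index
    return None


sheet = [
    ["Supplier price list", ""],  # title line, not a header
    ["Product name", "Price"],    # every cell is a header keyword
    ["Cold-rolled strip coil", "180"],
]
print(find_header_row_exact(sheet, {"Product name", "Price"}, 40))
```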
In embodiments of the application, identifying the header pattern by exact matching is an important link in batch positioning of table records, and provides the positioning basis for the subsequent step of extracting records.
Fig. 3 is a flow chart of identifying the header pattern in the general record processing method according to an embodiment of the application. As shown in Fig. 3, in one embodiment, the method further includes:
Step S310: when no record row in the header row range is an exact match, calculate a first mixed matching-degree index, which is the mixed matching-degree index between the target fields in each record row and the preset header keywords;
Step S320: when the first mixed matching-degree index between every target field in a record row and its corresponding header keyword is greater than or equal to a first preset threshold, determine that the record row is a successful fuzzy match;
Step S330: use the fuzzily matched record row as the header row.
In this embodiment, if the record row currently being examined is not an exact match, that is, the current record row does not exactly match a corresponding keyword for every target field, the matching criterion is relaxed and the target fields in each record row are fuzzily matched against the preset header keywords. For example, in step S310 the first mixed matching-degree index can be calculated with a formula defined over the following quantities:
score denotes the value of the calculated result, which serves for both the first and the second mixed matching-degree index; LCS denotes the longest common subsequence algorithm; ED denotes the edit distance; the function len(LCS(x, y)) computes the length of the longest common subsequence of strings x and y; the function len(z) computes the length of string z; ω denotes a weight; S_raw denotes the original string; and S_target denotes the target string.
LCS is the abbreviation of Longest Common Subsequence. A sequence is the longest common subsequence of two or more known sequences if it is a subsequence of each of them and is the longest of all such common subsequences. The value of ω in the formula can be chosen from experimental results, for example by using a hill-climbing algorithm to select a locally optimal value.
For the first mixed matching-degree index, the original string can be a target field in a record row, and the target string can be a header keyword.
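The text here names the ingredients of the score (LCS length, edit distance, string length, and a weight ω) but does not reproduce the formula itself. The sketch below therefore shows one plausible way to blend the two measures and should be read as an assumption, not as the formula of the application: both parts are normalized by the longer string's length and combined as ω times the LCS part plus (1 − ω) times the edit-distance part. The function names and the default ω = 0.5 are likewise hypothetical.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of strings x and y."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(x, y):
    """Levenshtein edit distance between strings x and y."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                # delete cx
                           cur[j - 1] + 1,             # insert cy
                           prev[j - 1] + (cx != cy)))  # substitute or keep
        prev = cur
    return prev[-1]


def mixed_score(s_raw, s_target, omega=0.5):
    """Assumed blend of LCS similarity and edit-distance similarity, in [0, 1]."""
    if not s_raw or not s_target:
        return 0.0
    longest = max(len(s_raw), len(s_target))
    lcs_part = lcs_length(s_raw, s_target) / longest
    ed_part = 1.0 - edit_distance(s_raw, s_target) / longest
    return omega * lcs_part + (1.0 - omega) * ed_part


print(mixed_score("unit price", "price"))
```

A row would then count as a fuzzy match when every target field reaches the preset threshold against some header keyword; the threshold and ω would both need tuning on real data.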
In step S320, when the first matching degree mixing index is greater than or equal to the first preset threshold, the target field is determined to have fuzzily matched the corresponding table header keyword. If every target field in the current record row fuzzily matches at least one table header keyword in the header keyword set, the current record row is determined to have fuzzily matched. In step S330, the current record row determined to have fuzzily matched is taken as the table header row.
For example, the target field in the original record of Table 2 above is "unit price", and the corresponding keyword in the table header keyword set is "price"; "unit price" and "price" can then be fuzzily matched. If the matching degree mixing indexes of all target fields in record row 1 of the original record in Table 2 are greater than or equal to the first preset threshold, record row 1 in Table 2 is determined to have fuzzily matched, and record row 1 in Table 2 is taken as the table header row.
In the embodiments of the present application, identifying the table header pattern by fuzzy matching improves fault tolerance, so that a good recognition result can be achieved even for poorly standardized data.
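The exact-then-fuzzy header search can be sketched as follows. This is an illustration under assumptions: `similarity` uses `difflib.SequenceMatcher` as a stand-in for the matching degree mixing index, a row counts as a header row when every required keyword is matched by some cell, and the names `match_row`, `find_header_row` and the threshold 0.6 are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in score in [0, 1]; the application uses its own mixing index here.
    return SequenceMatcher(None, a, b).ratio()


def match_row(cells, keywords, threshold=0.6, score=similarity):
    """Return 'exact', 'fuzzy', or None for a single candidate row."""
    if all(keyword in cells for keyword in keywords):
        return "exact"
    # Relaxed condition: every keyword needs at least one sufficiently similar cell.
    if all(any(score(cell, keyword) >= threshold for cell in cells) for keyword in keywords):
        return "fuzzy"
    return None


def find_header_row(rows, keywords, threshold=0.6, score=similarity):
    """Index of the first row that matches exactly or fuzzily, else None."""
    for index, cells in enumerate(rows):
        if match_row(cells, keywords, threshold, score) is not None:
            return index
    return None
```

With keywords "product name", "price" and "origin", a row containing "unit price" instead of "price" fails the exact test but passes the relaxed one, which mirrors the "unit price" example above.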
Fig. 4 is a flowchart of extracting record rows in the general record processing method according to an embodiment of the present application. As shown in Fig. 4, in one embodiment, step S120 in Fig. 1, extracting record rows from the original record based on the table header pattern, includes:
Step S410: taking the distribution of the column numbers of the valid column data in the original record as the record rule;
Step S420: extracting record rows from the original record according to the record rule and the table header pattern.
In this embodiment, the record rule is generated as follows. A standard record form is established in advance. Taking source-of-goods records as an example, the table header keywords of the standard record form can be defined to include: product name, price, place of production and production date. The original source-of-goods record table is compared with the standard record form to identify its invalid column data, that is, to identify useless information. Column data that differs from the standard record form is regarded as useless information. The fields corresponding to valid column data are called target fields, and the distribution of the target field column numbers is taken as the record rule. For example, the column data in the original source-of-goods record table includes: column 1: product name; column 2: product category; column 3: price; column 4: appearance color; column 5: place of production; column 6: product grade; column 7: production date. Compared with the standard record form, columns 1, 3, 5 and 7 of the original source-of-goods record table are then determined to be the target field column numbers.
In step S410, all table headers in the original record are analyzed to determine the target field column numbers in each header, and the distribution of target field column numbers is taken as the record rule. In step S420, a record search is carried out in the original record, and each record is extracted according to the table header pattern and the record rule.
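As a small illustration of step S410, the record rule can be represented as a mapping from each target field to its 1-based column number. The function name `build_record_rule` and the English field names are assumptions made for this sketch.

```python
def build_record_rule(header_cells, standard_fields):
    """Map each standard field found in the header row to its 1-based column number."""
    rule = {}
    for column, cell in enumerate(header_cells, start=1):
        if cell in standard_fields:
            rule[cell] = column
    return rule


# Header of the original source-of-goods table from the example above.
header = ["product name", "product category", "price", "appearance color",
          "place of production", "product grade", "production date"]
standard = {"product name", "price", "place of production", "production date"}
rule = build_record_rule(header, standard)
```

Columns whose header is not in the standard form are simply absent from the rule, which is how the invalid column data is ignored in later steps.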
In one example, the steps of the record search are as follows:
(1) The column range of each sub-table in the original record table is calculated according to the record rule. In the original source-of-goods record table example above, columns 1, 3, 5 and 7, determined to be the target field column numbers, form the column range in which the search is carried out.
(2) Searching downward from the header row of each sub-table, candidate record rows are collected. The search ends when the current row reaches the bottom of the sheet, or when a conflict occurs with the header row or column range of another sub-table.
(3) The candidate record rows obtained in step (2) are filtered, and the rows that contain valid content for the required target fields are retained as record rows. For example, "product name" contains at least one Chinese character, "trade mark" is not empty, and "place of production" contains at least one Chinese character.
(4) According to the record rule, the record rows are converted into the standard record form, for example with the column fields "(product name, price, place of production, production date)".
In the embodiments of the present application, tables are located in batches based on the table header pattern and the record rule, and the subsequent text normalization is carried out on this basis, which ensures the validity and standardization of the data and improves processing efficiency.
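Search steps (2) to (4) above can be sketched for a single sub-table as below. The names `extract_records`, `is_header` and `validators` are hypothetical, and the validity checks are left to the caller rather than reproducing the Chinese-character checks from the example.

```python
def extract_records(rows, header_index, rule, is_header, validators):
    """Collect rows below a header row and project them onto the standard fields.

    rule maps field name -> 1-based column number; validators maps a required
    field name -> predicate on its cell text.
    """
    records = []
    for cells in rows[header_index + 1:]:
        if is_header(cells):
            break  # ran into the header row of another sub-table
        # Step (4): keep only the target columns, in standard-field form.
        record = {
            field: cells[column - 1] if column <= len(cells) else ""
            for field, column in rule.items()
        }
        # Step (3): keep the row only if every required field has valid content.
        if all(check(record[field]) for field, check in validators.items()):
            records.append(record)
    return records
```

Reaching the end of `rows` corresponds to hitting the bottom of the sheet; a conflict with another sub-table is reduced here to detecting its header row.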
Fig. 5 is a flowchart of matching in the general record processing method according to an embodiment of the present application. As shown in Fig. 5, in one embodiment, matching the original fields in the original record against the preset standard fields in step S130 of Fig. 1 includes:
Step S510: writing history match records of original fields successfully matched to standard fields into a cache;
Step S520: if the current original field to be matched matches an original field in the history match records, determining that the current original field to be matched and the standard field are successfully matched through the cache.
In step S130, by matching the original fields in the original record against the preset standard fields, the non-standard field values in the original record are aligned with the standard field values.
In step S510, while the original record table is being normalized, original fields are matched against standard fields and the results of successful matches are written into the cache. The history match records written into the cache contain the mapping relations between original fields and standard fields.
In step S520, the cache is queried to confirm whether a mapping relation between a successfully matched original field and a standard field already exists. If a mapping relation between the current original field to be matched and a standard field exists, the current original field to be matched and the standard field are determined to be successfully matched through the cache. For example, the history match records already contain a mapping relation between the original field "unit price" and the standard field "price". If the current original field to be matched is also "unit price", that is, it matches an original field in the history match records, the standard field corresponding to the current original field "unit price" is determined to be "price".
In the embodiments of the present application, matching original fields to standard fields through data in the cache increases processing speed and improves system performance.
Fig. 6 is a flowchart of matching in the general record processing method according to an embodiment of the present application. As shown in Fig. 6, in one embodiment, the method further includes:
Step S610: when the current original field to be matched and the standard field are not successfully matched through the cache, matching the current original field to be matched against the standard fields in a preset field value set;
Step S620: if the current original field to be matched matches a standard field in the preset field value set, determining that the current original field to be matched and the standard field are successfully matched through the field value set.
For example, if the query in step S520 finds no mapping relation between the current original field to be matched and a standard field in the history match records, the value set items in the field value set are enumerated in step S610 and the current original field to be matched is exactly matched against them. In step S620, if the current original field to be matched exactly matches a value set item in the field value set, for example a row in the record table reads "product name: cold rolled strip coil", "place of production: Shanghai", and the value set items in the field value set also include "cold rolled strip coil" and "Shanghai", the current original field to be matched and the standard field are determined to be successfully matched through the field value set.
In the embodiments of the present application, matching original fields to standard fields through the field value set ensures the accuracy of the match and thereby improves the accuracy of the generated data.
Fig. 7 is a flowchart of matching in the general record processing method according to an embodiment of the present application. As shown in Fig. 7, in one embodiment, the method further includes:
Step S710: when the current original field to be matched and the standard field are not successfully matched through the field value set, matching the current original field to be matched against the aliases of the standard fields in a preset rule base, where the rule base stores the mapping relations between standard fields and the aliases of standard fields;
Step S720: if the current original field to be matched matches an alias of a standard field in the preset rule base, determining that the current original field to be matched and the standard field are successfully matched through the rule base.
For example, if in step S620 the current original field to be matched does not exactly match any value set item in the field value set, the error-correction rule list is enumerated in step S710 and exact matching against the error-correction rules is carried out, that is, the current original field to be matched is exactly matched against the aliases of the standard fields in the rule base. An alias is a name other than the official or standard name. Taking the place of production "Shanghai" as an example, if the standard field is "Shanghai", its alias may be "Big Shanghai" or another name for the city.
In step S720, if the current original field to be matched exactly matches an alias of a standard field in the rule base, the mapping from the original field to the standard field can be determined. For example, a field in the original record table records the place of production under an alias of Shanghai, and the rule base holds the standard field "place of production: Shanghai" together with that alias. The original field is then replaced with the standard field "place of production: Shanghai".
In the embodiments of the present application, original fields are matched to standard fields through the rule base. Because the rule base stores the mapping relations between standard fields and their aliases, the data becomes compatible with variant names, which improves the processing capability of the system.
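The rule-base lookup of steps S710 and S720 amounts to an exact match on aliases. A minimal sketch, assuming the rule base is held as a dictionary from each standard field to its set of aliases (the structure and the names `build_alias_index` and `match_by_alias` are illustrative, not taken from the application):

```python
def build_alias_index(rule_base):
    """Invert {standard field: aliases} into {alias: standard field} for exact lookup."""
    return {alias: standard for standard, aliases in rule_base.items() for alias in aliases}


def match_by_alias(original_field, alias_index):
    """Return the standard field whose alias equals original_field, or None."""
    return alias_index.get(original_field)


rule_base = {"Shanghai": {"Big Shanghai"}}
alias_index = build_alias_index(rule_base)
```

Inverting the mapping once keeps each lookup a constant-time dictionary access instead of enumerating the whole error-correction rule list per field.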
Fig. 8 is a flowchart of matching in the general record processing method according to an embodiment of the present application. As shown in Fig. 8, in one embodiment, the method further includes:
Step S810: when the current original field to be matched and the standard field are not successfully matched through the rule base, calculating a second matching degree mixing index, the second matching degree mixing index being the matching degree mixing index between the current original field to be matched and a standard field in the field value set;
Step S820: when the second matching degree mixing index is greater than or equal to a second preset threshold, determining that the current original field to be matched and the standard field are successfully fuzzily matched.
For example, if in step S720 the current original field to be matched does not exactly match any alias of a standard field in the rule base, the second matching degree mixing index is calculated in step S810 with the following formula:
Here, score denotes the computed value of the first matching degree mixing index and of the second matching degree mixing index; LCS denotes the longest common subsequence algorithm; ED denotes the edit distance; the function len(LCS(x, y)) computes the length of the longest common subsequence of strings x and y; the function len(z) computes the length of string z; ω denotes a weight; S_raw denotes the original string; and S_target denotes the target string.
LCS is short for Longest Common Subsequence. A sequence that is a subsequence of two or more known sequences, and is the longest among all such common subsequences, is their longest common subsequence. The value of ω in the formula can be chosen according to experimental results, for example by using a hill-climbing algorithm to select a locally optimal value.
In the second matching degree mixing index, the original string is the current original field to be matched, and the target string is a standard field in the field value set.
In step S820, if the matching degree mixing index between the current original field to be matched and some standard field in the field value set is greater than or equal to the second preset threshold, the current original field to be matched and that standard field are determined to be successfully fuzzily matched.
In one embodiment, a knowledge graph can be built over the field value set to represent the topological relations among the value set items. A knowledge graph consists of interconnected entities and their attributes. It is composed of individual pieces of knowledge, each of which can be expressed as an SPO triple (Subject-Predicate-Object), where Subject denotes the subject, Predicate denotes the predicate, and Object denotes the object. A knowledge graph can describe a collection of knowledge with a topological graph, which suits expressing abstract relations between knowledge entities, and it is commonly used for abstract association search between knowledge entities.
In the embodiments of the present application, the knowledge entities are the texts of the various value set items, one attribute of an entity is the literal content of its text, and the relation between two entities is either similar or not similar. Taking steel products as an example: for the variety value set, similar entities share the same major variety category, such as plate, tubing or medium-thick plate; for place-of-production entities, similar entities belong to the same steel mill group, such as Baowu, Masteel or Handan Iron and Steel; for warehouse entities, similar entities share the region to which their address belongs, such as Shanghai, Wuhan or Nanjing.
Each group of similar entities forms a knowledge entity cluster, and each cluster has one central entity. The frequency of occurrence of all value set items is counted by sampling a large number of source-of-goods records, and in each entity cluster the entity whose value set item occurs most frequently is selected as the cluster's central entity. For example, in Fig. 9B, "cold rolled strip coil" is the cluster's central entity, and "the compound cold rolling coil of stainless steel", "aluminum-zinc alloy resistance and fingerprint resistance coiled sheet" and "color coating coiled sheet (electric zinc-base plate)" are non-central entities similar to "cold rolled strip coil".
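Selecting each cluster's central entity by sampled frequency can be sketched with a counter. The cluster labels, the function name `choose_cluster_centers` and the sample data below are made up for illustration.

```python
from collections import Counter


def choose_cluster_centers(clusters, sampled_values):
    """For each cluster, pick the member that occurs most often in the sample.

    clusters maps a cluster label -> list of value set items in that cluster;
    sampled_values is an iterable of value set items seen in sampled records.
    """
    frequency = Counter(sampled_values)
    # Counter returns 0 for unseen items, so every member can be ranked.
    return {
        label: max(members, key=lambda item: frequency[item])
        for label, members in clusters.items()
    }


clusters = {"coil": ["coil A", "coil B"], "wire": ["wire A"]}
sample = ["coil B", "coil A", "coil B", "wire A"]
centers = choose_cluster_centers(clusters, sample)
```

Ties are broken by member order in this sketch because `max` returns the first maximal item; the application does not state a tie-breaking rule.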
While step S820 is executed, the value set items in the field value set can be enumerated in a traversal, the matching degree mixing index between the original field and the text of each value set item is measured, and whether the match succeeds is determined from these results together. The order of this traversal can be determined by the topological relations of the value set items in the KG (Knowledge Graph). In one example, to simplify the entity search, the similarity relations between non-central entities within each cluster are removed, and the matching degree mixing indexes between the record and the central entity value set items of the different clusters in the KG are computed first. If the original field does not successfully match the central entity of any cluster, the central entities are sorted in descending order of their matching degree mixing index with the original field, the central entity with the largest index is obtained, and the non-central entities in that central entity's cluster in the KG are then matched in turn.
Fig. 9A and Fig. 9B are schematic diagrams of the knowledge graph in the general record processing method according to an embodiment of the present application. The values on the connecting lines in Fig. 9A and Fig. 9B are the calculated matching degree mixing indexes between the nouns at the two ends of each line. As shown in Fig. 9A and Fig. 9B, the original field in the original record is "cold rolling", so the matching degree mixing indexes between the original field and the central entities of the different clusters are calculated first. If no match succeeds, the matching degree mixing indexes between the original field and the central entities of the clusters are sorted in descending order. In the example of Fig. 9A, the central entities in the knowledge graph include "cold rolled strip coil" and "general line". The calculation gives a matching degree mixing index of 0.5 for "cold rolling" and "cold rolled strip coil", and 0.0 for "cold rolling" and "general line", so after the descending sort "cold rolling" and "cold rolled strip coil" have the highest index. Turning to the example of Fig. 9B, the non-central entities in the KG cluster centered on "cold rolled strip coil" are then matched in turn, that is, the matching degree mixing indexes of "cold rolling" with "the compound cold rolling coil of stainless steel", "aluminum-zinc alloy resistance and fingerprint resistance coiled sheet" and "color coating coiled sheet (electric zinc-base plate)" are calculated. Referring to Fig. 9B, the index of "cold rolling" and "color coating coiled sheet (electric zinc-base plate)" is 0.0, the index of "cold rolling" and "aluminum-zinc alloy resistance and fingerprint resistance coiled sheet" is 0.0, and the index of "cold rolling" and "the compound cold rolling coil of stainless steel" is 0.25. The earlier step already gave 0.5 for "cold rolling" and "cold rolled strip coil", so among these indexes the one for "cold rolling" and "cold rolled strip coil" is the largest. If that index is greater than or equal to the second preset threshold, the current original field to be matched, "cold rolling", and the standard field "cold rolled strip coil" are determined to be successfully fuzzily matched.
In the embodiments of the present application, performing fuzzy matching with the matching degree mixing index improves fault tolerance, so that a good data normalization result can be achieved even for poorly standardized data.
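The two-stage traversal over cluster centers and then over one cluster's members can be sketched as follows. This is one reading of the procedure, under assumptions: a center counts as "successfully matched" when its score reaches the threshold, the scoring function is passed in by the caller, and the names `match_in_graph` and `figure_scores` are hypothetical. The scores in the usage lines are the ones quoted from Fig. 9A and Fig. 9B.

```python
def match_in_graph(raw, clusters, score, threshold):
    """clusters maps each central entity -> list of its non-central entities."""
    if not clusters:
        return None
    # Stage 1: score the central entities only and keep the best one.
    best_center = max(clusters, key=lambda center: score(raw, center))
    if score(raw, best_center) >= threshold:
        return best_center
    # Stage 2: search inside the cluster of the best-scoring center.
    candidates = [best_center, *clusters[best_center]]
    best = max(candidates, key=lambda item: score(raw, item))
    return best if score(raw, best) >= threshold else None


figure_scores = {
    "cold rolled strip coil": 0.5,
    "general line": 0.0,
    "the compound cold rolling coil of stainless steel": 0.25,
    "aluminum-zinc alloy resistance and fingerprint resistance coiled sheet": 0.0,
    "color coating coiled sheet (electric zinc-base plate)": 0.0,
}
clusters = {
    "cold rolled strip coil": [
        "the compound cold rolling coil of stainless steel",
        "aluminum-zinc alloy resistance and fingerprint resistance coiled sheet",
        "color coating coiled sheet (electric zinc-base plate)",
    ],
    "general line": [],
}
lookup = lambda raw, item: figure_scores[item]
matched = match_in_graph("cold rolling", clusters, lookup, threshold=0.4)
```

Restricting stage 2 to a single cluster means only the centers plus one cluster's members are scored, rather than every value set item.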
In one embodiment, if an original field is successfully matched to a standard field, the mapping relation between the successfully matched original field value and the value set item is written into the cache. When the cache capacity overflows, scheduling is performed with the LRU (Least Recently Used) algorithm. The LRU algorithm is used in memory scheduling scenarios and can also be used in cache scheduling scenarios. Taking cache scheduling as an example, when the cache space is full and overflows, one element is selected from those currently hit least in the cache and removed from the cache, and the new element is added to the cache.
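A minimal sketch of an LRU cache for the original-field-to-value-set-item mappings is given below, assuming Python's `collections.OrderedDict`. It follows the standard LRU policy, evicting the entry that has gone longest without being read or written; the class name and capacity are illustrative.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity mapping that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # a hit makes the entry most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(capacity=2)
cache.put("unit price", "price")
cache.put("Big Shanghai", "Shanghai")
cache.get("unit price")            # refreshes "unit price"
cache.put("cold rolling", "cold rolled strip coil")  # evicts "Big Shanghai"
```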
The steps shown in Fig. 8 can be executed serially, or the program's jump flow can be set according to the execution results. For example, step S520 is executed to query the cache; if the query fails, step S610 is executed next, and if the query succeeds, execution jumps directly to step S510 and the mapping relation between the successfully matched original field value and the value set item is written into the cache. Similarly, steps S610 and S620 are executed to enumerate and match the value set items; if the match fails, step S710 is executed next, and if the match succeeds, execution jumps directly to step S510 and the mapping relation is written into the cache. By analogy, steps S710 and S720 are executed to enumerate and match the error-correction rule list; if the match fails, step S810 is executed next, and if the match succeeds, execution jumps directly to step S510 and the mapping relation is written into the cache. When step S820 is executed, execution likewise jumps directly to step S510 if the match succeeds. In summary, the steps in Fig. 8 are executed in sequence: as soon as one step matches successfully, execution jumps to step S510; otherwise the next step is executed.
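The jump flow above can be sketched as a single function that tries the cache, the value set, the alias rules and finally fuzzy matching, writing any successful result back to the cache. The function name `normalize_field`, the plain dictionary cache, and the caller-supplied `score` function are assumptions for illustration.

```python
def normalize_field(raw, cache, value_set, alias_index, score, threshold):
    """Return the standard value for raw, or None if every stage fails."""
    if raw in cache:                      # S520: cache lookup
        return cache[raw]
    if raw in value_set:                  # S610/S620: exact match on value set items
        result = raw
    elif raw in alias_index:              # S710/S720: exact match on aliases
        result = alias_index[raw]
    else:                                 # S810/S820: fuzzy match against the value set
        best = max(value_set, key=lambda item: score(raw, item), default=None)
        result = best if best is not None and score(raw, best) >= threshold else None
    if result is not None:
        cache[raw] = result               # S510: record the successful mapping
    return result
```

Ordering the stages from cheapest to most expensive means the fuzzy comparison, which scores every value set item, runs only for fields that nothing else could resolve.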
During the normalization described above, it may happen that no normalization result can be obtained, that is, normalization fails. Whether normalization is allowed to fail can be preset. In one example, in a scenario where normalization failure is not allowed, a default value set according to statistics over a large number of samples can be used as the normalization result for the record field value. In this example, for each original field, the standard fields corresponding to that original field in the historically generated general records can be counted in advance, together with the number of successful matches between the original field and each corresponding standard field. For example, the original field is "ProductName: banana apple". In the historically generated general records, "ProductName" original fields have been successfully matched to "ProductName: banana" 50 times in total and to "ProductName: apple" 500 times in total. The field with the most successful matches is then set as the default value, and "ProductName: apple" is taken as the normalization result of "ProductName: banana apple". In this way of setting the default value, original fields are not distinguished: the "ProductName" value with the most accumulated historical hits is selected as the normalization default, which in terms of probability maximizes the chance that the normalization is correct.
In the example above, in a scenario where normalization failure is allowed, a normalization failure identifier can be used as the normalization result.
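The fallback described in the two paragraphs above can be sketched with per-field hit counters. The marker string `FAILED`, the function name `fallback_value`, and the shape of `history` are assumptions; the counts are the ones from the "ProductName" example.

```python
from collections import Counter

FAILED = "<NORMALIZATION_FAILED>"  # hypothetical failure identifier


def fallback_value(field_name, history, allow_failure):
    """Result to use when no matching stage produced a standard value.

    history maps a field name -> Counter of standard values and their
    accumulated successful-match counts.
    """
    if allow_failure:
        return FAILED
    counts = history.get(field_name)
    if not counts:
        return FAILED  # nothing to fall back on for this field
    return counts.most_common(1)[0][0]


history = {"ProductName": Counter({"banana": 50, "apple": 500})}
default = fallback_value("ProductName", history, allow_failure=False)
```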
Fig. 10 is a module design and data flow diagram of the general record processing method according to an embodiment of the present application. As shown in Fig. 10, the embodiment of the present application uses text fuzzy matching and a rule base to build a general record parsing and normalization system for a specific business scenario. The system includes a record search module, a normalization module and a data management module. The record search module uses a batch table location algorithm to extract original records in batches from all sheets of the spreadsheet file holding the original records. The record search module includes table header search module 1, record rule generation module 2 and record search module 4. The normalization module includes header field association module 3 and record normalization module 7. The table header search module identifies the table header pattern to obtain an original header file. The header field association module analyzes the association between the headers in the original header file and the keywords in the header keyword set, puts associated original headers and header keywords into the same group, and obtains a group-associated original header file. The record rule generation module compares the original header file with the group-associated original header file and generates the record rule. The record search module extracts original records in batches from all sheets of the spreadsheet file holding the original records according to the record rule.
Referring to Fig. 10, the record normalization module generates normalized records and a normalized record list from the original records extracted by the record search module, the normalization rules in the normalization rule base, and the specific value set item information in the field value set. The header field association module and the record normalization module use LCS, edit distance, the knowledge graph (KG) and the LRU algorithm to carry out header field association or record normalization.
Referring to Fig. 10, the data management module includes value set management module 6 and error-correction rule management module 5, and is used to maintain the value set data and the error-correction rule data.
The value set management module provides the value set management function, that is, it maintains several updatable sets of standard field values (value set items). Specifically, it includes a value set item addition interface, a value set item deletion interface, a value set item modification interface and a value set item query interface.
The error-correction rule management module provides the error-correction rule management function, that is, it maintains several sets of mapping relations from original field values to standard field values (value set items), which solves the problem that the matching degree mixing index cannot handle an original field that is an alias of a standard field. Specifically, it includes an error-correction rule addition interface, an error-correction rule deletion interface, an error-correction rule modification interface and an error-correction rule query interface. Users can define aliases of standard fields through these interfaces.
Fig. 11 is a structural diagram of a general record processing device according to an embodiment of the present application. As shown in Fig. 11, the general record processing device of the embodiment of the present application includes:
a recognition unit 100, configured to identify the table header pattern of an original record;
an extraction unit 200, configured to extract record rows from the original record based on the table header pattern;
a matching unit 300, configured to match the original fields in the original record against preset standard fields;
a generation unit 400, configured to replace, in the extracted record rows, the corresponding original fields with the successfully matched standard fields to generate the general record text.
Fig. 12 is a structural diagram of a general record processing device according to an embodiment of the present application. As shown in Fig. 12, in one embodiment, the recognition unit 100 includes a first recognition subunit 110, and the first recognition subunit 110 is configured to:
determine the header row range of the original record;
within the header row range, match the target fields in each record row against preset table header keywords;
when all target fields in a record row are successfully matched to their corresponding table header keywords, determine that the record row is exactly matched;
take the exactly matched record row as the table header row.
In one embodiment, the recognition unit 100 further includes a second recognition subunit 120, and the second recognition subunit 120 is configured to:
when exact matching fails for the record rows in the header row range, calculate a first matching degree mixing index, the first matching degree mixing index being the matching degree mixing index between the target fields in each record row and the preset table header keywords;
when the first matching degree mixing indexes between all target fields in a record row and their corresponding table header keywords are greater than or equal to a first preset threshold, determine that the record row is successfully fuzzily matched;
take the successfully fuzzily matched record row as the table header row.
In one embodiment, the extraction unit 200 is configured to:
take the distribution of the column numbers of the valid column data in the original record as the record rule;
extract record rows from the original record according to the record rule and the table header pattern.
Fig. 13 is a structural diagram of a general record processing device according to an embodiment of the present application. As shown in Fig. 13, in one embodiment, the matching unit 300 includes a first matching subunit 310, and the first matching subunit 310 is configured to:
write history match records of original fields successfully matched to standard fields into a cache;
if the current original field to be matched matches an original field in the history match records, determine that the current original field to be matched and the standard field are successfully matched through the cache.
In one embodiment, the matching unit 300 further includes a second matching subunit 320, and the second matching subunit 320 is configured to:
when the current original field to be matched and the standard field are not successfully matched through the cache, match the current original field to be matched against the standard fields in a preset field value set;
if the current original field to be matched matches a standard field in the preset field value set, determine that the current original field to be matched and the standard field are successfully matched through the field value set.
In one embodiment, the matching unit 300 further includes a third matching subunit 330, and the third matching subunit 330 is configured to:
when the current original field to be matched and the standard field are not successfully matched through the field value set, match the current original field to be matched against the aliases of the standard fields in a preset rule base, where the rule base stores the mapping relations between standard fields and the aliases of standard fields;
if the current original field to be matched matches an alias of a standard field in the preset rule base, determine that the current original field to be matched and the standard field are successfully matched through the rule base.
In one embodiment, the matching unit 300 further includes a fourth matching subunit 340, and the fourth matching subunit 340 is configured to:
when the current original field to be matched and the standard field are not successfully matched through the rule base, calculate a second matching degree mixing index, the second matching degree mixing index being the matching degree mixing index between the current original field to be matched and a standard field in the field value set;
when the second matching degree mixing index is greater than or equal to a second preset threshold, determine that the current original field to be matched and the standard field are successfully fuzzily matched.
For the functions of the units in the general record processing device of the embodiments of the present application, refer to the corresponding descriptions in the method above; details are not repeated here.
According to embodiments of the present application, the present application further provides an electronic device and a readable storage medium.
Figure 14 is a block diagram of an electronic device for the general record processing method according to an embodiment of the present application. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementations of the application described and/or claimed herein.
As shown in Figure 14, the electronic device includes one or more processors 1401, a memory 1402, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as needed. The processor can process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a graphical user interface (Graphical User Interface, GUI) on an external input/output device (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, each providing part of the necessary operations (for example, as a server array, a group of blade servers, or a multiprocessor system). Figure 14 takes one processor 1401 as an example.
The memory 1402 is the non-transitory computer-readable storage medium provided by the present application. The memory stores instructions executable by at least one processor, so that the at least one processor executes the general record processing method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the general record processing method provided by the present application.
As a non-transitory computer-readable storage medium, the memory 1402 can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules/units corresponding to the general record processing method in the embodiments of the present application (for example, the recognition unit 100, the extraction unit 200, the matching unit 300, and the generation unit 400 shown in Figure 11; the first recognition subunit 110 and the second recognition subunit 120 shown in Figure 12; and the first matching subunit 310, the second matching subunit 320, the third matching subunit 330, and the fourth matching subunit 340 shown in Figure 13). By running the non-transitory software programs, instructions, and modules stored in the memory 1402, the processor 1401 executes the various functional applications and data processing of the server, that is, implements the general record processing method in the above method embodiments.
The memory 1402 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created through use of the electronic device for the general record processing method. In addition, the memory 1402 may include high-speed random access memory, and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 1402 optionally includes memories remotely located relative to the processor 1401, and these remote memories may be connected to the electronic device for the general record processing method through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The electronic device for the general record processing method may further include an input device 1403 and an output device 1404. The processor 1401, the memory 1402, the input device 1403, and the output device 1404 may be connected by a bus or in other ways; in Figure 14, connection by a bus is taken as an example.
The input device 1403 can receive input numeric or character information and generate key signal inputs related to the user settings and function control of the electronic device for the general record processing method; examples include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, a joystick, and other input devices. The output device 1404 may include a display device, an auxiliary lighting device (for example, an LED), a haptic feedback device (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (Liquid Crystal Display, LCD), a light-emitting diode (Light Emitting Diode, LED) display, and a plasma display. In some embodiments, the display device may be a touch screen.
Various implementations of the systems and techniques described herein can be realized in digital electronic circuitry, integrated circuit systems, application-specific integrated circuits (Application Specific Integrated Circuits, ASIC), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and can be implemented in high-level procedural and/or object-oriented programming languages and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (for example, magnetic disks, optical disks, memory, programmable logic devices (programmable logic device, PLD)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide interaction with a user, the systems and techniques described herein can be implemented on a computer that has a display device for displaying information to the user (for example, a CRT (Cathode Ray Tube) or LCD (liquid crystal display) monitor), and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or haptic feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein can be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (Local Area Network, LAN), a wide area network (Wide Area Network, WAN), and the Internet.
A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other.
According to the technical solutions of the embodiments of the present application, points of interest are identified directly from content related to the user's information behavior, which ensures that the points of interest pushed to the user match the user's intent and provides a good user experience. Because the points of interest are identified directly from content related to the user's information behavior, the problem of pushed points of interest failing to meet the user's needs is avoided, thereby improving the user experience.
It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present application can be achieved; no limitation is imposed herein.
The above specific embodiments do not limit the protection scope of the present application. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (18)

1. A general record processing method, characterized by comprising:
Identifying the header pattern of an original record;
Extracting record rows from the original record based on the header pattern;
Matching original fields in the original record with preset standard fields;
In the extracted record rows, replacing the corresponding original fields with the successfully matched standard fields to generate general record text.
2. The method according to claim 1, characterized in that identifying the header pattern of the original record comprises:
Determining the header row range of the original record;
Within the header row range, matching the target fields in each record row with preset header keywords;
In the case where all target fields in a record row successfully match the corresponding header keywords, determining that the record row is successfully exactly matched;
Taking the successfully exactly matched record row as the header row.
3. The method according to claim 2, characterized in that the method further comprises:
In the case where no record row in the header row range is successfully exactly matched, calculating a first hybrid matching-degree index, the first hybrid matching-degree index being the hybrid matching-degree index between the target fields in each record row and the preset header keywords;
In the case where the first hybrid matching-degree indices between all target fields in a record row and the corresponding header keywords are greater than a first preset threshold, determining that the record row is successfully fuzzy-matched;
Taking the successfully fuzzy-matched record row as the header row.
4. The method according to any one of claims 1-3, characterized in that extracting record rows from the original record based on the header pattern comprises:
Taking the distribution of column numbers corresponding to valid column data in the original record as a record rule;
Extracting record rows from the original record according to the record rule and the header pattern.
5. The method according to any one of claims 1-3, characterized in that matching the original fields in the original record with the preset standard fields comprises:
Writing history matching records of successful matches between original fields and standard fields into a cache;
If the current original field to be matched successfully matches an original field in the history matching records, determining that the current original field to be matched and the standard field are successfully matched through the cache.
6. The method according to claim 5, characterized in that the method further comprises:
In the case where the current original field to be matched and the standard field fail to match through the cache, matching the current original field to be matched with the standard fields in a preset field value set;
If the current original field to be matched successfully matches a standard field in the preset field value set, determining that the current original field to be matched and the standard field are successfully matched through the field value set.
7. The method according to claim 6, characterized in that the method further comprises:
In the case where the current original field to be matched and the standard field fail to match through the field value set, matching the current original field to be matched with the aliases of the standard fields in a preset rule base, wherein the rule base is used to store the mapping relationships between the standard fields and the aliases of the standard fields;
If the current original field to be matched successfully matches an alias of a standard field in the preset rule base, determining that the current original field to be matched and the standard field are successfully matched through the rule base.
8. The method according to claim 7, characterized in that the method further comprises:
In the case where the current original field to be matched and the standard field fail to match through the rule base, calculating a second hybrid matching-degree index, the second hybrid matching-degree index being the hybrid matching-degree index between the current original field to be matched and the standard fields in the field value set;
In the case where the second hybrid matching-degree index is greater than or equal to a second preset threshold, determining that the current original field to be matched and the standard field are successfully fuzzy-matched.
9. A general record processing apparatus, characterized by comprising:
A recognition unit, configured to identify the header pattern of an original record;
An extraction unit, configured to extract record rows from the original record based on the header pattern;
A matching unit, configured to match original fields in the original record with preset standard fields;
A generation unit, configured to replace, in the extracted record rows, the corresponding original fields with the successfully matched standard fields to generate general record text.
10. The apparatus according to claim 9, characterized in that the recognition unit includes a first recognition subunit, and the first recognition subunit is configured to:
Determine the header row range of the original record;
Within the header row range, match the target fields in each record row with preset header keywords;
In the case where all target fields in a record row successfully match the corresponding header keywords, determine that the record row is successfully exactly matched;
Take the successfully exactly matched record row as the header row.
11. The apparatus according to claim 10, characterized in that the recognition unit further includes a second recognition subunit, and the second recognition subunit is configured to:
In the case where no record row in the header row range is successfully exactly matched, calculate a first hybrid matching-degree index, the first hybrid matching-degree index being the hybrid matching-degree index between the target fields in each record row and the preset header keywords;
In the case where the first hybrid matching-degree indices between all target fields in a record row and the corresponding header keywords are greater than a first preset threshold, determine that the record row is successfully fuzzy-matched;
Take the successfully fuzzy-matched record row as the header row.
12. The apparatus according to any one of claims 9-11, characterized in that the extraction unit is configured to:
Take the distribution of column numbers corresponding to valid column data in the original record as a record rule;
Extract record rows from the original record according to the record rule and the header pattern.
13. The apparatus according to any one of claims 9-11, characterized in that the matching unit includes a first matching subunit, and the first matching subunit is configured to:
Write history matching records of successful matches between original fields and standard fields into a cache;
If the current original field to be matched successfully matches an original field in the history matching records, determine that the current original field to be matched and the standard field are successfully matched through the cache.
14. The apparatus according to claim 13, characterized in that the matching unit further includes a second matching subunit, and the second matching subunit is configured to:
In the case where the current original field to be matched and the standard field fail to match through the cache, match the current original field to be matched with the standard fields in a preset field value set;
If the current original field to be matched successfully matches a standard field in the preset field value set, determine that the current original field to be matched and the standard field are successfully matched through the field value set.
15. The apparatus according to claim 14, characterized in that the matching unit further includes a third matching subunit, and the third matching subunit is configured to:
In the case where the current original field to be matched and the standard field fail to match through the field value set, match the current original field to be matched with the aliases of the standard fields in a preset rule base, wherein the rule base is used to store the mapping relationships between the standard fields and the aliases of the standard fields;
If the current original field to be matched successfully matches an alias of a standard field in the preset rule base, determine that the current original field to be matched and the standard field are successfully matched through the rule base.
16. The apparatus according to claim 15, characterized in that the matching unit further includes a fourth matching subunit, and the fourth matching subunit is configured to:
In the case where the current original field to be matched and the standard field fail to match through the rule base, calculate a second hybrid matching-degree index, the second hybrid matching-degree index being the hybrid matching-degree index between the current original field to be matched and the standard fields in the field value set;
In the case where the second hybrid matching-degree index is greater than or equal to a second preset threshold, determine that the current original field to be matched and the standard field are successfully fuzzy-matched.
17. An electronic device, characterized by comprising:
At least one processor; and
A memory communicatively connected to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-8.
18. A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions are used to cause a computer to perform the method according to any one of claims 1-8.
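As an illustration of the header identification and record row extraction recited in the method claims, the following Python sketch detects a header row by exact keyword matching with a fuzzy fallback, then extracts record rows using the positions of the valid columns. It rests on several assumptions not stated in the claims: the header row range is taken to be the first few rows, `difflib.SequenceMatcher` stands in for the first hybrid matching-degree index, the "record rule" is read as the set of non-empty header column positions, and the keyword list, threshold, and function names are hypothetical.

```python
from difflib import SequenceMatcher

HEADER_KEYWORDS = ["date", "amount", "customer"]  # hypothetical preset header keywords


def _cells(row):
    """Normalized non-empty cells of a row."""
    return [c.strip().lower() for c in row if c.strip()]


def _best_score(cell):
    """Assumed stand-in for the first hybrid matching-degree index."""
    return max(SequenceMatcher(None, cell, k).ratio() for k in HEADER_KEYWORDS)


def find_header_row(rows, max_header_rows=5, threshold=0.8):
    """Return the index of the header row, or None if no row qualifies."""
    candidates = rows[:max_header_rows]  # header row range (assumed: first N rows)
    # Exact pass: every target field equals a preset header keyword.
    for i, row in enumerate(candidates):
        cells = _cells(row)
        if cells and all(c in HEADER_KEYWORDS for c in cells):
            return i
    # Fuzzy pass: only reached when no row matched exactly.
    for i, row in enumerate(candidates):
        cells = _cells(row)
        if cells and all(_best_score(c) > threshold for c in cells):
            return i
    return None


def extract_records(rows, header_index):
    """Extract record rows whose valid columns are all populated."""
    header = rows[header_index]
    # Record rule (assumed reading): positions of the non-empty header columns.
    valid_cols = [j for j, name in enumerate(header) if name.strip()]
    if not valid_cols:
        return []
    records = []
    for row in rows[header_index + 1:]:
        if len(row) > max(valid_cols) and all(row[j].strip() for j in valid_cols):
            records.append({header[j].strip(): row[j].strip() for j in valid_cols})
    return records


rows = [
    ["Statement 2019", "", ""],
    ["Date", "Amount", "Customer"],
    ["2019-08-01", "12.50", "Alice"],
    ["", "", ""],
    ["2019-08-02", "7.00", "Bob"],
]
header_index = find_header_row(rows)
print(header_index, extract_records(rows, header_index))
```

The extracted dictionaries are keyed by the original header fields; replacing those keys with the matched standard fields would yield the general record text of claim 1.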

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910799571.9A CN110515999A (en) 2019-08-27 2019-08-27 General record processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110515999A true CN110515999A (en) 2019-11-29

Family

ID=68628282

Country Status (1)

Country Link
CN (1) CN110515999A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination