CN110795425A - Method, device, equipment and medium for cleaning and merging customs data - Google Patents

Info

Publication number
CN110795425A
Authority
CN
China
Prior art keywords
data
bill
company name
name information
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911057701.8A
Other languages
Chinese (zh)
Other versions
CN110795425B (en)
Inventor
李超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yiyuan Network Technology Co Ltd
Original Assignee
Shanghai Yiyuan Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yiyuan Network Technology Co Ltd
Priority to CN201911057701.8A
Publication of CN110795425A
Application granted
Publication of CN110795425B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2455: Query execution
    • G06F16/24564: Applying rules; Deductive queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063: Operations research, analysis or management
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Quality & Reliability (AREA)
  • Economics (AREA)
  • Operations Research (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method, a device, equipment and a medium for cleaning and merging customs data. The method comprises: extracting effective bill of lading data from original customs data; extracting company name information from the bill of lading data; judging whether the extracted company name information is valid company name information; matching the area information in the company name information according to a preset rule and, if the matching is successful, deleting the area information from the company name information; matching suffix information in the company name information according to a preset rule and, if the matching is successful, converting the suffix information into standard-format suffix information; and judging whether the database contains bill of lading data that can be merged with the bill of lading data processed in the preceding steps and, if so, merging the data. The invention extracts effective bill of lading data from the original customs data and cleans, processes and merges it, generating bill of lading data with a uniform format and centralized information, which makes it easier for users to find useful information.

Description

Method, device, equipment and medium for cleaning and merging customs data
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a method, a device, equipment and a medium for cleaning and merging customs data.
Background
Customs data is the import and export statistical data generated by customs in performing its import and export statistics function. Mining this data in depth can help enterprises grasp market trends in a timely, comprehensive and thorough manner and analyze business conditions in overseas markets.
However, the original customs data has the following problems:
first, the volume of original customs data is large, which makes it difficult for a user to query useful information;
second, the customs data covers many trading countries, which makes the data complex;
third, the customs data contains a large amount of junk information.
As a result, it is very difficult for a user to find useful information by processing the original customs data directly.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide, in view of the deficiencies in the prior art, a customs data cleaning and merging strategy that extracts effective bill of lading data from the original customs data and cleans, processes and merges it, so as to generate bill of lading data with a uniform format and centralized information, thereby making it easier for the user to find useful information.
In order to solve the technical problem, the first aspect of the invention discloses a method for cleaning and merging customs data, which comprises the following steps:
step one, extracting effective bill of lading data from original customs data;
step two, extracting company name information in the bill of lading data;
step three, judging whether the extracted company name information is valid company name information; if yes, entering step four, and if not, entering step seven;
step four, matching the area information in the company name information according to a preset rule; if the matching is successful, deleting the area information from the company name information and then entering step five; if the matching fails, directly entering step five;
step five, matching suffix information in the company name information according to a preset rule; if the matching is successful, converting the suffix information into standard-format suffix information and then entering step six; if the matching fails, directly entering step six;
step six, judging whether the database contains bill of lading data that can be merged with the bill of lading data processed in step five, and if so, merging the data;
and step seven, extracting the next effective bill of lading data from the original customs data, and entering step two.
In the above method for cleaning and merging customs data, judging in step six whether the database contains bill of lading data that can be merged with the bill of lading data processed in step five comprises:
step 601, directly storing the bill of lading data processed in step five into the database;
step 602, sorting all bill of lading data in the database according to company name information;
step 603, after the sorting is finished, extracting company name information from the bill of lading data adjacent to the latest stored bill of lading data, and extracting company name information from the latest stored bill of lading data;
step 604, calculating the similarity between the company name information in the latest stored bill of lading data and the company name information in the adjacent bill of lading data; and if the similarity is greater than a threshold, merging the latest stored bill of lading data with the adjacent bill of lading data.
In the above method for cleaning and merging customs data, the similarity calculation in step 604 is implemented by a Levenshtein Distance algorithm, an NGram Distance algorithm, or a Jaro-Winkler Distance algorithm.
In the above method for cleaning and merging customs data, the matching of the area information in the company name information according to the preset rule in step four is implemented by regular-expression matching.
A second aspect of the invention discloses a customs data cleaning and merging device, which comprises a bill of lading data extraction module, a company name information extraction module, a first judgment module, a first matching module, a second matching module and a second judgment module;
the bill of lading data extraction module is used for extracting effective bill of lading data from the original customs data;
the company name information extraction module is used for extracting company name information from the bill of lading data extracted by the bill of lading data extraction module;
the first judgment module is used for judging whether the company name information extracted by the company name information extraction module is valid company name information, and if so, triggering the first matching module and the second matching module to operate;
the first matching module is used for matching the area information in the company name information according to a preset rule, and if the matching is successful, deleting the area information from the company name information;
the second matching module is used for matching suffix information in the company name information according to a preset rule, and if the matching is successful, converting the suffix information into standard-format suffix information;
and the second judgment module is used for judging whether the database contains bill of lading data that can be merged with the bill of lading data processed by the first matching module and the second matching module, and if so, merging the data and storing the merged bill of lading data into the database.
In the above customs data cleaning and merging device, the second judgment module comprises: a data writing unit, a data sorting unit, a company name information extraction unit, a similarity calculation unit and a data merging unit;
the data writing unit is used for storing the bill of lading data processed by the first matching module and the second matching module into a database;
the data sorting unit is used for sorting all bill of lading data in the database according to company name information;
the company name information extraction unit is used for extracting, after the sorting is finished, company name information from the bill of lading data adjacent to the latest stored bill of lading data and company name information from the latest stored bill of lading data;
the similarity calculation unit is used for calculating the similarity between the company name information in the latest stored bill of lading data and the company name information in the adjacent bill of lading data;
and the data merging unit is used for merging the latest stored bill of lading data with the adjacent bill of lading data when the similarity calculated by the similarity calculation unit is greater than a threshold.
In the above customs data cleaning and merging device, the similarity calculation algorithm in the similarity calculation unit is a Levenshtein Distance algorithm, an NGram Distance algorithm or a Jaro-Winkler Distance algorithm.
In the above customs data cleaning and merging device, the first matching module matches the area information in the company name information according to the preset rule by regular-expression matching.
In a third aspect of the present invention, a customs data cleaning and merging terminal device is disclosed, which comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to the first aspect of the present invention when executing the computer program.
A fourth aspect of the invention discloses a computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of the method as disclosed in the first aspect of the invention.
Compared with the prior art, the invention has the following advantages:
1. The invention extracts effective bill of lading data from the original customs data and checks the company name information in it, thereby removing bill of lading data with invalid company name information and reducing the data volume.
2. The invention processes the company name information of the bill of lading data, deleting the area information and converting the suffix information into standard-format suffix information, so that the company name information has a uniform format and the trade name within the company name stands out, which improves the accuracy of the similarity calculation in the subsequent data merging.
3. The invention uses the similarity of the company name information to judge whether bill of lading data can be merged, so that the bill of lading data of companies in different areas with the same trade name can be merged, reducing the number of bill of lading records and allowing one record to reflect more customs trade information.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flow chart of the data cleaning and merging method of the present invention.
FIG. 2 is a flow chart of step six of the data cleaning and merging method of the present invention.
FIG. 3 is a block diagram of the data cleaning and merging device of the present invention.
FIG. 4 is a block diagram of the second judgment module of the data cleaning and merging device of the present invention.
Detailed Description
As shown in fig. 1, a method for cleaning and merging customs data includes the following steps:
step one, extracting effective bill of lading data from original customs data;
Effective data is extracted from the original customs data by using a preset data field comparison table to form complete effective bill of lading data; for example, the buyer can be found from the importer field, and the supplier can be found from the exporter field. Therefore, regardless of which country's customs the original data comes from, the bill of lading data finally extracted and generated is in a standard format, which facilitates later data merging.
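The field comparison table described above might be sketched as follows. This is only an illustration: the source names (`source_a`, `source_b`), the raw column names and the standard field names `buyer` and `supplier` are assumptions, not taken from the patent.

```python
# Hypothetical field comparison table: maps each customs source's raw column
# names to standard bill of lading fields. All names here are illustrative.
FIELD_MAP = {
    "source_a": {"IMPORTER": "buyer", "EXPORTER": "supplier"},
    "source_b": {"importer_name": "buyer", "exporter_name": "supplier"},
}


def extract_bill_of_lading(raw_record, source):
    """Return a standard-format record, or None if no mapped field has a value."""
    mapping = FIELD_MAP[source]
    record = {}
    for raw_field, std_field in mapping.items():
        value = raw_record.get(raw_field)
        # Keep only mapped fields that carry a non-blank value.
        if value is not None and str(value).strip():
            record[std_field] = str(value).strip()
    return record or None
```

Unmapped columns are simply dropped, so records from different sources end up with the same field names.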
Step two, extracting company name information in the bill of lading data;
Step three, judging whether the extracted company name information is valid company name information; if yes, entering step four, and if not, entering step seven;
A preset "invalid company library" is used to judge whether the extracted company name information is valid; for example, if the company name is "not available", the company is an invalid company.
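A minimal sketch of such a validity check, assuming the invalid company library is a set of placeholder strings. Only "not available" comes from the text; the other entries and the normalization rule are assumptions.

```python
# Hypothetical "invalid company library". Only "not available" appears in the
# description; the remaining entries are illustrative assumptions.
INVALID_COMPANY_NAMES = {"not available", "n/a", "unknown", ""}


def is_valid_company_name(name):
    """Return False when the name, lowercased and whitespace-collapsed, is in the library."""
    normalized = " ".join(name.lower().split())
    return normalized not in INVALID_COMPANY_NAMES
```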
Step four, applying regular-expression matching to the area information in the company name information; if the matching is successful, deleting the area information from the company name information and then entering step five; if the matching fails, directly entering step five;
The area information is matched because most company names contain area information, for example "Shang Hai XXX Co., Ltd."; after regular-expression matching, "Shang Hai" is identified as the company's area information. Deleting the area information facilitates the later similarity calculation: a large group company may set up branches or subsidiaries in different areas, and the trade names of these branches or subsidiaries are mostly similar or identical, so deleting the area information makes it easier to merge the bill of lading data of branches or subsidiaries that belong to the same group company.
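Step four could look roughly like the sketch below. The patent does not give its preset rule, so the region list and the pattern are assumptions chosen to reproduce the "Shang Hai" example.

```python
import re

# Hypothetical region list; a real rule set would be much larger.
REGIONS = ["Shang Hai", "Shanghai", "Bei Jing", "Beijing"]

# One alternation over all region names, matched on word boundaries.
REGION_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(region) for region in REGIONS) + r")\b",
    re.IGNORECASE,
)


def remove_region(name):
    """Delete any matched region name and collapse the leftover whitespace."""
    stripped = REGION_PATTERN.sub(" ", name)
    return " ".join(stripped.split())
```

If nothing matches, the name is returned unchanged apart from whitespace normalization, which corresponds to entering step five directly.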
Step five, applying regular-expression matching to the suffix information in the company name information; if the matching is successful, converting the suffix information into standard-format suffix information and then entering step six; if the matching fails, directly entering step six;
Because some company name suffixes are written in full, such as "xxx Company Limited", "xxx Company Limited" is converted into "xxx co.
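The suffix conversion in step five can be sketched with a small rule table. The standard form "Co., Ltd." is an assumption borrowed from the earlier "Shang Hai XXX Co., Ltd." example, and both patterns are illustrative rather than the patent's actual preset rules.

```python
import re

# Hypothetical suffix rules: (pattern, standard-format suffix).
# "Co., Ltd." as the standard form is an assumption.
SUFFIX_RULES = [
    (re.compile(r"\bCompany\s+Limited\s*$", re.IGNORECASE), "Co., Ltd."),
    (re.compile(r"\bCo\.?,?\s*Ltd\.?\s*$", re.IGNORECASE), "Co., Ltd."),
]


def normalize_suffix(name):
    """Rewrite the first matching trailing suffix; return the name unchanged otherwise."""
    for pattern, standard in SUFFIX_RULES:
        if pattern.search(name):
            return pattern.sub(standard, name)
    return name
```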
Step six, judging whether the database contains bill of lading data that can be merged with the bill of lading data processed in step five, and if so, merging the data;
Merging bill of lading data reduces the number of bill of lading records and centralizes information, so that when a user finds the bill of lading data of a company, the user can see that company's trade information at every customs without querying each customs separately.
Step seven, extracting the next effective bill of lading data from the original customs data, and entering step two.
Steps two to six are repeated many times, so that the processed bill of lading data, with uniform format and centralized information, is stored in the database.
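The control flow of steps one to seven can be sketched as a single loop. The function name, the `company_name` key and the idea of passing each step in as a callable are assumptions made to keep the sketch self-contained; the patent does not prescribe this structure.

```python
def clean_and_merge(raw_records, extract, is_valid, remove_region,
                    normalize_suffix, merge_into):
    """Run steps one to seven over an iterable of raw customs records.

    Each step is supplied as a callable (a hypothetical interface).
    Returns the number of records that reached step six.
    """
    processed = 0
    for raw in raw_records:                # step seven: move on to the next record
        bill = extract(raw)                # step one: effective bill of lading data
        if bill is None:
            continue
        name = bill["company_name"]        # step two: company name information
        if not is_valid(name):             # step three: skip invalid names
            continue
        name = remove_region(name)         # step four: delete area information
        name = normalize_suffix(name)      # step five: standard-format suffix
        bill["company_name"] = name
        merge_into(bill)                   # step six: store and merge if possible
        processed += 1
    return processed
```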
As shown in fig. 2, in this embodiment, judging in step six whether the database contains bill of lading data that can be merged with the bill of lading data processed in step five comprises:
Step 601, directly storing the bill of lading data processed in step five into the database;
Regardless of whether the database contains bill of lading data that can be merged with it, the bill of lading data processed in step five must be stored in the database; it is therefore stored first, and whether a merging operation is needed is judged afterwards.
Step 602, sorting all bill of lading data in the database according to company name information;
Because the subsequent similarity calculation is based mainly on the company name information, all bill of lading data in the database is first sorted by company name information. After sorting, the bill of lading data most likely to be mergeable with the latest stored bill of lading data can be preliminarily identified as the bill of lading data adjacent to it.
Step 603, after the sorting is finished, extracting company name information from the bill of lading data adjacent to the latest stored bill of lading data, and extracting company name information from the latest stored bill of lading data;
Step 604, calculating the similarity between the company name information in the latest stored bill of lading data and the company name information in the adjacent bill of lading data; if the similarity is greater than the threshold, merging the latest stored bill of lading data with the adjacent bill of lading data. The threshold may be 80% to 90%.
After the sorting is finished, if the latest stored bill of lading data is ranked first, its company name information is compared for similarity with that of the bill of lading data immediately after it; if the latest stored bill of lading data is ranked last, its company name information is compared for similarity with that of the bill of lading data immediately before it;
if the latest stored bill of lading data is ranked neither first nor last, its company name information is first compared for similarity with that of the bill of lading data immediately before it, and if the similarity is greater than the threshold, the data is merged; if the similarity is not greater than the threshold, its company name information is then compared for similarity with that of the bill of lading data immediately after it.
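Steps 601 to 604, including the neighbour order described above, might be sketched as follows. Several things here are assumptions: the database is modelled as an in-memory list kept sorted by company name, each record carries a hypothetical `shipments` list, merging means appending the new record's shipments to the neighbour, and the default similarity uses `difflib.SequenceMatcher` as a stand-in rather than one of the three algorithms the patent names.

```python
import bisect
from difflib import SequenceMatcher

THRESHOLD = 0.85  # assumption: a value inside the 80% to 90% range


def default_similarity(a, b):
    # Stand-in measure; not one of the three algorithms named in the patent.
    return SequenceMatcher(None, a, b).ratio()


def store_and_merge(db, record, similarity=default_similarity, threshold=THRESHOLD):
    """db is a list of records kept sorted by 'company_name'.

    Returns the record that holds the data after the operation.
    """
    names = [r["company_name"] for r in db]
    pos = bisect.bisect_left(names, record["company_name"])
    db.insert(pos, record)  # steps 601 and 602: store, keeping the list sorted
    # Steps 603 and 604: try the previous neighbour first, then the next one.
    for neighbor_pos in (pos - 1, pos + 1):
        if 0 <= neighbor_pos < len(db):
            neighbor = db[neighbor_pos]
            if similarity(record["company_name"], neighbor["company_name"]) > threshold:
                neighbor["shipments"].extend(record["shipments"])  # assumed merge rule
                del db[pos]
                return neighbor
    return record
```

Inserting at the sorted position replaces a full re-sort of the database; the resulting order is the same as sorting after each insertion.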
In this embodiment, the similarity calculation in step 604 is implemented by a Levenshtein Distance algorithm, an NGram Distance algorithm, or a Jaro-Winkler Distance algorithm.
The Levenshtein Distance algorithm, the NGram Distance algorithm and the Jaro-Winkler Distance algorithm are conventional algorithms and are not described in detail herein.
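For reference, a compact sketch of the first of these algorithms. The distance function is the standard dynamic-programming formulation; converting the distance to a similarity by dividing by the longer string's length is a common convention and an assumption here, since the patent does not say how its percentage is derived.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

Under this normalization, a threshold of 80% to 90% corresponds to allowing edits on roughly one character in five to one in ten of the longer name.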
As shown in fig. 3, a customs data cleaning and merging device includes a bill of lading data extraction module 1, a company name information extraction module 2, a first judgment module 3, a first matching module 4, a second matching module 5, and a second judgment module 6;
the bill of lading data extraction module 1 is used for extracting effective bill of lading data from the original customs data;
the bill of lading data extraction module 1 extracts effective data from the original customs data by using a preset data field comparison table to form complete effective bill of lading data; for example, the buyer can be found from the importer field, and the supplier can be found from the exporter field. Therefore, regardless of which country's customs the original data comes from, the bill of lading data finally extracted and generated is in a standard format, which facilitates later data merging.
The company name information extraction module 2 is used for extracting company name information from the bill of lading data extracted by the bill of lading data extraction module 1;
the first judgment module 3 is used for judging whether the company name information extracted by the company name information extraction module 2 is valid company name information, and if so, triggering the first matching module 4 and the second matching module 5 to operate;
the first judgment module 3 uses a preset "invalid company library" to judge whether the extracted company name information is valid; for example, if the company name is "not available", the company is an invalid company.
The first matching module 4 is configured to apply regular-expression matching to the area information in the company name information, and to delete the area information from the company name information if the matching is successful;
The area information is matched because most company names contain area information, for example "Shang Hai XXX Co., Ltd."; after regular-expression matching, "Shang Hai" is identified as the company's area information. Deleting the area information facilitates the later similarity calculation: a large group company may set up branches or subsidiaries in different areas, and the trade names of these branches or subsidiaries are mostly similar or identical, so deleting the area information makes it easier to merge the bill of lading data of branches or subsidiaries that belong to the same group company.
The second matching module 5 is configured to apply regular-expression matching to the suffix information in the company name information, and to convert the suffix information into standard-format suffix information if the matching is successful;
Because some company name suffixes are written in full, such as "xxx Company Limited", "xxx Company Limited" is converted into "xxx co.
The second judgment module 6 is configured to judge whether the database contains bill of lading data that can be merged with the bill of lading data processed by the first matching module 4 and the second matching module 5, and if so, to merge the data.
Merging bill of lading data reduces the number of bill of lading records and centralizes information, so that when a user finds the bill of lading data of a company, the user can see that company's trade information at every customs without querying each customs separately.
As shown in fig. 4, in this embodiment, the second judgment module 6 includes: a data writing unit 61, a data sorting unit 62, a company name information extraction unit 63, a similarity calculation unit 64, and a data merging unit 65.
The data writing unit 61 is used for storing the bill of lading data processed by the first matching module 4 and the second matching module 5 into a database;
the data sorting unit 62 is used for sorting all bill of lading data in the database according to company name information;
the data sorting unit 62 performs the sorting once each time new bill of lading data is stored in the database.
The company name information extraction unit 63 is configured to extract, after the sorting is completed, company name information from the bill of lading data adjacent to the latest stored bill of lading data and company name information from the latest stored bill of lading data;
after the data sorting unit 62 has sorted the data, the latest stored bill of lading data obtains its own position in the sequence. Because the subsequent similarity calculation is based on the company name information, all bill of lading data in the database is sorted by company name information, and after sorting, the bill of lading data most likely to be mergeable with the latest stored bill of lading data can be preliminarily identified as the bill of lading data adjacent to it.
The similarity calculation unit 64 is used for calculating the similarity between the company name information in the latest stored bill of lading data and the company name information in the adjacent bill of lading data;
the data merging unit 65 is used for merging the latest stored bill of lading data with the adjacent bill of lading data when the similarity calculated by the similarity calculation unit 64 is greater than the threshold.
After the sorting is finished, if the latest stored bill of lading data is ranked first, its company name information is compared for similarity with that of the bill of lading data immediately after it; if the latest stored bill of lading data is ranked last, its company name information is compared for similarity with that of the bill of lading data immediately before it;
if the latest stored bill of lading data is ranked neither first nor last, its company name information is first compared for similarity with that of the bill of lading data immediately before it, and if the similarity is greater than the threshold, the data is merged; if the similarity is not greater than the threshold, its company name information is then compared for similarity with that of the bill of lading data immediately after it.
In this embodiment, the similarity calculation algorithm in the similarity calculation unit 64 is a Levenshtein Distance algorithm, an NGram Distance algorithm, or a Jaro-Winkler Distance algorithm.
A customs data cleaning and merging terminal device comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, wherein the processor realizes the customs data cleaning and merging method steps when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the above-mentioned method steps of customs data cleansing and merging.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and all simple modifications, changes and equivalent structural changes made to the above embodiment according to the technical spirit of the present invention still fall within the protection scope of the technical solution of the present invention.

Claims (10)

1. A method for cleaning and merging customs data is characterized by comprising the following steps:
step one, extracting effective bill of lading data from original customs data;
step two, extracting company name information in the bill of lading data;
step three, judging whether the extracted company name information is valid company name information; if yes, entering step four, and if not, entering step seven;
step four, matching the area information in the company name information according to a preset rule; if the matching is successful, deleting the area information from the company name information and then entering step five; if the matching fails, directly entering step five;
step five, matching suffix information in the company name information according to a preset rule; if the matching is successful, converting the suffix information into standard-format suffix information and then entering step six; if the matching fails, directly entering step six;
step six, judging whether the database contains bill of lading data that can be merged with the bill of lading data processed in step five, and if so, merging the data;
and step seven, extracting the next effective bill of lading data from the original customs data, and entering step two.
2. The method for cleaning and merging customs data according to claim 1, wherein the judging in step six of whether the database contains bill of lading data that can be merged with the bill of lading data processed in step five comprises:
step 601, directly storing the bill of lading data processed in step five into the database;
step 602, sorting all bill of lading data in the database by company name information;
step 603, after the sorting is finished, extracting the company name information from the bill of lading data adjacent to the most recently stored bill of lading data, and extracting the company name information from the most recently stored bill of lading data;
step 604, calculating the similarity between the company name information in the most recently stored bill of lading data and the company name information in the adjacent bill of lading data; and if the similarity is greater than a threshold, merging the most recently stored bill of lading data with the adjacent bill of lading data.
3. The method for cleaning and merging customs data according to claim 2, wherein the similarity calculation in step 604 is performed by a Levenshtein distance algorithm, an N-gram distance algorithm, or a Jaro-Winkler distance algorithm.
4. The method for cleaning and merging customs data according to claim 1, 2 or 3, wherein in step four the region information in the company name information is matched according to the preset rule by regular-expression matching.
5. A customs data cleaning and merging device, characterized by comprising a bill of lading data extraction module, a company name information extraction module, a first judging module, a first matching module, a second matching module, and a second judging module;
the bill of lading data extraction module is used for extracting valid bill of lading data from the original customs data;
the company name information extraction module is used for extracting company name information from the bill of lading data extracted by the bill of lading data extraction module;
the first judging module is used for judging whether the company name information extracted by the company name information extraction module is valid company name information, and if so, triggering the first matching module and the second matching module to operate;
the first matching module is used for matching region information in the company name information according to a preset rule, and if the matching succeeds, deleting the region information from the company name information;
the second matching module is used for matching suffix information in the company name information according to a preset rule, and if the matching succeeds, converting the suffix information into standard-format suffix information;
and the second judging module is used for judging whether the database contains bill of lading data that can be merged with the bill of lading data processed by the first matching module and the second matching module, and if so, merging the data.
6. The customs data cleaning and merging device according to claim 5, wherein the second judging module comprises a data writing unit, a data sorting unit, a company name information extraction unit, a similarity calculation unit, and a data merging unit;
the data writing unit is used for storing the bill of lading data processed by the first matching module and the second matching module into a database;
the data sorting unit is used for sorting all bill of lading data in the database by company name information;
the company name information extraction unit is used for extracting, after the sorting is finished, the company name information from the bill of lading data adjacent to the most recently stored bill of lading data and the company name information from the most recently stored bill of lading data;
the similarity calculation unit is used for calculating the similarity between the company name information in the most recently stored bill of lading data and the company name information in the adjacent bill of lading data;
and the data merging unit is used for merging the most recently stored bill of lading data with the adjacent bill of lading data when the similarity calculated by the similarity calculation unit is greater than a threshold.
7. The customs data cleaning and merging device according to claim 6, wherein the similarity calculation algorithm in the similarity calculation unit is a Levenshtein distance algorithm, an N-gram distance algorithm, or a Jaro-Winkler distance algorithm.
8. The customs data cleaning and merging device according to claim 5, 6 or 7, wherein the first matching module matches the region information in the company name information according to the preset rule by regular-expression matching.
9. A customs data cleansing and merging terminal device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, wherein said processor when executing said computer program implements the steps of the method according to any of claims 1-4.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
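
The store, sort, compare-with-neighbors, and merge sequence recited in claims 2 and 6 can be illustrated with the following Python sketch. It is not the claimed implementation: `BillStore` is a hypothetical in-memory stand-in for the database, the 0.9 threshold is an assumed value, and `difflib.SequenceMatcher.ratio()` from the Python standard library is swapped in for the Levenshtein, N-gram, or Jaro-Winkler algorithms named in the claims.

```python
import bisect
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def ratio(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not one of the algorithms in the claims."""
    return SequenceMatcher(None, a, b).ratio()


class BillStore:
    """Hypothetical in-memory database kept sorted by company name."""

    def __init__(self, similarity: Callable[[str, str], float] = ratio,
                 threshold: float = 0.9) -> None:
        self.similarity = similarity
        self.threshold = threshold  # assumed value
        self.names: List[str] = []  # company names in sorted order
        self.records: Dict[str, List[dict]] = {}  # name -> bills of lading

    def add(self, name: str, bill: dict) -> str:
        """Store a bill and return the company name it ends up under."""
        if name in self.records:  # identical name: trivially mergeable
            self.records[name].append(bill)
            return name

        # Store the new record and keep the names sorted.
        idx = bisect.bisect_left(self.names, name)
        self.names.insert(idx, name)
        self.records[name] = [bill]

        # Compare only with the records adjacent in sorted order.
        for n_idx in (idx - 1, idx + 1):
            if 0 <= n_idx < len(self.names):
                neighbor = self.names[n_idx]
                if self.similarity(name, neighbor) > self.threshold:
                    # Merge the new record into the similar neighbor.
                    self.records[neighbor].extend(self.records.pop(name))
                    self.names.pop(idx)
                    return neighbor
        return name


store = BillStore()
store.add("ACME TRADING CO., LTD.", {"id": 1})
store.add("ZENITH LOGISTICS INC.", {"id": 2})
merged_into = store.add("ACME TRADNG CO., LTD.", {"id": 3})  # misspelled name
```

Because the names are sorted, a new record is compared with at most two neighbors rather than with every stored record. The trade-off is that a variant whose difference falls in the first few characters may not sort next to its match and would then be left unmerged.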
CN201911057701.8A 2019-10-31 2019-10-31 Customs data cleaning and merging method, device, equipment and medium Active CN110795425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911057701.8A CN110795425B (en) 2019-10-31 2019-10-31 Customs data cleaning and merging method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN110795425A (en) 2020-02-14
CN110795425B (en) 2023-04-28

Family

ID=69442388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911057701.8A Active CN110795425B (en) 2019-10-31 2019-10-31 Customs data cleaning and merging method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN110795425B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006098529A1 (en) * 2005-03-16 2006-09-21 Joo Seok Kim A method of trade-related data exchanging and service providing among different kinds of systems
CN107066599A (en) * 2017-04-20 2017-08-18 北京文因互联科技有限公司 Method and system for searching and classifying enterprises similar to listed companies based on knowledge-base reasoning
CN109858538A (en) * 2019-01-24 2019-06-07 科大国创软件股份有限公司 Customs classification error detection method based on association rules

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
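刘健; 李业伟: "数据库关联匹配法在服贸海运费核查中的应用" (Application of the database association matching method in the verification of service-trade ocean freight charges) *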

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111429068A (en) * 2020-03-31 2020-07-17 天津市商务局(天津市人民政府口岸服务办公室) Data supervision method, device and system
CN111429068B (en) * 2020-03-31 2021-04-23 天津市商务局(天津市人民政府口岸服务办公室) Data supervision method, device and system

Also Published As

Publication number Publication date
CN110795425B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN110781246A (en) Enterprise association relationship construction method and system
CN109635276B (en) Information matching method and terminal
CN101751475B (en) Method for compressing section records and device therefor
CN101986296A (en) Noise data cleaning method based on semantic ontology
CN108536657B (en) Method and system for processing similarity of artificially filled address texts
CN110413569A (en) Archives of paper quality electronization archiving method, device and terminal device
CN110990390A (en) Data cooperative processing method and device, computer equipment and storage medium
CN110888981A (en) Title-based document clustering method and device, terminal equipment and medium
CN113987190A (en) Data quality check rule extraction method and system
CN110895533B (en) Form mapping method and device, computer equipment and storage medium
CN110795425A (en) Method, device, equipment and medium for cleaning and merging customs data
US7458001B2 (en) Sequential pattern extracting apparatus
US20210019297A1 (en) Service data processing
CN112036150A (en) Electricity price policy term analysis method, storage medium and computer
CN111221967A (en) Language data classification storage system based on block chain architecture
CN113535739B (en) Data market layer table establishing method based on power grid energy data
CN105573984A (en) Socio-economic indicator identification method and device
CN115185933A (en) Multi-source manufacturing data preprocessing method for aerospace products
CN107391695A (en) A kind of information extracting method based on big data
CN113886420A (en) SQL statement generation method and device, electronic equipment and storage medium
CN112559775A (en) Patent information management method and system and computer equipment
CN111010331A (en) E-mail monitoring and summarizing method, system, terminal and storage medium
CN114115825B (en) Front-end and back-end data verification method compatible with software
CN112100161B (en) Data processing method and system, electronic device and storage medium
CN114331252A (en) Mobile phone number repairing method, express delivery distribution system and computer program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant