CN110795425B - Customs data cleaning and merging method, device, equipment and medium - Google Patents

Customs data cleaning and merging method, device, equipment and medium

Info

Publication number
CN110795425B
Authority
CN
China
Prior art keywords
data
bill
company name
name information
lading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911057701.8A
Other languages
Chinese (zh)
Other versions
CN110795425A (en)
Inventor
李超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yiyuan Network Technology Co ltd
Original Assignee
Shanghai Yiyuan Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yiyuan Network Technology Co ltd
Priority to CN201911057701.8A
Publication of CN110795425A
Application granted
Publication of CN110795425B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24564 Applying rules; Deductive queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Quality & Reliability (AREA)
  • Economics (AREA)
  • Operations Research (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method, a device, equipment and a medium for cleaning and merging customs data. The method comprises: extracting a piece of valid bill of lading data from original customs data; extracting the company name information in the bill of lading data; judging whether the extracted company name information is valid company name information; matching the region information in the company name information according to a preset rule and, if the match succeeds, deleting the region information from the company name information; matching the suffix information in the company name information according to a preset rule and, if the match succeeds, converting the suffix information into standard-format suffix information; and judging whether the database contains bill of lading data that can be merged with the bill of lading data produced by the suffix processing and, if so, merging the data. By extracting valid bill of lading data from the original customs data and then cleaning, processing and merging it, the invention generates bill of lading data with a uniform format and concentrated information, making it easier for users to find useful information.

Description

Customs data cleaning and merging method, device, equipment and medium
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a method, a device, equipment and a medium for cleaning and merging customs data.
Background
Customs data are the various import and export statistics generated when customs authorities perform their import and export trade statistics function. Mining these data in depth helps enterprises grasp market trends promptly, comprehensively and objectively, and analyze business conditions in overseas markets.
However, the original customs data has the following problems:
first, the volume of original customs data is large, so it is difficult for users to query useful information;
second, customs data covers many trading countries, so the data is complex;
third, customs data contains a large amount of junk information.
Users therefore have to organize and process the original customs data themselves, and finding useful information is difficult.
Disclosure of Invention
To address the above defects in the prior art, the invention provides a scheme for cleaning and merging customs data: valid bill of lading data is extracted from the original customs data and then cleaned, processed and merged, generating bill of lading data with a uniform format and concentrated information so that users can find useful information conveniently.
In order to solve the technical problems, the first aspect of the invention discloses a customs data cleaning and merging method, which comprises the following steps:
step one, extracting a piece of valid bill of lading data from the original customs data;
step two, extracting the company name information in the bill of lading data;
step three, judging whether the extracted company name information is valid company name information; if yes, proceeding to step four, and if not, proceeding to step seven;
step four, matching the region information in the company name information according to a preset rule; if the match succeeds, deleting the region information from the company name information and then proceeding to step five; if the match fails, proceeding directly to step five;
step five, matching the suffix information in the company name information according to a preset rule; if the match succeeds, converting the suffix information into standard-format suffix information and then proceeding to step six; if the match fails, proceeding directly to step six;
step six, judging whether the database contains bill of lading data that can be merged with the bill of lading data produced by step five, and if so, merging the data;
step seven, extracting the next piece of valid bill of lading data from the original customs data and returning to step two.
In the above method, judging in step six whether the database contains bill of lading data that can be merged with the bill of lading data produced by step five comprises:
step 601, storing the bill of lading data produced by step five directly into the database;
step 602, sorting all bill of lading data in the database by company name information;
step 603, after the sorting is completed, extracting the company name information in the newly stored bill of lading data and in the bill of lading data adjacent to it;
step 604, calculating the similarity between the company name information in the newly stored bill of lading data and the company name information in the adjacent bill of lading data; if the similarity is greater than a threshold, merging the newly stored bill of lading data with the adjacent bill of lading data.
In the above method for cleaning and merging customs data, the similarity calculation in step 604 is implemented by the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.
In the above method for cleaning and merging customs data, the matching of the region information in the company name information according to a preset rule in step four is implemented by regular-expression matching.
A second aspect of the invention discloses a customs data cleaning and merging device, which comprises a bill of lading data extraction module, a company name information extraction module, a first judging module, a first matching module, a second matching module and a second judging module;
the bill of lading data extraction module is used for extracting valid bill of lading data from the original customs data;
the company name information extraction module is used for extracting the company name information in the bill of lading data extracted by the bill of lading data extraction module;
the first judging module is used for judging whether the company name information extracted by the company name information extraction module is valid company name information, and if so, triggering the first matching module and the second matching module to operate;
the first matching module is used for matching the region information in the company name information according to a preset rule, and deleting the region information from the company name information if the match succeeds;
the second matching module is used for matching the suffix information in the company name information according to a preset rule, and converting the suffix information into standard-format suffix information if the match succeeds;
the second judging module is used for judging whether the database contains bill of lading data that can be merged with the bill of lading data processed by the first matching module and the second matching module; if so, the data is merged and the merged bill of lading data is stored in the database.
The second judging module comprises: a data writing unit, a data sorting unit, a company name information extraction unit, a similarity calculation unit and a data merging unit;
the data writing unit is used for storing the bill of lading data processed by the first matching module and the second matching module into the database;
the data sorting unit is used for sorting all bill of lading data in the database by company name information;
the company name information extraction unit is used for extracting, after the sorting is completed, the company name information in the newly stored bill of lading data and in the bill of lading data adjacent to it;
the similarity calculation unit is used for calculating the similarity between the company name information in the newly stored bill of lading data and the company name information in the adjacent bill of lading data;
the data merging unit is used for merging the newly stored bill of lading data with the adjacent bill of lading data when the similarity calculated by the similarity calculation unit is greater than a threshold.
The similarity calculation algorithm in the similarity calculation unit is the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.
In the above customs data cleaning and merging device, the first matching module matches the region information in the company name information according to the preset rule by means of regular-expression matching.
A third aspect of the invention discloses a terminal device for customs data cleaning and merging, comprising a memory, a processor and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the steps of the method according to the first aspect of the invention.
A fourth aspect of the invention discloses a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as disclosed in the first aspect of the invention.
Compared with the prior art, the invention has the following advantages:
1. The invention extracts valid bill of lading data from the original customs data, judges the company name information in it, and cleans out bill of lading data with invalid company name information, reducing the data volume.
2. The invention processes the company name information of the bill of lading data by deleting the region information and converting the suffix information into standard-format suffix information, so that company names have a uniform format and the trade name within each company name stands out, which improves the accuracy of the similarity calculation in the subsequent data merging.
3. The invention uses the similarity of company name information to judge whether bill of lading data can be merged, so that bill of lading data of companies that share a trade name but are located in different regions can be merged; this reduces the number of bills of lading and lets a single bill of lading reflect more customs trade information.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a flow chart of the data cleaning and merging method of the present invention.
FIG. 2 is a flow chart of step six of the data cleaning and merging method of the present invention.
FIG. 3 is a block diagram of the data cleaning and merging device of the present invention.
FIG. 4 is a block diagram of the second judging module in the data cleaning and merging device of the present invention.
Detailed Description
As shown in FIG. 1, a method for cleaning and merging customs data includes the following steps:
Step one, extracting a piece of valid bill of lading data from the original customs data;
A preset data field comparison table is used to extract the valid data from the original customs data and assemble a complete piece of valid bill of lading data; for example, the buyer can be found from the importer field and the vendor from the exporter field. Therefore, regardless of which country's customs the original data comes from, the extracted bill of lading data is in a standard format, which facilitates the later data merging.
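The field comparison table described above can be pictured as a per-source mapping from raw column names to standard field names. The following is a minimal sketch under stated assumptions: the source codes, the column names, and the helper `extract_bill` are all hypothetical and do not come from the patent; only the importer-to-buyer and exporter-to-vendor correspondence is taken from the text.

```python
from typing import Dict

# Hypothetical field comparison tables: raw column name -> standard field name.
# The source codes ("US", "IN") and the column names are illustrative assumptions.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "US": {"Importer": "buyer", "Exporter": "vendor", "Arrival Date": "date"},
    "IN": {"IMPORTER_NAME": "buyer", "EXPORTER_NAME": "vendor", "BE_DATE": "date"},
}


def extract_bill(raw: Dict[str, str], source: str) -> Dict[str, str]:
    """Map one raw customs row to a standard-format bill of lading record."""
    mapping = FIELD_MAPS[source]
    # Keep only mapped, non-empty fields; unmapped columns are dropped.
    return {
        std: raw[col].strip()
        for col, std in mapping.items()
        if raw.get(col, "").strip()
    }
```

Because every source is mapped onto the same standard field names, later steps can work on the `buyer` and `vendor` fields without knowing where a row came from.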
Step two, extracting the company name information in the bill of lading data;
Step three, judging whether the extracted company name information is valid company name information; if yes, proceeding to step four, and if not, proceeding to step seven;
A preset "invalid public library" is used to judge whether the extracted company name information is valid company name information; for example, if the company name is "not available", it is invalid.
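One simple way to picture the validity check is a lookup against a set of known placeholder values. This is only a sketch: the patent does not say how the library is stored or what it contains, so the set `INVALID_NAMES` and the function `is_valid_company_name` are hypothetical, and every entry other than "not available" is an assumption.

```python
# Hypothetical contents of the invalid library; only "not available" is
# suggested by the text, the remaining entries are illustrative assumptions.
INVALID_NAMES = {"", "not available", "n/a", "na", "unknown", "none"}


def is_valid_company_name(name: str) -> bool:
    """Return False when the name is a known placeholder rather than a company."""
    # Lower-case and collapse whitespace so " Not  Available " also matches.
    normalized = " ".join(name.lower().split())
    return normalized not in INVALID_NAMES
```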
Step four, applying regular-expression matching to the region information in the company name information; if the match succeeds, deleting the region information from the company name information and then proceeding to step five; if the match fails, proceeding directly to step five;
Region information is matched because most company names contain it; for example, after regular-expression matching, "Shang Hai" in a company name is identified as region information. The region information is deleted to facilitate the later similarity calculation: a large group company may set up branches or subsidiaries in different regions, and the trade names of these branches or subsidiaries are mostly similar or identical, so deleting the region information helps merge the bill of lading data of branches or subsidiaries that belong to the same group but are located in different regions.
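A regular expression built from a list of region names is one plausible form for the preset rule. The sketch below assumes a small hypothetical list `REGIONS` (only "Shang Hai" appears in the text) and a hypothetical helper `strip_region`; the optional parentheses handle the assumed case of a region written as "(Shanghai)".

```python
import re

# Hypothetical region list; "Shang Hai" is the example from the text,
# the other entries are illustrative assumptions.
REGIONS = ["Shang Hai", "Shanghai", "Bei Jing", "Beijing", "Shen Zhen", "Shenzhen"]

# Match any listed region as a whole word, optionally wrapped in parentheses.
REGION_RE = re.compile(
    r"\(?\b(?:" + "|".join(re.escape(r) for r in REGIONS) + r")\b\)?",
    re.IGNORECASE,
)


def strip_region(name: str) -> str:
    """Delete region information from a company name; unchanged if none matches."""
    stripped = REGION_RE.sub(" ", name)
    # Collapse the whitespace left behind by the removal.
    return " ".join(stripped.split())
```

With this rule, names that differ only in their region collapse to the same string, which is what makes the later similarity comparison effective.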
Step five, applying regular-expression matching to the suffix information in the company name information; if the match succeeds, converting the suffix information into standard-format suffix information and then proceeding to step six; if the match fails, proceeding directly to step six;
Some company names carry the suffix written out in full, such as "xxx Company Limited"; converting "xxx Company Limited" into "xxx co., ltd." standardizes the suffix information and makes company names more uniform.
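Suffix standardization can be sketched as an ordered list of pattern and replacement pairs. Only the "Company Limited" rule comes from the text; the other rules, the capitalized standard forms, and the names `SUFFIX_RULES` and `normalize_suffix` are assumptions made for illustration.

```python
import re

# Ordered (pattern, standard form) pairs. Only the "Company Limited" rule is
# taken from the text; the remaining rules are illustrative assumptions.
SUFFIX_RULES = [
    (re.compile(r"\bcompany\s+limited\.?\s*$", re.IGNORECASE), "Co., Ltd."),
    (re.compile(r"\bco\.?\s*,?\s*ltd\.?\s*$", re.IGNORECASE), "Co., Ltd."),
    (re.compile(r"\bincorporated\.?\s*$", re.IGNORECASE), "Inc."),
    (re.compile(r"\bcorporation\.?\s*$", re.IGNORECASE), "Corp."),
]


def normalize_suffix(name: str) -> str:
    """Rewrite a trailing company suffix into its standard form, if one matches."""
    name = name.strip()
    for pattern, standard in SUFFIX_RULES:
        if pattern.search(name):
            return pattern.sub(standard, name)
    # No rule matched: the name passes through unchanged.
    return name
```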
Step six, judging whether the database contains bill of lading data that can be merged with the bill of lading data produced by step five, and if so, merging the data;
Merging the bill of lading data that can be merged reduces the amount of bill of lading data and concentrates the information, so that a user who finds a company's bill of lading data can learn that company's trade information at every customs without querying each customs separately.
Step seven, extracting the next piece of valid bill of lading data from the original customs data and returning to step two.
Steps two to six are repeated many times, so that the processed bill of lading data, uniform in format and concentrated in information, is stored in the database.
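The control flow of steps one to seven can be summarized as a single loop. This is a sketch only: the record layout (a dict with a `company` key), the in-memory list standing in for the database, and the function `clean_and_merge` with its callable parameters are all assumptions; the individual checks and transformations are passed in so the loop itself stays independent of how they are implemented.

```python
from typing import Callable, Dict, Iterable, List

Bill = Dict[str, str]


def clean_and_merge(
    bills: Iterable[Bill],
    is_valid: Callable[[str], bool],
    strip_region: Callable[[str], str],
    normalize_suffix: Callable[[str], str],
    store_and_merge: Callable[[List[Bill], Bill], None],
) -> List[Bill]:
    """Run the cleaning steps over each bill; a list stands in for the database."""
    database: List[Bill] = []
    for bill in bills:                       # steps one and seven: next valid bill
        name = bill.get("company", "")       # step two: extract the company name
        if not is_valid(name):               # step three: skip invalid names
            continue
        name = strip_region(name)            # step four: delete region information
        name = normalize_suffix(name)        # step five: standardize the suffix
        # Step six: store the cleaned bill and merge it with a similar one if any.
        store_and_merge(database, {**bill, "company": name})
    return database
```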
In this embodiment, as shown in FIG. 2, judging in step six whether the database contains bill of lading data that can be merged with the bill of lading data produced by step five comprises:
Step 601, storing the bill of lading data produced by step five directly into the database;
The bill of lading data produced by step five is first stored in the database, and only then is it judged whether the database contains bill of lading data that can be merged with it.
Step 602, sorting all bill of lading data in the database by company name information;
Because the subsequent similarity calculation is based on company name information, all bill of lading data in the database is first sorted by company name information; after sorting, the bill of lading data most likely to be mergeable with the newly stored bill of lading data can be preliminarily identified as the bill of lading data adjacent to it.
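The effect of storing a record and then sorting can be illustrated with a list of names that is kept in sorted order. Note the substitution: the text describes re-sorting the whole database after each insertion, whereas this sketch uses the standard-library `bisect` module to insert directly at the sorted position, which yields the same ordering. The function `insert_sorted` is hypothetical, and names are assumed to be already normalized.

```python
import bisect
from typing import List, Tuple


def insert_sorted(names: List[str], new_name: str) -> Tuple[int, List[int]]:
    """Insert new_name into a sorted list; return its index and neighbor indices."""
    idx = bisect.bisect_left(names, new_name)
    names.insert(idx, new_name)
    # Adjacent records are the merge candidates; ends of the list have only one.
    neighbors = [i for i in (idx - 1, idx + 1) if 0 <= i < len(names)]
    return idx, neighbors
```

Sorting brings names with a common prefix next to each other, which is why only the adjacent records need to be compared.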
Step 603, after the sorting is completed, extracting the company name information in the newly stored bill of lading data and in the bill of lading data adjacent to it;
Step 604, calculating the similarity between the company name information in the newly stored bill of lading data and the company name information in the adjacent bill of lading data; if the similarity is greater than a threshold, merging the newly stored bill of lading data with the adjacent bill of lading data. The threshold may be 80% to 90%.
After the sorting is completed, if the newly stored bill of lading data is in the first position, its company name information is compared with that of the bill of lading data immediately after it; if the newly stored bill of lading data is in the last position, its company name information is compared with that of the bill of lading data immediately before it;
if the newly stored bill of lading data is in neither the first nor the last position, its company name information is first compared with that of the bill of lading data immediately before it, and the data is merged if the similarity is greater than the threshold; if the similarity is not greater than the threshold, its company name information is then compared with that of the bill of lading data immediately after it.
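The neighbor-selection rule above (previous record first, then the next one, with the first and last positions having only one neighbor) can be sketched as follows. The function `pick_merge_target` and the default threshold of 0.85 (chosen inside the 80% to 90% range from the text) are assumptions, and `default_similarity` uses `difflib.SequenceMatcher` from the standard library purely as a stand-in metric; it is not one of the algorithms named in the text.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional


def default_similarity(a: str, b: str) -> float:
    # Stand-in metric from the standard library, NOT one of the named algorithms.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_merge_target(
    names: List[str],
    idx: int,
    threshold: float = 0.85,
    similarity: Callable[[str, str], float] = default_similarity,
) -> Optional[int]:
    """Return the index of the neighbor to merge with, or None if none qualifies."""
    # Check the previous neighbor first, then the next one; out-of-range
    # indices cover the cases where the new record is first or last.
    for j in (idx - 1, idx + 1):
        if 0 <= j < len(names) and similarity(names[idx], names[j]) > threshold:
            return j
    return None
```

How the two records are then combined is not specified in the text, so the sketch stops at choosing the merge partner.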
In this embodiment, the similarity calculation in step 604 is implemented by the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.
The Levenshtein distance algorithm, the N-gram distance algorithm and the Jaro-Winkler distance algorithm are existing algorithms, and their principles are not described here in detail.
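For reference, the first of the named algorithms can be sketched in a few lines. The edit-distance function below is the standard dynamic-programming formulation; the conversion to a 0 to 1 similarity by dividing by the longer string's length is a common convention and an assumption here, since the text does not state how a distance is turned into a percentage.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Under this normalization, a threshold of 80% to 90% means that at most 10% to 20% of the characters of the longer name may differ.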
As shown in FIG. 3, a customs data cleaning and merging device comprises a bill of lading data extraction module 1, a company name information extraction module 2, a first judging module 3, a first matching module 4, a second matching module 5 and a second judging module 6;
the bill of lading data extraction module 1 is used for extracting valid bill of lading data from the original customs data;
The bill of lading data extraction module 1 uses a preset data field comparison table to extract the valid data from the original customs data and assemble a complete piece of valid bill of lading data; for example, the buyer can be found from the importer field and the vendor from the exporter field. Therefore, regardless of which country's customs the original data comes from, the extracted bill of lading data is in a standard format, which facilitates the later data merging.
The company name information extraction module 2 is used for extracting the company name information in the bill of lading data extracted by the bill of lading data extraction module 1;
the first judging module 3 is used for judging whether the company name information extracted by the company name information extraction module 2 is valid company name information, and if so, triggering the first matching module 4 and the second matching module 5 to operate;
The first judging module 3 uses a preset "invalid public library" to judge whether the extracted company name information is valid company name information; for example, if the company name is "not available", it is invalid.
The first matching module 4 is used for applying regular-expression matching to the region information in the company name information, and deleting the region information from the company name information if the match succeeds;
Region information is matched because most company names contain it; for example, after regular-expression matching, "Shang Hai" in a company name is identified as region information. The region information is deleted to facilitate the later similarity calculation: a large group company may set up branches or subsidiaries in different regions, and the trade names of these branches or subsidiaries are mostly similar or identical, so deleting the region information helps merge the bill of lading data of branches or subsidiaries that belong to the same group but are located in different regions.
The second matching module 5 is used for applying regular-expression matching to the suffix information in the company name information, and converting the suffix information into standard-format suffix information if the match succeeds;
Some company names carry the suffix written out in full, such as "xxx Company Limited"; converting "xxx Company Limited" into "xxx co., ltd." standardizes the suffix information and makes company names more uniform.
The second judging module 6 is used for judging whether the database contains bill of lading data that can be merged with the bill of lading data processed by the first matching module 4 and the second matching module 5, and merging the data if so.
Merging the bill of lading data that can be merged reduces the amount of bill of lading data and concentrates the information, so that a user who finds a company's bill of lading data can learn that company's trade information at every customs without querying each customs separately.
As shown in FIG. 4, in this embodiment, the second judging module 6 includes: a data writing unit 61, a data sorting unit 62, a company name information extraction unit 63, a similarity calculation unit 64 and a data merging unit 65.
The data writing unit 61 is used for storing the bill of lading data processed by the first matching module 4 and the second matching module 5 into the database;
the data sorting unit 62 is used for sorting all bill of lading data in the database by company name information;
The data sorting unit 62 sorts once each time new bill of lading data is stored in the database.
The company name information extraction unit 63 is used for extracting, after the sorting is completed, the company name information in the newly stored bill of lading data and in the bill of lading data adjacent to it;
After the data sorting unit 62 has sorted the data, the newly stored bill of lading data has its own position in the sequence. Because the subsequent similarity calculation is based on company name information, all bill of lading data in the database is sorted by company name information; after sorting, the bill of lading data most likely to be mergeable with the newly stored bill of lading data can be preliminarily identified as the bill of lading data adjacent to it.
The similarity calculation unit 64 is used for calculating the similarity between the company name information in the newly stored bill of lading data and the company name information in the adjacent bill of lading data;
the data merging unit 65 is used for merging the newly stored bill of lading data with the adjacent bill of lading data when the similarity calculated by the similarity calculation unit 64 is greater than a threshold.
After the sorting is completed, if the newly stored bill of lading data is in the first position, its company name information is compared with that of the bill of lading data immediately after it; if the newly stored bill of lading data is in the last position, its company name information is compared with that of the bill of lading data immediately before it;
if the newly stored bill of lading data is in neither the first nor the last position, its company name information is first compared with that of the bill of lading data immediately before it, and the data is merged if the similarity is greater than the threshold; if the similarity is not greater than the threshold, its company name information is then compared with that of the bill of lading data immediately after it.
In this embodiment, the similarity calculation algorithm in the similarity calculation unit 64 is the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.
A terminal device for customs data cleaning and merging comprises a memory, a processor and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the steps of the customs data cleaning and merging method described above.
A computer readable storage medium stores a computer program which, when executed by a processor, implements the steps of the customs data cleaning and merging method described above.
The foregoing description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and any simple modification, variation and equivalent structural changes made to the above embodiment according to the technical substance of the present invention still fall within the scope of the technical solution of the present invention.

Claims (10)

1. A method for cleaning and merging customs data, comprising the steps of:
step one, extracting a piece of valid bill of lading data from original customs data;
step two, extracting company name information from the bill of lading data;
step three, judging whether the extracted company name information is valid company name information; if so, proceeding to step four, and if not, proceeding to step seven;
step four, matching region information in the company name information according to a preset rule; if the matching succeeds, deleting the region information from the company name information and then proceeding to step five, and if the matching fails, proceeding directly to step five;
step five, matching suffix information in the company name information according to a preset rule; if the matching succeeds, converting the suffix information into suffix information in a standard format and then proceeding to step six, and if the matching fails, proceeding directly to step six;
step six, judging whether the database contains bill of lading data that can be merged with the bill of lading data resulting from step five, and if so, merging the data;
step seven, extracting the next piece of valid bill of lading data from the original customs data and returning to step two.
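As an illustration only, steps three to five of claim 1 might be sketched as follows. The claim does not disclose the preset rules, so the region pattern, the suffix table, the validity test, and all function names here are hypothetical assumptions and not the patented implementation.

```python
import re

# Hypothetical "preset rules"; the claim does not specify their content.
REGION_PATTERN = re.compile(r"\b(?:SHANGHAI|SHENZHEN|NINGBO|CHINA)\b", re.IGNORECASE)
SUFFIX_RULES = [
    (re.compile(r"\bCO\.?,?\s*LTD\.?$", re.IGNORECASE), "CO., LTD."),
    (re.compile(r"\bLIMITED$", re.IGNORECASE), "LTD."),
    (re.compile(r"\bINCORPORATED$", re.IGNORECASE), "INC."),
]


def is_valid_company_name(name):
    """Step three (assumed rule): non-empty and containing at least one letter."""
    return bool(name) and any(ch.isalpha() for ch in name)


def remove_region(name):
    """Step four: delete region words if the rule matches, else leave the name unchanged."""
    stripped = REGION_PATTERN.sub(" ", name)
    return " ".join(stripped.split())  # collapse the whitespace left behind


def normalize_suffix(name):
    """Step five: rewrite the first matching suffix into its standard form."""
    for pattern, standard in SUFFIX_RULES:
        if pattern.search(name):
            return pattern.sub(standard, name).strip()
    return name  # no rule matched; pass the name through unchanged


def clean_company_name(raw):
    """Return the cleaned name, or None when the record should be skipped (step seven)."""
    name = (raw or "").strip()
    if not is_valid_company_name(name):
        return None
    return normalize_suffix(remove_region(name))
```

Under these assumed rules, `clean_company_name("Shanghai Acme Trading Co Ltd")` yields `"Acme Trading CO., LTD."`, while a name with no letters returns `None` and the caller moves on to the next record.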
2. The method for cleaning and merging customs data according to claim 1, wherein judging in step six whether the database contains bill of lading data that can be merged with the bill of lading data resulting from step five comprises:
step 601, storing the bill of lading data resulting from step five directly into the database;
step 602, sorting all bill of lading data in the database by company name information;
step 603, after the sorting is completed, extracting the company name information from the bill of lading data adjacent to the newly stored bill of lading data, and extracting the company name information from the newly stored bill of lading data;
step 604, calculating the similarity between the company name information in the newly stored bill of lading data and the company name information in the adjacent bill of lading data; and if the similarity is greater than a threshold value, merging the newly stored bill of lading data with the adjacent bill of lading data.
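A minimal sketch of steps 601 to 604 follows, assuming an in-memory list stands in for the database and each record is a dict with hypothetical `company` and `bills` keys. The similarity measure is `difflib.SequenceMatcher` from the Python standard library, used purely as a stand-in; it is not one of the algorithms named in the claims, and the 0.9 threshold is likewise an assumption.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in similarity in [0, 1]; not one of the algorithms named in the claims.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def store_and_merge(database, record, threshold=0.9):
    """Store a record, sort by company name, and merge it into a similar neighbour."""
    database.append(record)                              # step 601: store directly
    database.sort(key=lambda r: r["company"].lower())    # step 602: sort by company name
    idx = next(i for i, r in enumerate(database) if r is record)
    for j in (idx - 1, idx + 1):                         # step 603: adjacent records
        if 0 <= j < len(database):
            neighbor = database[j]
            # step 604: merge when the similarity exceeds the threshold
            if similarity(record["company"], neighbor["company"]) > threshold:
                neighbor["bills"].extend(record["bills"])
                del database[idx]
                return neighbor
    return record  # no sufficiently similar neighbour; the record stays on its own
```

Because the records are sorted by name, near-duplicate names tend to land next to each other, so only the two neighbours of the new record are compared rather than every record in the database.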
3. The method for cleaning and merging customs data according to claim 2, wherein the similarity calculation in step 604 is implemented by the Levenshtein distance algorithm, the N-gram distance algorithm, or the Jaro-Winkler distance algorithm.
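For reference, the first of the named algorithms can be sketched as below. The edit-distance computation is the standard dynamic-programming formulation; converting the distance into a similarity score by dividing by the longer string's length is an assumption, since the claims only state that a similarity is compared with a threshold.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Assumed normalisation: 1.0 for identical strings, 0.0 for entirely different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

With this normalisation, two names that differ only by a trailing full stop score close to 1 and would exceed a threshold such as 0.85, which is again a hypothetical value.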
4. The method for cleaning and merging customs data according to claim 1, 2 or 3, wherein in step four, the matching of the region information in the company name information according to the preset rule is implemented by regular expression matching.
5. A customs data cleaning and merging device, characterized by comprising a bill of lading data extraction module, a company name information extraction module, a first judging module, a first matching module, a second matching module and a second judging module;
the bill of lading data extraction module is used for extracting valid bill of lading data from the original customs data;
the company name information extraction module is used for extracting company name information from the bill of lading data extracted by the bill of lading data extraction module;
the first judging module is used for judging whether the company name information extracted by the company name information extraction module is valid company name information, and if so, triggering the first matching module and the second matching module to operate;
the first matching module is used for matching region information in the company name information according to a preset rule, and deleting the region information from the company name information if the matching succeeds;
the second matching module is used for matching suffix information in the company name information according to a preset rule, and converting the suffix information into suffix information in a standard format if the matching succeeds;
the second judging module is used for judging whether the database contains bill of lading data that can be merged with the bill of lading data processed by the first matching module and the second matching module, and if so, merging the data.
6. The device for cleaning and merging customs data according to claim 5, wherein the second judging module comprises a data writing unit, a data sorting unit, a company name information extraction unit, a similarity calculation unit and a data merging unit;
the data writing unit is used for storing the bill of lading data processed by the first matching module and the second matching module into a database;
the data sorting unit is used for sorting all bill of lading data in the database by company name information;
the company name information extraction unit is used for extracting, after the sorting is completed, the company name information from the bill of lading data adjacent to the newly stored bill of lading data, and extracting the company name information from the newly stored bill of lading data;
the similarity calculation unit is used for calculating the similarity between the company name information in the newly stored bill of lading data and the company name information in the adjacent bill of lading data;
the data merging unit is used for merging the newly stored bill of lading data with the adjacent bill of lading data when the similarity calculated by the similarity calculation unit is greater than a threshold value.
7. The device for cleaning and merging customs data according to claim 6, wherein the similarity calculation algorithm in the similarity calculation unit is the Levenshtein distance algorithm, the N-gram distance algorithm, or the Jaro-Winkler distance algorithm.
8. The device for cleaning and merging customs data according to claim 5, 6 or 7, wherein the first matching module matches the region information in the company name information according to the preset rule by regular expression matching.
9. A customs data cleaning and merging terminal comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 4.
CN201911057701.8A 2019-10-31 2019-10-31 Customs data cleaning and merging method, device, equipment and medium Active CN110795425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911057701.8A CN110795425B (en) 2019-10-31 2019-10-31 Customs data cleaning and merging method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911057701.8A CN110795425B (en) 2019-10-31 2019-10-31 Customs data cleaning and merging method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN110795425A CN110795425A (en) 2020-02-14
CN110795425B true CN110795425B (en) 2023-04-28

Family

ID=69442388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911057701.8A Active CN110795425B (en) 2019-10-31 2019-10-31 Customs data cleaning and merging method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN110795425B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111429068B (en) * 2020-03-31 2021-04-23 天津市商务局(天津市人民政府口岸服务办公室) Data supervision method, device and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006098529A1 (en) * 2005-03-16 2006-09-21 Joo Seok Kim A method of trade-related data exchanging and service providing among different kinds of systems
CN107066599A (en) * 2017-04-20 2017-08-18 北京文因互联科技有限公司 A kind of similar enterprise of the listed company searching classification method and system of knowledge based storehouse reasoning
CN109858538A (en) * 2019-01-24 2019-06-07 科大国创软件股份有限公司 A kind of customs's classification error-detecting method based on correlation rule

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
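Liu Jian; Li Yewei. Application of the database association matching method in the verification of service-trade ocean freight charges. China Forex. 2015, (11), full text. *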

Also Published As

Publication number Publication date
CN110795425A (en) 2020-02-14

Similar Documents

Publication Publication Date Title
US7324998B2 (en) Document search methods and systems
US10705748B2 (en) Method and device for file name identification and file cleaning
CN110781246A (en) Enterprise association relationship construction method and system
CN109656999B (en) Method, device, storage medium and apparatus for synchronizing large data volume data
CN105187242B (en) A kind of user's anomaly detection method excavated based on variable-length pattern
CN110413569A (en) Archives of paper quality electronization archiving method, device and terminal device
CN101751475B (en) Method for compressing section records and device therefor
CN110888981A (en) Title-based document clustering method and device, terminal equipment and medium
CN110795425B (en) Customs data cleaning and merging method, device, equipment and medium
CN110825817B (en) Enterprise suspected association judgment method and system
CN110704432A (en) Data index establishing method and device, readable storage medium and electronic equipment
CN116032741A (en) Equipment identification method and device, electronic equipment and computer storage medium
CN107590233A (en) A kind of file management method and device
CN113821630A (en) Data clustering method and device
CN110705297A (en) Enterprise name-identifying method, system, medium and equipment
CN113486157B (en) Method for decrypting encrypted mobile phone number
CN109408727B (en) Intelligent user attention information recommendation method and system based on multidimensional perception data
JPH05257982A (en) Character string recognizing method
CN112559775A (en) Patent information management method and system and computer equipment
CN109034938B (en) Information rapid screening and matching method and device, electronic equipment and storage medium
JP2018163405A (en) Automated sorting device for transaction detail, automated sorting method and program for automated sorting
CN109783607A (en) A method of the match cognization magnanimity keyword in any text
CN111325025B (en) Shop name mining method and device
CN114064621B (en) Method for judging repeated data
CN112882887B (en) Dynamic establishment method for service fault model in cloud computing environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant