CN110795425B - Customs data cleaning and merging method, device, equipment and medium

Info

Publication number: CN110795425B
Application number: CN201911057701.8A
Authority: CN
Inventors: 李超
Original assignee: Shanghai Yiyuan Network Technology Co ltd
Current assignee: Shanghai Yiyuan Network Technology Co ltd
Priority date: 2019-10-31
Filing date: 2019-10-31
Publication date: 2023-04-28
Anticipated expiration: 2039-10-31
Also published as: CN110795425A

Abstract

The invention discloses a method, a device, equipment and a medium for cleaning and merging customs data. The method comprises: extracting one piece of effective bill of lading data from original customs data; extracting company name information from the bill of lading data; judging whether the extracted company name information is valid company name information; matching the regional information in the company name information according to a preset rule and, if the match succeeds, deleting the regional information; matching the suffix information in the company name information according to a preset rule and, if the match succeeds, converting the suffix information into standard-format suffix information; and judging whether the database contains bill of lading data that can be merged with the bill of lading data processed by the preceding steps and, if so, merging the data. By extracting effective bill of lading data from the original customs data and cleaning, processing and merging it, the invention generates bill of lading data with a uniform format and concentrated information, making it easier for users to find useful information.

Description

Customs data cleaning and merging method, device, equipment and medium

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a method, a device, equipment and a medium for cleaning and merging customs data.

Background

Customs data are the import and export statistics generated as customs authorities perform their trade statistics function. Mining these data in depth helps enterprises grasp market trends in a timely, comprehensive and objective manner and analyze business conditions in overseas markets.

However, the original customs data has the following problems:

firstly, the original customs data is large in quantity, so that the difficulty of inquiring useful information by a user is high;

secondly, the number of trade countries in customs data is large, so that the data is complex;

thirdly, the customs data has more junk information.

Users therefore have to sort and process the original customs data themselves, and finding useful information is difficult.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a scheme for cleaning and merging customs data: effective bill of lading data is extracted from the original customs data and then cleaned, processed and merged, producing bill of lading data with a uniform format and concentrated information so that users can find useful information easily.

In order to solve the technical problems, the first aspect of the invention discloses a customs data cleaning and merging method, which comprises the following steps:

step one, extracting one piece of effective bill of lading data from the original customs data;

step two, extracting company name information from the bill of lading data;

step three, judging whether the extracted company name information is valid company name information; if so, proceeding to step four, and if not, proceeding to step seven;

step four, matching the regional information in the company name information according to a preset rule; if the match succeeds, deleting the regional information from the company name information and then proceeding to step five, and if the match fails, proceeding directly to step five;

step five, matching the suffix information in the company name information according to a preset rule; if the match succeeds, converting the suffix information into standard-format suffix information and then proceeding to step six, and if the match fails, proceeding directly to step six;

step six, judging whether the database contains bill of lading data that can be merged with the bill of lading data that has completed step five, and if so, merging the data;

step seven, extracting the next effective bill of lading data from the original customs data, and entering step two.
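The control flow of steps one to seven can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the patented implementation: the function name `clean_and_merge`, the `company_name` key and the five callables passed in are hypothetical placeholders for the operations that each step names.

```python
from typing import Callable, Iterable, Optional


def clean_and_merge(
    raw_rows: Iterable[dict],
    extract_bill: Callable[[dict], Optional[dict]],  # step one (hypothetical)
    is_valid_name: Callable[[str], bool],            # step three (hypothetical)
    strip_region: Callable[[str], str],              # step four (hypothetical)
    normalize_suffix: Callable[[str], str],          # step five (hypothetical)
    store_and_merge: Callable[[dict], None],         # step six (hypothetical)
) -> int:
    """Run steps one to seven; return how many bills reached step six."""
    processed = 0
    for row in raw_rows:                     # step seven: take the next bill
        bill = extract_bill(row)             # step one
        if bill is None:
            continue
        name = bill.get("company_name", "")  # step two
        if not is_valid_name(name):          # step three: invalid, go to step seven
            continue
        name = strip_region(name)            # step four: unchanged if no match
        name = normalize_suffix(name)        # step five: unchanged if no match
        bill["company_name"] = name
        store_and_merge(bill)                # step six
        processed += 1
    return processed
```

Passing the steps in as callables keeps the loop independent of how each rule is actually configured.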

In the above method, judging in step six whether the database contains bill of lading data that can be merged with the bill of lading data that has completed step five comprises:

step 601, storing the bill of lading data that has completed step five directly into the database;

step 602, sorting all bill of lading data in the database according to company name information;

step 603, after the sorting is completed, extracting the company name information in the bill of lading data adjacent to the newly stored bill of lading data, and extracting the company name information in the newly stored bill of lading data;

step 604, calculating the similarity between the company name information in the newly stored bill of lading data and the company name information in the adjacent bill of lading data; and if the similarity is greater than a threshold value, merging the newly stored bill of lading data with the adjacent bill of lading data.

In the above method for cleaning and merging customs data, the similarity calculation in step 604 is implemented by the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.

In the above method for cleaning and merging customs data, the matching of the regional information in the company name information according to a preset rule in step four is implemented by regular-expression matching.

A second aspect of the invention discloses a customs data cleaning and merging device, which comprises a bill of lading data extraction module, a company name information extraction module, a first judgment module, a first matching module, a second matching module and a second judgment module;

the bill of lading data extraction module is used for extracting effective bill of lading data from the original customs data;

the company name information extraction module is used for extracting company name information from the bill of lading data extracted by the bill of lading data extraction module;

the first judging module is used for judging whether the company name information extracted by the company name information extracting module is effective company name information, and if so, triggering the first matching module to operate and triggering the second matching module to operate;

the first matching module is used for matching the regional information in the company name information according to a preset rule, and deleting the regional information in the company name information if the matching is successful;

the second matching module is used for matching the suffix information in the company name information according to a preset rule, and if the matching is successful, the suffix information is converted into standard format suffix information;

the second judging module is used for judging whether the bill of lading data which can be combined with the bill of lading data processed by the first matching module and the second matching module exists in the database, if so, the data are combined, and the combined bill of lading data are stored in the database.

The second judging module comprises: the system comprises a data writing unit, a data sorting unit, a company name information extraction unit, a similarity calculation unit and a data merging unit;

the data writing unit is used for storing the bill of lading data processed by the first matching module and the second matching module into a database;

the data sorting unit is used for sorting all bill of lading data in the database according to company name information;

the company name information extraction unit is used for extracting, after the sorting is completed, the company name information in the bill of lading data adjacent to the newly stored bill of lading data and the company name information in the newly stored bill of lading data;

the similarity calculation unit is used for calculating the similarity between the company name information in the latest stored bill of lading data and the company name information in the adjacent bill of lading data;

and the data merging unit is used for merging the latest stored bill of lading data with the adjacent bill of lading data when the similarity calculated by the similarity calculation unit is greater than a threshold value.

The similarity calculation algorithm in the similarity calculation unit is the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.

In the above customs data cleaning and merging device, the first matching module matches the regional information in the company name information according to the preset rule by regular-expression matching.

A third aspect of the invention discloses a terminal device for customs data cleaning and merging, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to the first aspect of the invention when executing the computer program.

A fourth aspect of the invention discloses a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as disclosed in the first aspect of the invention.

Compared with the prior art, the invention has the following advantages:

1. The invention extracts effective bill of lading data from the original customs data, judges the company name information in the bill of lading data, and cleans out bill of lading data with invalid company name information, thereby reducing the data volume.

2. The invention processes the company name information of the bill of lading data by deleting the area information and converting the suffix information into standard-format suffix information. The company name information thus has a uniform format and the trade name within the company name is highlighted, which improves the accuracy of the similarity calculation in the subsequent data merging.

3. The invention uses the similarity of company name information to judge whether bill of lading data can be merged, so that the bill of lading data of companies that share the same trade name but are located in different areas can be merged. This reduces the number of bills of lading and lets one bill of lading reflect more customs trade information.

The technical scheme of the invention is further described in detail through the drawings and the embodiments.

Drawings

FIG. 1 is a flow chart of the data cleaning and merging method of the present invention.

FIG. 2 is a flow chart of step six in the data cleaning and merging method of the present invention.

FIG. 3 is a block diagram of the data cleaning and merging device of the present invention.

FIG. 4 is a block diagram of the second judging module in the data cleaning and merging device of the present invention.

Detailed Description

As shown in fig. 1, a method for cleaning and merging customs data includes the following steps:

In step one, effective data are extracted from the original customs data using a preset data field comparison table to form a complete piece of effective bill of lading data. For example, the buyer can be found from the importer field and the seller from the exporter field. Therefore, whichever country's customs the original data come from, the extracted bill of lading data has a standard format, which facilitates the later data merging.
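A data field comparison table of this kind can be pictured as a per-source mapping from raw field names to standard field names. The sketch below is illustrative only: the source names `customs_a` and `customs_b`, all raw field names other than importer and exporter, and the rule that a row needs a buyer or a seller to count as effective are assumptions, not details from the source.

```python
from typing import Optional

# Hypothetical comparison table: customs source -> {standard field: raw field}.
FIELD_TABLE = {
    "customs_a": {"buyer": "importer", "seller": "exporter", "date": "arrival_date"},
    "customs_b": {"buyer": "consignee_name", "seller": "shipper_name", "date": "dt"},
}


def extract_bill(raw: dict, source: str) -> Optional[dict]:
    """Map one raw customs row onto the standard bill of lading fields."""
    mapping = FIELD_TABLE[source]
    bill = {std: str(raw.get(src) or "").strip() for std, src in mapping.items()}
    # Assumption: a row with neither buyer nor seller is not an effective bill.
    if not (bill["buyer"] or bill["seller"]):
        return None
    return bill
```

Rows from either source come out with the same keys, which is what makes the later merging step format-independent.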

Step two, extracting company name information in bill of lading data;

In step three, whether the extracted company name information is valid company name information is judged using a preset "invalid public library"; for example, if the company name is not available, the company name information is invalid.
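One way to picture this validity check is a lookup against a set of known placeholder values. The entries of `INVALID_NAMES` and the normalization (lower-casing and collapsing whitespace) are assumptions for illustration; the source does not list the contents of its library.

```python
# Hypothetical contents of the preset library of invalid company names.
INVALID_NAMES = {"", "n/a", "na", "none", "unknown", "not available"}


def is_valid_company_name(name) -> bool:
    """Return False when the name is missing or is a known placeholder."""
    if name is None:
        return False
    key = " ".join(str(name).lower().split())  # lower-case, collapse whitespace
    return key not in INVALID_NAMES
```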

Step four, applying regular-expression matching to the regional information in the company name information; if the match succeeds, deleting the regional information from the company name information and then proceeding to step five, and if the match fails, proceeding directly to step five;

Company area information is matched because most company names contain area information. For example, after regular-expression matching, "Shang Hai" in a company name is identified as company area information. Deleting the area information facilitates the later similarity calculation: a large group company may set up branches or subsidiaries in different areas, and the trade names of these branches or subsidiaries are mostly similar or identical, so deleting the area information helps merge the bill of lading data of branches or subsidiaries that belong to the same group but are located in different areas.
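The region deletion described above might look like the following sketch. The `REGIONS` list is a hypothetical stand-in for the preset rule, and the whitespace cleanup after deletion is an added assumption.

```python
import re

# Hypothetical region vocabulary; the source gives only "Shang Hai" as an example.
REGIONS = ["Shang Hai", "Shanghai", "Bei Jing", "Beijing", "Shen Zhen", "Shenzhen"]
_REGION_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, REGIONS)) + r")\b", re.IGNORECASE
)


def strip_region(name: str) -> str:
    """Delete region words from a company name; return it unchanged if none match."""
    stripped, count = _REGION_RE.subn(" ", name)
    if count == 0:
        return name                    # match failed: pass the name through
    return " ".join(stripped.split())  # tidy the spaces left by the deletion
```

With the region removed, names of branches in different cities reduce to the same string, which is what the later similarity calculation relies on.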

Step five, applying regular-expression matching to the suffix information in the company name information; if the match succeeds, converting the suffix information into standard-format suffix information and then proceeding to step six, and if the match fails, proceeding directly to step six;

Because some company name suffixes are written out in full, such as "xxx Company Limited", "xxx Company Limited" is converted into "xxx co., ltd.". This standardizes the suffix information and makes company names more standard and uniform.
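Suffix standardization can be sketched as an ordered list of pattern and replacement pairs. Only the "Company Limited" rule comes from the source; the other rules in `SUFFIX_RULES` and the capitalization of the standard forms are assumptions for illustration.

```python
import re

# Ordered rules: the first pattern that matches the end of the name wins.
SUFFIX_RULES = [
    (re.compile(r"\bcompany\s+limited\.?\s*$", re.IGNORECASE), "Co., Ltd."),
    (re.compile(r"\bco\.?\s*,?\s*ltd\.?\s*$", re.IGNORECASE), "Co., Ltd."),
    (re.compile(r"\blimited\.?\s*$", re.IGNORECASE), "Ltd."),    # assumed rule
    (re.compile(r"\bcorporation\.?\s*$", re.IGNORECASE), "Corp."),  # assumed rule
]


def normalize_suffix(name: str) -> str:
    """Rewrite a recognized suffix to its standard form; otherwise return as-is."""
    for pattern, standard in SUFFIX_RULES:
        if pattern.search(name):
            return pattern.sub(standard, name).strip()
    return name  # match failed: leave the name unchanged
```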

In step six, merging the bill of lading data that can be merged reduces the number of bill of lading records and concentrates the information, so that a user who finds a company's bill of lading data can see that company's trade information at every customs without querying each customs separately.

Steps two to six are repeated many times, so that the processed bill of lading data, uniform in format and concentrated in information, is stored in the database.

In this embodiment, as shown in FIG. 2, judging in step six whether the database contains bill of lading data that can be merged with the bill of lading data that has completed step five includes:

In step 601, the bill of lading data that has completed step five is first stored in the database, and it is then judged whether the database contains bill of lading data that can be merged with it.

Because the subsequent similarity calculation is based on company name information, all bill of lading data in the database is first sorted by company name information (step 602). After sorting, the bill of lading data most likely to be mergeable with the newly stored bill of lading data can be preliminarily identified as the bill of lading data adjacent to it (step 603).
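The store, sort and find-neighbours sequence can be illustrated on an in-memory list of company names. Note the substitution: instead of re-sorting the whole collection after each insert as the text describes, this sketch uses Python's `bisect` module to insert into an already sorted list, which yields the same ordering. The function name `store_sorted` is hypothetical, and a real database would use an index or an ordered query.

```python
import bisect
from typing import List, Optional, Tuple


def store_sorted(names: List[str], new_name: str) -> Tuple[Optional[str], Optional[str]]:
    """Insert new_name into the sorted list; return its (previous, next) neighbours."""
    i = bisect.bisect_left(names, new_name)
    names.insert(i, new_name)
    previous_name = names[i - 1] if i > 0 else None
    next_name = names[i + 1] if i + 1 < len(names) else None
    return previous_name, next_name
```

A neighbour of `None` indicates that the new record landed at the first or last position.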

step 604, calculating the similarity between the company name information in the newly stored bill of lading data and the company name information in the adjacent bill of lading data; and if the similarity is greater than a threshold value, merging the newly stored bill of lading data with the adjacent bill of lading data. The threshold may be 80% to 90%.

After sorting, if the newly stored bill of lading data is in the first position, the company name similarity is calculated between it and the bill of lading data immediately after it; if the newly stored bill of lading data is in the last position, the company name similarity is calculated between it and the bill of lading data immediately before it.

If the newly stored bill of lading data is in neither the first nor the last position, the company name similarity is first calculated between it and the bill of lading data immediately before it, and if the similarity is greater than the threshold, the data are merged; if the similarity is not greater than the threshold, the company name similarity is then calculated between it and the bill of lading data immediately after it.
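The neighbour comparison order described above (predecessor first, then successor, with a missing neighbour skipped) can be sketched as follows. The function name `choose_merge_target` and the default threshold of 0.85 are assumptions, the latter picked from within the 80% to 90% range mentioned earlier. The `ratio` helper uses the standard library's `difflib.SequenceMatcher` purely as a stand-in similarity measure; it is not one of the three algorithms named in the text.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional


def ratio(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not one of the algorithms the text names."""
    return SequenceMatcher(None, a, b).ratio()


def choose_merge_target(
    new_name: str,
    previous_name: Optional[str],
    next_name: Optional[str],
    similarity: Callable[[str, str], float] = ratio,
    threshold: float = 0.85,  # assumed value within the 80% to 90% range
) -> Optional[str]:
    """Return the neighbouring name to merge with, or None if neither qualifies."""
    for candidate in (previous_name, next_name):  # predecessor first
        if candidate is not None and similarity(new_name, candidate) > threshold:
            return candidate
    return None
```

What merging two records actually does to their fields is not specified in the text, so the sketch stops at selecting the merge target.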

In this embodiment, the similarity calculation in step 604 is implemented by the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.

The Levenshtein distance, N-gram distance and Jaro-Winkler distance algorithms are existing algorithms, and their principles are not described in detail here.
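For reference, a Levenshtein-based similarity can be computed as in the sketch below. The edit distance itself is the standard dynamic-programming algorithm; converting it to a score in [0, 1] by dividing by the longer string's length is an assumption, since the text does not say how the distance is turned into a percentage.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 minus distance divided by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Under this normalization, two names that differ by one character out of about twenty score roughly 0.95, which would clear a threshold in the 80% to 90% range.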

As shown in fig. 3, a customs data cleaning and merging device comprises a bill of lading data extraction module 1, a company name information extraction module 2, a first judgment module 3, a first matching module 4, a second matching module 5 and a second judgment module 6;

the bill of lading data extraction module 1 is used for extracting effective bill of lading data from original customs data;

the bill of lading data extraction module 1 uses the preset data field comparison table to extract effective data from the original customs data and form a complete piece of effective bill of lading data. For example, the buyer can be found from the importer field and the seller from the exporter field. Therefore, whichever country's customs the original data come from, the extracted bill of lading data has a standard format, which facilitates the later data merging.

The company name information extraction module 2 is used for extracting company name information from the bill of lading data extracted by the bill of lading data extraction module 1;

the first judging module 3 is used for judging whether the company name information extracted by the company name information extraction module 2 is valid company name information, and if so, triggering the first matching module 4 and the second matching module 5 to operate;

the first judging module 3 judges whether the extracted company name information is valid company name information by using a preset "invalid public library"; for example, if the company name is not available, the company name information is invalid.

The first matching module 4 is used for applying regular-expression matching to the regional information in the company name information and, if the match succeeds, deleting the regional information from the company name information;

The second matching module 5 is used for applying regular-expression matching to the suffix information in the company name information and, if the match succeeds, converting the suffix information into standard-format suffix information;

The second judging module 6 is used for judging whether the database contains bill of lading data that can be merged with the bill of lading data processed by the first matching module 4 and the second matching module 5, and if so, merging the data.

As shown in FIG. 4, in this embodiment, the second judging module 6 includes: a data writing unit 61, a data sorting unit 62, a company name information extracting unit 63, a similarity calculating unit 64, and a data merging unit 65.

A data writing unit 61, configured to store the bill of lading data processed by the first matching module 4 and the second matching module 5 into a database;

a data sorting unit 62, configured to sort all bill of lading data in the database according to company name information;

the data sorting unit 62 performs one sort each time new bill of lading data is stored in the database.

A company name information extracting unit 63, configured to extract, after the sorting is completed, the company name information in the bill of lading data adjacent to the newly stored bill of lading data and the company name information in the newly stored bill of lading data;

after the data sorting unit 62 has sorted the data, the newly stored bill of lading data has its own position in the sequence. Because the subsequent similarity calculation is based on company name information and all bill of lading data in the database is sorted by company name information, the bill of lading data most likely to be mergeable with the newly stored bill of lading data can be preliminarily identified as the bill of lading data adjacent to it.

A similarity calculation unit 64 for performing similarity calculation on company name information in the newly stored bill of lading data and company name information in the adjacent bill of lading data;

and a data merging unit 65, configured to, when the similarity calculated by the similarity calculation unit 64 is greater than a threshold value, perform data merging on the newly stored bill of lading data and the adjacent bill of lading data.

In the present embodiment, the similarity calculation algorithm in the similarity calculation unit 64 is the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.

A terminal device for customs data cleaning and merging comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the customs data cleaning and merging method described above when executing the computer program.

A computer readable storage medium stores a computer program which, when executed by a processor, implements the steps of the customs data cleaning and merging method described above.

The foregoing description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and any simple modification, variation and equivalent structural changes made to the above embodiment according to the technical substance of the present invention still fall within the scope of the technical solution of the present invention.

Claims

1. A method for cleaning and merging customs data, comprising the steps of:

step two, extracting company name information in bill of lading data;

2. The method for cleaning and merging customs data according to claim 1, wherein the step six of determining whether there is bill of lading data in the database that can be merged with the bill of lading data after the step five is completed comprises:

3. A method for cleaning and merging customs data according to claim 2, wherein the similarity calculation in step 604 is implemented by the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.

4. A method for cleaning and merging customs data according to claim 1, 2 or 3, wherein in step four, the matching of the regional information in the company name information according to the preset rule is implemented by regular-expression matching.

5. The customs data cleaning and merging device is characterized by comprising a bill of lading data extraction module, a company name information extraction module, a first judgment module, a first matching module, a second matching module and a second judgment module;

the second judging module is used for judging whether bill of lading data which can be combined with the bill of lading data processed by the first matching module and the second matching module exists in the database, and if so, data combination is carried out.

6. The device for cleaning and merging customs data according to claim 5, wherein the second judging module comprises: a data writing unit, a data sorting unit, a company name information extraction unit, a similarity calculation unit and a data merging unit;

7. A device for cleaning and merging customs data according to claim 6, wherein the similarity calculation algorithm in the similarity calculation unit is the Levenshtein distance algorithm, the N-gram distance algorithm or the Jaro-Winkler distance algorithm.

8. The device for cleaning and merging customs data according to claim 5, 6 or 7, wherein the first matching module matches the regional information in the company name information according to a preset rule by regular-expression matching.

9. A customs data cleansing and merging terminal comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1-4 when executing the computer program.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any of claims 1-4.