CN113627132A - Data deduplication mark code generation method and system, electronic device and storage medium - Google Patents


Info

Publication number
CN113627132A
Authority
CN
China
Prior art keywords
bidding
data
feature
content
title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110996617.3A
Other languages
Chinese (zh)
Other versions
CN113627132B (en)
Inventor
刘瑞熙
王兆元
李青龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Smart Starlight Anhui Technology Co ltd
Original Assignee
Beijing Smart Starlight Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Smart Starlight Information Technology Co ltd
Priority to CN202110996617.3A
Publication of CN113627132A
Application granted
Publication of CN113627132B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method, a system, an electronic device and a storage medium for generating a data deduplication mark code. The method comprises the following steps: obtaining a bidding title, bidding content, a bidding number, a bidding unit name and a bidding stage type for each piece of bidding data in a bidding data set; obtaining the corresponding title feature, content feature, number feature, unit name feature and stage type feature from this multi-dimensional information; obtaining a data code for each piece of bidding data from the title feature, the content feature, the number feature, the unit name feature and the stage type feature; obtaining a group code for each piece of bidding data from the title feature, the content feature, the number feature and the unit name feature; and obtaining the deduplication mark code for each piece of bidding data from the data code and the group code. Duplicate data can then be identified by matching deduplication mark codes, without similarity calculation, which improves the efficiency of deduplicating bidding data.

Description

Data deduplication mark code generation method and system, electronic device and storage medium
Technical Field
The invention relates to deduplication using multi-dimensional text features in the field of data processing, and in particular to a data deduplication mark code generation method and system, an electronic device and a storage medium.
Background
Large-scale data deduplication is commonly performed in one of two ways. In the first, duplicates are found by searching the database and computing TF-IDF cosine similarity; when the data volume is very large, the computation takes too long and high-volume streaming data is stored too slowly. In the second, the text is first encoded with simhash and duplicates are then removed by similarity calculation. When simhash is applied to semi-structured bidding data, the texts are short, so the simhash code of the whole text carries few features; simhash is poorly suited to such short texts, and the deduplication effect is poor. Similarity calculation is still required, so with a very large data volume the computation again takes too long and high-volume streaming data is stored too slowly.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, a system, an electronic device and a storage medium for generating a data deduplication mark code, so as to solve the problem of inefficient deduplication of bidding data in the prior art.
To this end, the embodiments of the invention provide the following technical solutions:
According to a first aspect, an embodiment of the present invention provides a method for generating a data deduplication mark code, including: acquiring a bidding data set, wherein the bidding data set comprises a plurality of collected pieces of bidding data; obtaining a bidding title, bidding content, a bidding number, a bidding unit name and a bidding stage type of each piece of bidding data from the bidding data set; obtaining a title feature for each piece of bidding data from its bidding title; obtaining a content feature for each piece of bidding data from its bidding content; obtaining a number feature for each piece of bidding data from its bidding number; obtaining a unit name feature for each piece of bidding data from its bidding unit name; obtaining a stage type feature for each piece of bidding data from its bidding stage type; obtaining a data code for each piece of bidding data from its title feature, content feature, number feature, unit name feature and stage type feature; obtaining a group code for each piece of bidding data from its title feature, content feature, number feature and unit name feature; and obtaining the deduplication mark code for each piece of bidding data from its data code and group code.
Optionally, the step of obtaining the title feature for each piece of bidding data from its bidding title includes: acquiring a preset stage category dictionary; removing the stage type words from the bidding title of each piece of bidding data according to the preset stage category dictionary; segmenting the bidding titles from which the stage type words have been removed to obtain the title word segments for each piece of bidding data; calculating the TF-IDF value of each word in the title word segments of each piece of bidding data; taking the first preset number of words with the highest TF-IDF values as title extraction keywords; and sorting the title extraction keywords in a first preset order to obtain title sorting keywords, and taking the title sorting keywords as the title feature of each piece of bidding data.
Optionally, the step of obtaining the content feature for each piece of bidding data from its bidding content includes: segmenting the bidding content of each piece of bidding data to obtain content word segments; removing stop words from the content word segments according to a preset stop word dictionary; counting word frequencies over the content word segments with stop words removed, and taking the second preset number of words with the highest word frequency as first content keywords; sorting the content word segments with stop words removed by word length, and taking the third preset number of longest words as second content keywords; taking the keywords that appear in both the first content keywords and the second content keywords as content extraction keywords; and sorting the content extraction keywords in a second preset order to obtain content sorting keywords, and taking the content sorting keywords as the content feature of each piece of bidding data.
Optionally, the step of obtaining the data code for each piece of bidding data from its title feature, content feature, number feature, unit name feature and stage type feature includes: splicing the title feature, the content feature, the number feature, the unit name feature and the stage type feature of each piece of bidding data in a first preset splicing order to obtain a first spliced feature; and encoding and encrypting the first spliced feature to obtain the data code of each piece of bidding data.
Optionally, the step of obtaining the group code for each piece of bidding data from its title feature, content feature, number feature and unit name feature includes: splicing the title feature, the content feature, the number feature and the unit name feature of each piece of bidding data in a second preset splicing order to obtain a second spliced feature; and encoding and encrypting the second spliced feature to obtain the group code of each piece of bidding data.
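As a rough illustration of the two splicing-and-encoding steps above, the sketch below joins the features in a fixed order and hashes the result. The source does not name the encoding or encryption algorithm, so the use of SHA-256, the `|` separator, the function names `make_code` and `dedup_mark_code`, and the sample feature values are all assumptions made for illustration only.

```python
import hashlib


def make_code(features):
    """Join feature strings in a fixed order and hash the result.

    Assumption: SHA-256 over a '|'-joined string stands in for the
    unspecified "encoding and encrypting" step.
    """
    joined = "|".join(features)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def dedup_mark_code(title, content, number, unit, stage):
    """Return (data_code, group_code) for one piece of bidding data."""
    # The data code includes the stage type feature; the group code omits it.
    data_code = make_code([title, content, number, unit, stage])
    group_code = make_code([title, content, number, unit])
    return data_code, group_code


# Hypothetical feature values for two stages of the same bid.
common = ("building renovation", "cabling renovation", "2019-10-09-MNQ", "Example Co")
notice = dedup_mark_code(*common, "bidding announcement")
result = dedup_mark_code(*common, "winning bid announcement")
```

Under these assumptions the two records share a group code but differ in their data codes, which is consistent with the idea that the group code ties together different stages of the same bid while the data code identifies exact repeats.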
Optionally, after the step of obtaining the deduplication mark code for each piece of bidding data from its data code and group code, the method further includes: acquiring a deduplication requirement; and deduplicating the bidding data according to the deduplication requirement and the deduplication mark code of each piece of bidding data to obtain the deduplicated bidding data.
Optionally, when the deduplication requirement is to deduplicate by the data code in the deduplication mark code, the step of deduplicating the bidding data according to the deduplication requirement and the deduplication mark code of each piece of bidding data to obtain the deduplicated bidding data includes: sorting the bidding data by data code; acquiring the collection and storage time of each piece of bidding data; sorting the bidding data that share the same data code by collection and storage time; and, among the bidding data that share the same data code, retaining the piece with the earliest collection and storage time as the deduplicated bidding data.
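The data-code deduplication described above can be sketched as follows. This is a minimal illustration, assuming each record is a dictionary with hypothetical `data_code` and `stored_at` keys; the source does not specify a record layout.

```python
from datetime import datetime


def dedupe_by_data_code(records):
    """Keep, for each data code, the record with the earliest storage time."""
    # Sort by data code first, then by collection and storage time.
    ordered = sorted(records, key=lambda r: (r["data_code"], r["stored_at"]))
    kept = []
    seen = set()
    for record in ordered:
        # The first record seen for a code is the earliest one.
        if record["data_code"] not in seen:
            seen.add(record["data_code"])
            kept.append(record)
    return kept


# Hypothetical records: ids 1 and 2 share a data code.
records = [
    {"id": 1, "data_code": "aaa", "stored_at": datetime(2021, 8, 2)},
    {"id": 2, "data_code": "aaa", "stored_at": datetime(2021, 8, 1)},
    {"id": 3, "data_code": "bbb", "stored_at": datetime(2021, 8, 3)},
]
deduped = dedupe_by_data_code(records)
```

Because duplicates are found by exact code equality, no pairwise similarity is computed; the cost is dominated by the sort.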
According to a second aspect, an embodiment of the present invention provides a data deduplication mark code generation system, including: a first acquisition module, configured to acquire a bidding data set, wherein the bidding data set comprises a plurality of collected pieces of bidding data; a first processing module, configured to obtain the bidding title, bidding content, bidding number, bidding unit name and bidding stage type of each piece of bidding data from the bidding data set; a second processing module, configured to obtain the title feature for each piece of bidding data from its bidding title; a third processing module, configured to obtain the content feature for each piece of bidding data from its bidding content; a fourth processing module, configured to obtain the number feature for each piece of bidding data from its bidding number; a fifth processing module, configured to obtain the unit name feature for each piece of bidding data from its bidding unit name; a sixth processing module, configured to obtain the stage type feature for each piece of bidding data from its bidding stage type; a seventh processing module, configured to obtain the data code for each piece of bidding data from its title feature, content feature, number feature, unit name feature and stage type feature; an eighth processing module, configured to obtain the group code for each piece of bidding data from its title feature, content feature, number feature and unit name feature; and a ninth processing module, configured to obtain the deduplication mark code for each piece of bidding data from its data code and group code.
Optionally, the second processing module includes: a first acquisition unit, configured to acquire a preset stage category dictionary; a first processing unit, configured to remove the stage type words from the bidding title of each piece of bidding data according to the preset stage category dictionary; a second processing unit, configured to segment the bidding titles from which the stage type words have been removed to obtain the title word segments for each piece of bidding data; a third processing unit, configured to calculate the TF-IDF value of each word in the title word segments of each piece of bidding data; a fourth processing unit, configured to take the first preset number of words with the highest TF-IDF values as title extraction keywords; and a fifth processing unit, configured to sort the title extraction keywords in a first preset order to obtain title sorting keywords, and to take the title sorting keywords as the title feature of each piece of bidding data.
Optionally, the third processing module includes: a sixth processing unit, configured to segment the bidding content of each piece of bidding data to obtain content word segments; a seventh processing unit, configured to remove stop words from the content word segments according to a preset stop word dictionary; an eighth processing unit, configured to count word frequencies over the content word segments with stop words removed, and to take the second preset number of words with the highest word frequency as first content keywords; a ninth processing unit, configured to sort the content word segments with stop words removed by word length, and to take the third preset number of longest words as second content keywords; a tenth processing unit, configured to take the keywords that appear in both the first content keywords and the second content keywords as content extraction keywords; and an eleventh processing unit, configured to sort the content extraction keywords in a second preset order to obtain content sorting keywords, and to take the content sorting keywords as the content feature of each piece of bidding data.
Optionally, the seventh processing module includes: a twelfth processing unit, configured to splice the title feature, the content feature, the number feature, the unit name feature and the stage type feature of each piece of bidding data in the first preset splicing order to obtain a first spliced feature; and a thirteenth processing unit, configured to encode and encrypt the first spliced feature to obtain the data code of each piece of bidding data.
Optionally, the eighth processing module includes: a fourteenth processing unit, configured to splice the title feature, the content feature, the number feature and the unit name feature of each piece of bidding data in a second preset splicing order to obtain a second spliced feature; and a fifteenth processing unit, configured to encode and encrypt the second spliced feature to obtain the group code of each piece of bidding data.
Optionally, the system further includes: a second acquisition module, configured to acquire the deduplication requirement; and a tenth processing module, configured to deduplicate the bidding data according to the deduplication requirement and the deduplication mark code of each piece of bidding data to obtain the deduplicated bidding data.
Optionally, when the deduplication requirement is to deduplicate by the data code in the deduplication mark code, the tenth processing module includes: a sixteenth processing unit, configured to sort the bidding data by data code; a second acquisition unit, configured to acquire the collection and storage time of each piece of bidding data; a seventeenth processing unit, configured to sort the bidding data that share the same data code by collection and storage time; and an eighteenth processing unit, configured to retain, among the bidding data that share the same data code, the piece with the earliest collection and storage time as the deduplicated bidding data.
According to a third aspect, an embodiment of the present invention provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, causes the at least one processor to perform the data deduplication mark code generation method according to any implementation of the first aspect.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions, where the computer instructions are configured to cause a computer to execute the data deduplication mark code generation method according to any implementation of the first aspect.
The technical solutions of the embodiments of the invention have the following advantages:
The embodiments of the invention provide a method, a system, an electronic device and a storage medium for generating a data deduplication mark code, wherein the method comprises the following steps: acquiring a bidding data set, wherein the bidding data set comprises a plurality of collected pieces of bidding data; obtaining a bidding title, bidding content, a bidding number, a bidding unit name and a bidding stage type of each piece of bidding data from the bidding data set; obtaining a title feature for each piece of bidding data from its bidding title; obtaining a content feature for each piece of bidding data from its bidding content; obtaining a number feature for each piece of bidding data from its bidding number; obtaining a unit name feature for each piece of bidding data from its bidding unit name; obtaining a stage type feature for each piece of bidding data from its bidding stage type; obtaining a data code for each piece of bidding data from its title feature, content feature, number feature, unit name feature and stage type feature; obtaining a group code for each piece of bidding data from its title feature, content feature, number feature and unit name feature; and obtaining the deduplication mark code for each piece of bidding data from its data code and group code.
First, the bidding title, bidding content, bidding number, bidding unit name and bidding stage type of each piece of bidding data are determined from the acquired bidding data set. Second, the title feature is determined from the bidding title, the content feature from the bidding content, the number feature from the bidding number, the unit name feature from the bidding unit name and the stage type feature from the bidding stage type. Then, the data code for each piece of bidding data is obtained from the title feature, the content feature, the number feature, the unit name feature and the stage type feature, and the group code for each piece of bidding data is obtained from the title feature, the content feature, the number feature and the unit name feature. Finally, the data code and the group code together serve as the deduplication mark code of each piece of bidding data. Deduplication of the bidding data is carried out using the deduplication mark code, so duplicate data can be identified without similarity calculation, which improves the efficiency of deduplicating bidding data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a specific example of a data deduplication mark code generation method according to an embodiment of the present invention;
Fig. 2 is a flowchart of another specific example of a data deduplication mark code generation method according to an embodiment of the present invention;
Fig. 3 is a block diagram of a specific example of a data deduplication mark code generation system according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a data deduplication mark code generation method. As shown in Fig. 1, the method includes steps S1 to S10.
Step S1: a bid data set is obtained that includes a plurality of collected bid data.
In this embodiment, bidding data is collected by a web crawler, and the collected pieces of bidding data form the bidding data set. In other embodiments, the bidding data may also be obtained by other prior-art techniques, for example through a commercial interface; the approach can be chosen according to actual needs.
Step S2: the bidding title, the bidding content, the bidding number, the bidding unit name and the bidding stage type of each piece of bidding data are obtained from the bidding data set.
In this embodiment, title extraction is performed on each piece of bidding data in the bidding data set to obtain the corresponding bidding title. One specific approach is for the back end to collect the data and for the algorithm processing stage to read the title and content of each piece of bidding data directly from the queue. In other embodiments, other prior-art title extraction methods may be used to obtain the bidding title; the description here is illustrative only and is not limiting.
In this embodiment, content extraction is performed on each piece of bidding data in the bidding data set to obtain the corresponding bidding content. One specific approach is for the back end to collect the data and for the algorithm processing stage to read the content and title of each piece of bidding data directly from the queue. In other embodiments, other prior-art content extraction methods may be used to obtain the bidding content; the description here is illustrative only and is not limiting.
In this embodiment, the bidding number refers to the project number of the bidding document. Each bid has a unique project number, and different bids are distinguished by their project numbers. The project number is extracted from each piece of bidding data in the bidding data set to obtain the corresponding bidding number. One specific extraction method uses regular expressions: bidding data is semi-structured, each bid corresponds to a unique bidding number, and the text that introduces the bidding number is relatively fixed, such as "bid number: 2019-10-09-Guangdong-character MNQ" or "project number: 2019-10-09-Guangdong-character MNQ". The fixed phrases that introduce the bidding number are therefore matched in the bidding content with a regular expression, and the number that follows is extracted. In other embodiments, other prior-art project number extraction methods may be used to obtain the bidding number; the description here is illustrative only and is not limiting.
In this embodiment, the bidding unit is extracted from each piece of bidding data in the bidding data set to obtain the corresponding bidding unit name. One specific extraction method uses a regular expression to extract the text in the bidding content that follows a label such as "tenderer" or "bidding unit" and takes it as the bidding unit. In other embodiments, other prior-art unit name extraction methods may be used to obtain the bidding unit name; the description here is illustrative only and is not limiting.
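The two regular-expression extractions described above might look roughly like the following sketch. The source only states that fixed label phrases are matched; the exact patterns, the English labels, the delimiter set, the helper name `extract_first` and the sample text are all assumptions for illustration. Real bidding data would use the original-language labels.

```python
import re

# Assumed labels and delimiters; a label is followed by ':' and the value
# runs until the next delimiter.
BID_NUMBER_RE = re.compile(
    r"(?:bid number|project number)\s*[:：]\s*([^\s,;，；。]+)", re.IGNORECASE
)
UNIT_NAME_RE = re.compile(
    r"(?:tenderer|bidding unit)\s*[:：]\s*([^,;，；。\n]+)", re.IGNORECASE
)


def extract_first(pattern, text):
    """Return the first captured value for the pattern, or None if absent."""
    match = pattern.search(text)
    return match.group(1).strip() if match else None


# Hypothetical bidding content.
sample = (
    "Project number: 2019-10-09-MNQ; "
    "Tenderer: Example Construction Group; budget to be announced"
)
bid_number = extract_first(BID_NUMBER_RE, sample)
unit_name = extract_first(UNIT_NAME_RE, sample)
```

Returning `None` when no label is found leaves it to the caller to decide how a missing number or unit name should be represented in the feature.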
In this embodiment, the bidding stage type refers to the stage of the bidding process that a piece of bidding data belongs to. The stage types fall into two major categories, bidding announcement and bidding result, and each major category is divided into more than 20 minor categories, such as consultation, negotiation, inquiry, correction, deal and bidding. The keywords corresponding to each minor category are obtained through statistics over a large amount of historical bidding data; the keywords of each category are then mapped to the corresponding bidding stage type to generate the preset stage category dictionary in advance. The specific classification of bidding stage types in this embodiment is illustrative only and is not limiting.
Specifically, the bidding stage type is determined as follows: when the stage type words are removed from the title, the removed stage type word is retained and used as the stage type keyword of that piece of bidding data; the keyword is then looked up in the pre-built preset stage category dictionary to find the stage type it maps to, which gives the bidding stage type.
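The dictionary lookup described above can be sketched as follows. The dictionary entries, the stage labels and the function name `split_stage` are hypothetical; the source says only that keywords are mapped to stage types and that the matched keyword is removed from the title.

```python
# Hypothetical stage category dictionary: keyword -> stage type label.
STAGE_DICT = {
    "bidding announcement": "announcement/bidding",
    "winning bid announcement": "result/winning bid",
    "correction notice": "announcement/correction",
}


def split_stage(title, stage_dict=STAGE_DICT):
    """Return (title without the stage keyword, stage type or None)."""
    # Try longer keywords first so a short keyword cannot shadow a longer one.
    for keyword in sorted(stage_dict, key=len, reverse=True):
        pos = title.find(keyword)
        if pos != -1:
            stripped = (title[:pos] + title[pos + len(keyword):]).strip()
            return stripped, stage_dict[keyword]
    return title, None


# Hypothetical title.
clean_title, stage = split_stage(
    "Training building renovation project bidding announcement"
)
```

Matching is case-sensitive here for simplicity; a production dictionary would also need to handle titles that contain several stage keywords.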
Step S3: the title feature for each piece of bidding data is obtained from its bidding title.
In this embodiment, analysis of the bidding data shows that repetition falls into two cases: records that are completely identical, and records that belong to different stages of the same bid. For example, two announcements for the strong- and weak-current renovation project of a certain company's simulation training building relate to the same bid, and their bidding numbers and other related contents are the same, yet they should not be counted as duplicates during data processing. The bidding stage type is therefore introduced: it represents the different life-cycle stages of the same bid, which provides more value in subsequent data display.
The title of a piece of bidding data generally contains words related to the stage type. For example, "bidding announcement" in the title "bidding announcement for the strong- and weak-current renovation project of a certain team's simulation training building" indicates the stage type of the bid, and "winning bid announcement" in the title "winning bid announcement for the strong- and weak-current renovation project of a certain team's simulation training building" likewise indicates the stage type of the bid.
In this embodiment, because the bidding stage type is a separate feature that represents the stage of the bid, the stage type does not need to be considered in the title, and the stage type words in the title are removed before the title keywords are extracted. Keywords are then extracted from the bidding title, with stage type words removed, of each piece of bidding data to obtain the title keywords, and the title keywords are arranged in a fixed order to serve as the title feature of that piece of bidding data.
Step S4: obtaining the content feature corresponding to each bidding data according to the bidding content of each bidding data.
In this embodiment, the bidding content of each bidding data is segmented into words and word frequency statistics are computed. The M words with the highest word frequency (top M) are taken; the words are also sorted by word length and the N longest words (top N) are taken. The words that appear in both the frequency top M and the length top N are taken as the content keywords and output as a list; the content keywords are then arranged in a certain order and used as the content feature corresponding to each bidding data.
Step S5: obtaining the number feature corresponding to each bidding data according to the bidding number of each bidding data.
In this embodiment, the bidding number of each acquired bidding data is used as the number feature corresponding to that bidding data.
Step S6: obtaining the unit name feature corresponding to each bidding data according to the bidding unit name of each bidding data.
In this embodiment, the bidding unit of the same bid is generally the same, so the bidding unit name is used as a further judgment basis to improve the accuracy of duplicate detection. The bidding unit name of each acquired bidding data is used as the unit name feature corresponding to that bidding data.
Step S7: obtaining the stage type feature corresponding to each bidding data according to the bidding stage type of each bidding data.
In this embodiment, the bidding stage type of each acquired bidding data is used as the stage type feature corresponding to that bidding data.
Step S8: obtaining the data code corresponding to each bidding data according to the title feature, the content feature, the number feature, the unit name feature and the stage type feature of each bidding data.
In this embodiment, the title feature, the content feature, the number feature, the unit name feature and the stage type feature corresponding to each bidding data are spliced in a certain order, and the spliced multidimensional feature is encoded, specifically with MD5; the result of the encoding is the data code corresponding to each bidding data.
Step S9: obtaining the group code corresponding to each bidding data according to the title feature, the content feature, the number feature and the unit name feature of each bidding data.
In this embodiment, the title feature, the content feature, the number feature and the unit name feature corresponding to each bidding data are spliced in a certain order, and the spliced multidimensional feature is encoded, specifically with MD5; the result of the encoding is the group code corresponding to each bidding data.
Step S10: obtaining the deduplication mark code corresponding to each bidding data according to the data code and the group code of each bidding data.
In this embodiment, each piece of bidding data corresponds to one data code and one group code; together they serve as the deduplication mark code of the bidding data, and the data in the bidding database can then be deduplicated according to the deduplication mark code. Specifically, all data in the bidding database are tagged with two codes, unique_code and group_code. The bidding database can be grouped and deduplicated using unique_code alone; it can also be grouped and deduplicated using both group_code and unique_code, which additionally shows the different stages of the same bid.
In the above steps, the bidding title, bidding content, bidding number, bidding unit name and bidding stage type corresponding to each bidding data are first determined from the acquired bidding data set; next, the title feature is determined from the bidding title, the content feature from the bidding content, the number feature from the bidding number, the unit name feature from the bidding unit name, and the stage type feature from the bidding stage type; then the data code corresponding to each bidding data is obtained from the title feature, the content feature, the number feature, the unit name feature and the stage type feature, and the group code corresponding to each bidding data is obtained from the title feature, the content feature, the number feature and the unit name feature; finally, the data code and the group code together serve as the deduplication mark code corresponding to each bidding data, and the bidding data are deduplicated using the deduplication mark code, so that duplicates can be identified without similarity calculation and the deduplication efficiency for bidding data is improved.
As an exemplary embodiment, the step S3 of obtaining the title feature corresponding to each bidding data according to the bidding title of each bidding data includes steps S31-S36.
S31: acquiring a preset stage category dictionary.
In this embodiment, the stage types refer to the different life cycle stages of the same bid, that is, the different stages of the same bid in the bidding process. The bidding stage types comprise two major types, bidding announcement and bidding result, and each major type comprises more than 20 minor types, such as consultation, negotiation, inquiry, correction, deal and bidding. Each minor type includes a plurality of keywords representing that type, obtained through statistics over a large amount of historical bidding data. The keywords of the minor types are associated with the major types, and each major type corresponds to one stage type; this forms the mapping between stage type keywords and stage types, from which a mapping dictionary is generated. This pre-mapped dictionary is the preset stage category dictionary.
S32: removing the stage type words from the bidding title of each bidding data according to the preset stage category dictionary.
In this embodiment, the words in the bidding title of each bidding data are compared with the stage type keywords in the preset stage category dictionary; when a stage type keyword from the dictionary appears in a bidding title, it is removed from the title.
In this embodiment, after the stage type words are removed, the title still cannot be transcoded directly as a whole. Because bid information is collected from different websites, titles may differ by one or more added or deleted words even though they describe the same bid; such duplicates account for a relatively large proportion of the actual database, so keywords need to be extracted from the title in the subsequent steps.
S33: segmenting the bidding titles from which the stage type words have been removed, to obtain the title word segments corresponding to each bidding data.
In this embodiment, word segmentation is performed after the stage type words are removed from the bidding titles, yielding the title word segments of each bidding title. The specific word segmentation method may be jieba word segmentation; of course, in other embodiments, other word segmentation methods in the prior art may also be adopted. This is described only schematically in this embodiment and is not limited thereto.
S34: calculating the TFIDF value of each word segment in the title word segments corresponding to each bidding data.
In this embodiment, word frequency statistics are performed on the title word segments, and the term frequency and the inverse document frequency of each word segment are calculated to obtain its TFIDF value.
S35: taking the first preset number of word segments with the highest TFIDF values as the title extraction keywords.
In this embodiment, the first preset number may be 3, that is, the 3 word segments with the highest TFIDF values are selected as the title extraction keywords. Of course, in other embodiments, the first preset number may also be 2 or 4; the first preset number is described only schematically in this embodiment, is not limited thereto, and may be set as required in practical application.
Specifically, the title word segments are sorted by TFIDF value, and the 3 word segments with the highest TFIDF values are taken as the title extraction keywords corresponding to the bidding title.
S36: sorting the title extraction keywords according to a first preset order to obtain title sorting keywords, and taking the title sorting keywords as the title feature corresponding to each bidding data.
In this embodiment, the first preset order may be the order of the first letters of the pinyin of the Chinese characters; of course, in other embodiments, the first preset order may be the number of strokes of the Chinese characters. This is described only schematically in this embodiment and is not limited thereto.
In this embodiment, the title extraction keywords are arranged according to the first preset order; the arranged keywords are the title sorting keywords corresponding to the bidding title, and they are used as the title feature of the bidding title.
In the above steps, words related to the stage type are removed from the title according to the preset stage category dictionary, which is a stage category dictionary table obtained earlier through data statistics; after the stage type words are removed, the first preset number of keywords are extracted from the title using TFIDF as the title extraction keywords, the title extraction keywords are sorted, and the sorted keywords serve as the bidding title feature item to be transcoded. These steps shield the title feature from the word noise introduced by different websites.
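Steps S33 to S36 can be sketched as below, assuming the title has already been segmented and stripped of stage type words. The smoothed IDF formula, the alphabetical tie-breaking and the plain lexicographic sort (standing in for the pinyin-based first preset order) are assumptions made for illustration; the function name `title_feature` is hypothetical.

```python
import math
from collections import Counter


def title_feature(tokens, corpus, top_k=3):
    """Pick the top_k tokens of one title by TF-IDF and return them sorted.

    tokens: segmented title with stage type words already removed.
    corpus: list of segmented titles used for document frequencies.
    """
    tf = Counter(tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed inverse document frequency (an assumption, not from the source).
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = (count / len(tokens)) * idf
    # Highest score first; ties broken alphabetically for determinism.
    top = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    # Plain lexicographic order stands in for the "first preset order".
    return sorted(top)
```

Because the selected keywords are sorted before being returned, two titles that yield the same keywords in a different word order produce the same title feature.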
As an exemplary embodiment, the step S4 of obtaining the content characteristic corresponding to each bidding data according to the bidding content of each bidding data includes steps S41-S46.
Step S41: segmenting the bidding content of each bidding data to obtain content word segments.
In this embodiment, regarding the body text of a bid, again because of how the data are collected, texts that describe the same bid are not completely identical as whole articles; for example, the typesetting format or the header and ending may differ. In addition, since bidding content is generally about 500 words long, content keywords are used to represent the bidding content.
In this embodiment, the bidding content is segmented into words, yielding the content word segments corresponding to the bidding content of each bidding data. The specific word segmentation method may be jieba word segmentation; of course, in other embodiments, other word segmentation methods in the prior art may also be adopted. This is described only schematically in this embodiment and is not limited thereto.
Step S42: removing the stop words from the content word segments according to a preset stop word dictionary.
In this embodiment, the preset stop word dictionary is obtained through statistics over a large amount of historical bidding content. Specifically, the preset stop word dictionary may be the Harbin Institute of Technology stop word list, the stop word list of the Machine Intelligence Laboratory of Sichuan University, or the Baidu stop word list. This is described only schematically in this embodiment and is not limited thereto; in other embodiments, other stop word lists may be used and set as required.
Step S43: performing word frequency statistics on the content word segments from which the stop words have been removed, and taking the second preset number of content word segments with the highest word frequency as the first content keywords.
In this embodiment, the second preset number may be 5, that is, the 5 word segments with the highest word frequency are selected as the first content keywords. Of course, in other embodiments, the second preset number may also be 4 or 6; the second preset number is described only schematically in this embodiment, is not limited thereto, and may be set as required in practical application.
In this embodiment, word frequency statistics are performed on the content word segments from which the stop words have been removed, the word frequency of each word segment is calculated and compared, and the 5 word segments with the highest word frequency are selected as the first content keywords corresponding to the bidding content.
Step S44: sorting the content word segments from which the stop words have been removed by word length, and taking the third preset number of longest content word segments as the second content keywords.
In this embodiment, the third preset number may be 10; of course, in other embodiments, the third preset number may also be 8 or 12. The third preset number is described only schematically in this embodiment, is not limited thereto, and may be set as required in practical application.
In this embodiment, the word length of each content word segment remaining after stop word removal is counted, and the 10 longest content word segments are used as the second content keywords.
Step S45: taking the keywords that appear in both the first content keywords and the second content keywords as the content extraction keywords.
In this embodiment, the first content keywords and the second content keywords are compared to find the keywords common to both, and these common keywords are used as the content extraction keywords corresponding to the bidding content.
Step S46: sorting the content extraction keywords according to a second preset order to obtain content sorting keywords, and taking the content sorting keywords as the content feature corresponding to each bidding data.
In this embodiment, the second preset order may be the order of the first letters of the pinyin of the Chinese characters; of course, in other embodiments, the second preset order may be the number of strokes of the Chinese characters. This is described only schematically in this embodiment and is not limited thereto.
In this embodiment, the content extraction keywords are arranged according to the second preset order; the arranged keywords are the content sorting keywords corresponding to the bidding content, and they are used as the content feature of the bidding content.
In the above steps, the bid content is first segmented into words; stop words are removed after segmentation so that the remaining word segments have strong bidding characteristics; the word frequency of each word segment is calculated, and the second preset number of word segments with the highest word frequency are taken as the first content keywords; the word segments are sorted by word length, and the third preset number of longest word segments are taken as the second content keywords; the words that appear in both the first and the second content keywords are taken as the content extraction keywords; the content extraction keywords are sorted, and the sorted keywords serve as the bidding content feature item to be transcoded. These steps shield the content feature from the interference introduced by different websites.
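Steps S42 to S46 can be sketched as below on an already segmented text. The function name `content_feature`, the English sample words, the alphabetical tie-breaking and the lexicographic final sort (standing in for the second preset order) are assumptions for illustration only.

```python
from collections import Counter


def content_feature(tokens, stop_words, m=5, n=10):
    """Keywords that are both among the m most frequent and the n longest words."""
    # S42: drop stop words.
    kept = [t for t in tokens if t not in stop_words]
    freq = Counter(kept)
    # S43: top-m words by frequency (ties broken alphabetically, an assumption).
    by_freq = sorted(freq, key=lambda w: (-freq[w], w))[:m]
    # S44: top-n distinct words by length (ties broken alphabetically).
    by_len = sorted(freq, key=lambda w: (-len(w), w))[:n]
    # S45: keep only the words present in both lists.
    common = set(by_freq) & set(by_len)
    # S46: lexicographic order stands in for the "second preset order".
    return sorted(common)
```

Intersecting the two lists keeps words that are both frequent and long, which filters out short function-like words that survive stop word removal as well as long words that appear only once.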
As an exemplary embodiment, the step S8 of obtaining the data code corresponding to each bidding data according to the title feature, the content feature, the number feature, the unit name feature and the stage type feature of each bidding data includes steps S81-S82.
Step S81: splicing the title feature, the content feature, the number feature, the unit name feature and the stage type feature corresponding to each bidding data according to a first preset splicing order to obtain a first splicing feature.
In this embodiment, the first preset splicing order may be the title feature W1, the content feature W2, the number feature W3, the unit name feature W4 and the stage type feature W5; of course, in other embodiments, the first preset splicing order may also be another order, such as the number feature W3, the unit name feature W4, the title feature W1, the content feature W2 and the stage type feature W5. This is described only schematically in this embodiment, is not limited thereto, and may be set as required in practical application.
In this embodiment, all of the above features are spliced as character strings, without conversion to vectors, and the first splicing feature is result1 = W1 + W2 + W3 + W4 + W5.
Step S82: encoding and encrypting the first splicing feature to obtain the data code of each bidding data.
In this embodiment, the encoding and encryption method is MD5 encoding; of course, in other embodiments, other methods in the prior art, such as SHA-256 or HMAC, may also be used. This is described only schematically in this embodiment, is not limited thereto, and may be set as required in practical application.
In this embodiment, the first splicing feature is result1 = W1 + W2 + W3 + W4 + W5, and the data code obtained by encoding the first splicing feature is unique_code = MD5(result1).
In the above steps, the features of all dimensions corresponding to each bidding data are spliced directly, without vector conversion, to obtain the first splicing feature, which is then encoded to obtain the data code. This encoding scheme is designed for the specific form of bidding data; the data code is stored in the database together with the data as a tag, and duplicates can subsequently be identified from the data code alone, without similarity calculation, which improves data deduplication efficiency.
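A minimal sketch of steps S81 and S82 using Python's standard `hashlib` module follows. The function name and the representation of W1 and W2 as keyword lists are assumptions; UTF-8 is chosen here as the byte encoding, which the source does not specify.

```python
import hashlib


def unique_code(w1, w2, w3, w4, w5):
    """MD5 over W1 + W2 + W3 + W4 + W5, spliced as plain strings.

    w1, w2: sorted keyword lists (title and content features).
    w3, w4, w5: bidding number, unit name and stage type strings.
    """
    # Splice the features in the first preset order, with no separator.
    result1 = "".join(w1) + "".join(w2) + w3 + w4 + w5
    # Hash the UTF-8 bytes and return the 32-character hex digest.
    return hashlib.md5(result1.encode("utf-8")).hexdigest()
```

Identical feature values always give the same code, so duplicate detection reduces to an equality check on a fixed-length string.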
As an exemplary embodiment, the step S9 of obtaining the group code corresponding to each bidding data according to the title feature, the content feature, the number feature and the unit name feature of each bidding data includes steps S91-S92.
Step S91: splicing the title feature, the content feature, the number feature and the unit name feature corresponding to each bidding data according to a second preset splicing order to obtain a second splicing feature.
In this embodiment, the second preset splicing order may be the title feature W1, the content feature W2, the number feature W3 and the unit name feature W4; of course, in other embodiments, the second preset splicing order may also be another order, such as the number feature W3, the unit name feature W4, the title feature W1 and the content feature W2. This is described only schematically in this embodiment, is not limited thereto, and may be set as required in practical application.
In this embodiment, the features are spliced as character strings, without conversion to vectors, and the second splicing feature is result2 = W1 + W2 + W3 + W4.
Step S92: encoding and encrypting the second splicing feature to obtain the group code of each bidding data.
In this embodiment, the encoding and encryption method is MD5 encoding; of course, in other embodiments, other methods in the prior art, such as SHA-256 or HMAC, may also be used. This is described only schematically in this embodiment, is not limited thereto, and may be set as required in practical application.
In this embodiment, the second splicing feature is result2 = W1 + W2 + W3 + W4, and the group code obtained by encoding the second splicing feature is group_code = MD5(result2). Because the group code does not take the stage type feature W5 of the bid into account, it places all records of the same bid into one group; when the group code group_code and the data code unique_code are used in combination, the life cycle of the same bid can be shown.
In the above steps, the features corresponding to each bidding data are spliced directly, without vector conversion, to obtain the second splicing feature, which is then encoded to obtain the group code. The group code does not consider the stage type feature of the bid; it is stored in the database together with the data as a tag, and the different stages of the same bid can subsequently be viewed according to the group code.
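The relationship between the two codes can be illustrated with a short sketch. The record field names (`title_kw`, `content_kw`, `number`, `unit`, `stage`) and the helper names are hypothetical.

```python
import hashlib


def _md5(parts):
    """MD5 hex digest of the parts spliced as one string."""
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()


def mark_codes(record):
    """Return both deduplication mark codes for one bidding record."""
    base = [
        "".join(record["title_kw"]),    # W1
        "".join(record["content_kw"]),  # W2
        record["number"],               # W3
        record["unit"],                 # W4
    ]
    return {
        "group_code": _md5(base),                        # MD5(W1+W2+W3+W4)
        "unique_code": _md5(base + [record["stage"]]),   # MD5(W1+...+W5)
    }
```

Two records that differ only in stage type share a group_code but receive different unique_code values, which is what allows the stages of one bid to be grouped without being treated as duplicates.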
As an exemplary embodiment, steps S11-S12 are included after the step of obtaining the de-duplication label code corresponding to each bidding data according to the data encoding and group encoding of each bidding data in step S10.
Step S11: acquiring a deduplication requirement.
In this embodiment, the deduplication requirement is determined according to customer requirements. Specifically, the deduplication requirement may be deduplication according to the data code in the deduplication mark code, or deduplication according to both the group code and the data code in the deduplication mark code.
Step S12: deduplicating the bidding data according to the deduplication requirement and the deduplication mark code corresponding to each piece of bidding data to obtain the deduplicated bidding data.
In this embodiment, all bidding data in the bidding database are tagged with two codes, unique_code and group_code.
When the deduplication requirement is deduplication according to the data code in the deduplication mark code, only the data code is used during data deduplication, that is, the bidding data are deduplicated using the data code alone.
When the deduplication requirement is deduplication according to the group code and the data code in the deduplication mark code, both codes are used during data deduplication. The group code group_code is first used for grouping, so that data belonging to the same bid are found and the different stages of the same bid fall into one group; within each group, the data are then grouped by the data code unique_code, which shows the life cycle of the bid.
In the above steps, the bidding data are deduplicated according to the deduplication requirement and the deduplication mark code corresponding to each bidding data, which improves the flexibility and diversity of data processing.
As an exemplary embodiment, when the deduplication requirement is deduplication according to the data code in the deduplication mark code, step S12 of deduplicating the bidding data according to the deduplication requirement and the deduplication mark code corresponding to each piece of bidding data, to obtain the deduplicated bidding data, includes steps S121 to S124.
Step S121: sorting the bidding data by data code.
In this embodiment, bidding data with the same data code are duplicates; the data codes of the bidding data are sorted to find the bidding data that share a data code, so that those data can be deduplicated.
Step S122: acquiring the collection and storage time of each bidding data.
In this embodiment, the collection and storage time of each bidding data is recorded when the data are collected and stored in the database, and duplicate bidding data are removed according to this time.
Step S123: sorting the bidding data with the same data code by collection and storage time.
In this embodiment, the bidding data with the same data code are sorted by collection and storage time; the order may be from earliest to latest or from latest to earliest, and may be set as required.
Step S124: among the bidding data with the same data code, retaining the bidding data with the earliest collection and storage time, and taking the retained bidding data as the deduplicated bidding data.
In this embodiment, among the bidding data having the same data code, the bidding data with the earliest collection and storage time is retained and the other duplicates are removed; the retained bidding data is the deduplicated bidding data.
In the above steps, bidding data with the same data code are deduplicated according to the collection and storage time, and the bidding data with the earliest collection and storage time is taken as the deduplicated bidding data, which removes the duplicate data.
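Steps S121 to S124 can be sketched with a single sort followed by grouping. The record layout (dictionaries with `unique_code` and `stored_at` keys) is hypothetical, and `stored_at` is assumed to hold values that sort chronologically, such as ISO 8601 date strings.

```python
from itertools import groupby
from operator import itemgetter


def dedup_by_unique_code(records):
    """Keep the earliest-stored record for each unique_code."""
    # S121 + S123: sort by data code, then by collection and storage time.
    ordered = sorted(records, key=itemgetter("unique_code", "stored_at"))
    # S124: the first record of each run of equal codes is the earliest one.
    return [next(group) for _, group in groupby(ordered, key=itemgetter("unique_code"))]
```

Sorting on the pair (unique_code, stored_at) places duplicates next to each other with the earliest first, so no pairwise similarity comparison is needed.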
A detailed description is given below using a specific example. As shown in fig. 2, fig. 2 shows the flow in which a text is parsed, transcoded and stored in the database after being input.
First, multiple dimensions are determined for judging whether bids are duplicates; whether two bids have the same content is judged from several features of the bid content. Considering the characteristics of bids, the bidding number of each bid is unique, so the bidding number is selected as the judgment condition of the first dimension.
The second dimension is selected from the text content, starting with the title. Analysis of the bidding data shows that duplicates fall into two cases: data that are completely identical, and data that belong to different stages of the same bid. In the latter case the data concern the same bid, and the bidding number and other related contents are the same, but they cannot be counted as duplicates in business processing; a second tag code, group_code, is therefore introduced, and group_code is used to relate the different life cycle stages of the same bid so as to provide more value in subsequent data display. Returning to the second dimension: in view of the stage characteristics of bids, keywords related to the stage are removed when the title is processed, using a stage category dictionary table obtained earlier through data statistics (divided into 20 types, with major categories such as bidding announcement and bidding result, each subdivided into consultation, negotiation, inquiry, query, correction, deal, bidding and so on). After the stage keywords are removed, the title still cannot be transcoded directly as a whole, because bid information collected from different websites may differ by one or more added or deleted words while actually describing the same bid. Such duplicates account for a relatively large proportion of the actual database, so the second processing step for the title is keyword extraction: three keywords are extracted from the title using TFIDF and used as the title feature item to be transcoded, which shields the word noise introduced by different websites.
The third dimension is the body text of the bid. Again because of how the data are collected, body texts that describe the same bid are not identical as whole articles; for example, the typesetting format or the header and ending may differ. In addition, bidding content is generally about 500 words long, so keywords can be used to represent the bidding content.
The fourth dimension is the item type, that is, the bidding stage type; when the item types are the same, this helps to judge whether two records are the same bid.
The fifth dimension is the bidding unit; the extracted bidding unit is used as a judgment basis, since the bidding unit of the same bid is generally the same organization.
The above five dimensions are used to judge whether bids are duplicates.
Specifically, the stage type words are removed from the title according to the stage type dictionary, three keywords are then extracted using TFIDF, and the resulting list is output as W1. The bid content is segmented into words and the stop words are removed; a bidding-related dictionary (obtained through data statistics) is added during word segmentation so that the segmented words have strong bidding characteristics. The word frequency is then calculated and the top 5 words by frequency are taken; the words are sorted by word length and the top 10 are taken; finally, the words that appear in both the frequency top 5 and the length top 10 are put into a list as the content keywords and output as W2. The bidding number is W3 and the bidding unit is W4. Finally, the item type is output as W5; it is obtained by looking up, in the type dictionary table, the type keywords retained when they were removed from the title.
All of the above features are spliced as character strings, without conversion to vectors: result1 = W1 + W2 + W3 + W4 + W5, and unique_code = MD5(result1). Similarly, result2 = W1 + W2 + W3 + W4, and group_code = MD5(result2). That is, the group code does not take the stage type W5 of the bid into account and places all records of the same bid into one group.
All data in the database are tagged with the two codes unique_code and group_code. If grouping is performed by unique_code only, the data with the same unique_code in the database fall into one group, and one of them is selected according to time to complete the deduplication.
If group_code is also used, the life cycle of the same bid can be shown: the data are first grouped by group_code, which places the different stages of the current bid into one group, and the data within each group are then grouped by unique_code to show the life cycle of the bid.
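The two-level grouping can be sketched as follows. The record layout and the function name `lifecycle_groups` are hypothetical, and keeping the earliest record per unique_code within a group is an assumption carried over from the time-based selection described for unique_code deduplication.

```python
from collections import defaultdict


def lifecycle_groups(records):
    """Map each group_code to its deduplicated stage records, oldest first."""
    groups = defaultdict(dict)
    for rec in sorted(records, key=lambda r: r["stored_at"]):
        # First level: group_code gathers all stages of one bid.
        # Second level: setdefault keeps only the earliest record per unique_code.
        groups[rec["group_code"]].setdefault(rec["unique_code"], rec)
    return {code: list(stages.values()) for code, stages in groups.items()}
```

Each value in the returned mapping is a chronological list of distinct stages for one bid, which can be displayed directly as that bid's life cycle.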
In this embodiment, a data deduplication mark code generation system is further provided. The system is used to implement the foregoing embodiments and preferred embodiments, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the system described in the embodiments below is preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
The data deduplication mark code generation system provided in this embodiment, as shown in fig. 3, includes:
a first obtaining module 1, configured to obtain a bid data set, where the bid data set includes a plurality of collected bid data;
the first processing module 2 is used for obtaining the bid title, the bid content, the bid number, the bid unit name and the bid stage type of each bid data according to the bid data set;
the second processing module 3 is configured to obtain a title feature corresponding to each bidding data according to the bidding title of each bidding data;
the third processing module 4 is configured to obtain a content feature corresponding to each bidding data according to the bidding content of each bidding data;
the fourth processing module 5 is configured to obtain a numbering feature corresponding to each bidding data according to the bidding number of each bidding data;
the fifth processing module 6 is configured to obtain a unit name feature corresponding to each bidding data according to the bidding unit name of each bidding data;
a sixth processing module 7, configured to obtain, according to the bidding stage type of each bidding data, a stage type feature corresponding to each bidding data;
the seventh processing module 8 is configured to obtain a data code corresponding to each bidding data according to the title feature, the content feature, the number feature, the unit name feature, and the stage type feature of each bidding data;
an eighth processing module 9, configured to obtain a group code corresponding to each bidding data according to the title feature, the content feature, the number feature, and the unit name feature of each bidding data;
and a ninth processing module 10, configured to obtain, according to the data code and the group code of each bidding data, a deduplication identifier corresponding to each bidding data.
As an exemplary embodiment, the second processing module includes: a first acquisition unit, configured to acquire a preset stage category dictionary; a first processing unit, configured to remove the stage-type words from the bidding title of each bidding data according to the preset stage category dictionary; a second processing unit, configured to segment the bidding titles from which the stage-type words have been removed, to obtain the title segmented words corresponding to each bidding data; a third processing unit, configured to calculate the TF-IDF value of each segmented word in the title segmented words corresponding to each bidding data; a fourth processing unit, configured to take the first preset number of segmented words with the highest TF-IDF values as title extraction keywords; and a fifth processing unit, configured to sort the title extraction keywords according to a first preset order to obtain title sorted keywords, and to take the title sorted keywords as the title feature corresponding to each bidding data.
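The title-feature units above can be illustrated with a small standard-library sketch. Several points are assumptions, since the text does not fix them: the stage dictionary is taken to map each stage word to a stage type, titles are assumed to be segmented by an upstream tokenizer, the smoothed IDF formula `log((1 + N) / (1 + df))` is only one common variant, and the "first preset order" is taken to be alphabetical. The names `strip_stage_words` and `title_keywords` are hypothetical.

```python
import math
from collections import Counter


def strip_stage_words(title, stage_dict):
    """Remove stage-type words; return (cleaned_title, stage_type or None)."""
    stage_type = None
    # Try longer entries first so a short entry cannot split a longer one.
    for word in sorted(stage_dict, key=len, reverse=True):
        if word in title:
            title = title.replace(word, "")
            if stage_type is None:
                stage_type = stage_dict[word]
    return title, stage_type


def title_keywords(doc_tokens, corpus_tokens, top_k=3):
    """Top-k tokens of one title by TF-IDF against a corpus of tokenized titles."""
    n_docs = len(corpus_tokens)
    # Document frequency: in how many titles each token occurs.
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))
    tf = Counter(doc_tokens)
    scores = {
        word: (count / len(doc_tokens)) * math.log((1 + n_docs) / (1 + df[word]))
        for word, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    # Sort the selected keywords so the later string splice is order-stable.
    return sorted(ranked[:top_k])
```

Returning the stage type from the same pass that strips the stage words mirrors the description's remark that the stage type comes from the dictionary table retained during removal.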
As an exemplary embodiment, the third processing module includes: a sixth processing unit, configured to segment the bidding content of each bidding data to obtain content segmented words; a seventh processing unit, configured to remove stop words from the content segmented words according to a preset stop word dictionary; an eighth processing unit, configured to perform word frequency statistics on the content segmented words from which the stop words have been removed, and to take a second preset number of content segmented words with the highest word frequency as first content keywords; a ninth processing unit, configured to sort the content segmented words from which the stop words have been removed by word length, and to take a third preset number of content segmented words with the greatest word length as second content keywords; a tenth processing unit, configured to take the keywords that appear in both the first content keywords and the second content keywords as content extraction keywords; and an eleventh processing unit, configured to sort the content extraction keywords according to a second preset order to obtain content sorted keywords, and to take the content sorted keywords as the content feature corresponding to each bidding data.
As an exemplary embodiment, the seventh processing module includes: a twelfth processing unit, configured to splice the title feature, the content feature, the number feature, the unit name feature and the stage type feature corresponding to each bidding data according to a first preset splicing order to obtain a first splicing feature; and a thirteenth processing unit, configured to encode and encrypt the first splicing feature to obtain the data code of each bidding data.
As an exemplary embodiment, the eighth processing module includes: a fourteenth processing unit, configured to splice the title feature, the content feature, the number feature and the unit name feature corresponding to each bidding data according to a second preset splicing order to obtain a second splicing feature; and a fifteenth processing unit, configured to encode and encrypt the second splicing feature to obtain the group code of each bidding data.
As an exemplary embodiment, the system further comprises: a second acquisition module, configured to acquire a deduplication requirement; and a tenth processing module, configured to perform deduplication processing on the bidding data according to the deduplication requirement and the deduplication mark code corresponding to each bidding data, to obtain deduplicated bidding data.
As an exemplary embodiment, when the deduplication requirement is deduplication according to the data code in the deduplication mark code, the tenth processing module includes: a sixteenth processing unit, configured to sort the bidding data by code according to the data codes; a second acquisition unit, configured to acquire the collection and storage time of each bidding data; a seventeenth processing unit, configured to sort the bidding data having the same data code by time according to the collection and storage time; and an eighteenth processing unit, configured to retain, among the bidding data having the same data code, the bidding data with the earliest collection and storage time, and to take it as the deduplicated bidding data.
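The sort-then-retain procedure above can be sketched in a few lines. Records are assumed to be dictionaries carrying a `unique_code` (the data code) and a `stored_at` value that compares in time order, such as an ISO 8601 date string or a `datetime`; both field names and the function name `deduplicate` are hypothetical.

```python
def deduplicate(records):
    """Keep, for each data code, the record with the earliest storage time."""
    # Sort by data code first, then by collection/storage time.
    ordered = sorted(records, key=lambda r: (r["unique_code"], r["stored_at"]))
    kept = {}
    for rec in ordered:
        # After sorting, the first record seen for a code is the earliest one;
        # setdefault ignores every later record with the same code.
        kept.setdefault(rec["unique_code"], rec)
    return list(kept.values())
```

For large tables the same effect is usually obtained in the database itself, for example by grouping on the data code and selecting the minimum storage time.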
The data deduplication mark code generation system in this embodiment is presented in the form of a functional unit, where the unit refers to an ASIC circuit, a processor and a memory that execute one or more software or fixed programs, and/or other devices that may provide the above-described functionality.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device, as shown in fig. 4, the electronic device includes one or more processors 71 and a memory 72, where one processor 71 is taken as an example in fig. 4.
The electronic device may further include: an input device 73 and an output device 74.
The processor 71, the memory 72, the input device 73 and the output device 74 may be connected by a bus or other means, as exemplified by the bus connection in fig. 4.
The processor 71 may be a Central Processing Unit (CPU). The Processor 71 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or combinations thereof. A general purpose processor may be a microprocessor or any conventional processor or the like.
The memory 72 is a non-transitory computer-readable storage medium and can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the data deduplication mark code generation method in the embodiment of the present application. By running the non-transitory software programs, instructions, and modules stored in the memory 72, the processor 71 executes the various functional applications and data processing of the server, that is, implements the data deduplication mark code generation method of the above method embodiment.
The memory 72 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a processing device operated by the server, and the like. Further, the memory 72 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 72 may optionally include memory located remotely from the processor 71, which may be connected to a network connection device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 73 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the processing device of the server. The output device 74 may include a display device such as a display screen.
One or more modules are stored in the memory 72, which when executed by the one or more processors 71 perform the method shown in FIG. 1.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by instructing relevant hardware through a computer program, and the executed program may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the data deduplication mark code generation method described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kind described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A method for generating a data deduplication mark code is characterized by comprising the following steps:
acquiring a bidding data set, wherein the bidding data set comprises a plurality of collected bidding data;
obtaining a bidding title, bidding content, a bidding number, a bidding unit name and a bidding stage type of each bidding data according to the bidding data set;
obtaining a title feature corresponding to each bidding data according to the bidding title of each bidding data;
obtaining a content feature corresponding to each bidding data according to the bidding content of each bidding data;
obtaining a number feature corresponding to each bidding data according to the bidding number of each bidding data;
obtaining a unit name feature corresponding to each bidding data according to the bidding unit name of each bidding data;
obtaining a stage type feature corresponding to each bidding data according to the bidding stage type of each bidding data;
obtaining a data code corresponding to each bidding data according to the title feature, the content feature, the number feature, the unit name feature and the stage type feature of each bidding data;
obtaining a group code corresponding to each bidding data according to the title feature, the content feature, the number feature and the unit name feature of each bidding data;
and obtaining the deduplication mark code corresponding to each bidding data according to the data code and the group code of each bidding data.
2. The method for generating a data deduplication mark code according to claim 1, wherein the step of obtaining the title feature corresponding to each bidding data according to the bidding title of each bidding data includes:
acquiring a preset stage category dictionary;
removing the stage type words in the bidding titles of each bidding data according to a preset stage category dictionary;
segmenting the bidding titles from which the stage type words have been removed, to obtain title segmented words corresponding to each bidding data;
respectively calculating the TF-IDF value of each segmented word in the title segmented words corresponding to each bidding data;
taking a first preset number of segmented words with the highest TF-IDF values as title extraction keywords;
and sorting the title extraction keywords according to a first preset order to obtain title sorted keywords, and taking the title sorted keywords as the title feature corresponding to each bidding data.
3. The method for generating a data deduplication mark code according to claim 1, wherein the step of obtaining the content feature corresponding to each bidding data according to the bidding content of each bidding data includes:
segmenting the bidding content of each bidding data respectively to obtain content segmented words;
removing stop words from the content segmented words according to a preset stop word dictionary;
performing word frequency statistics on the content segmented words from which the stop words have been removed, and taking a second preset number of content segmented words with the highest word frequency as first content keywords;
sorting the content segmented words from which the stop words have been removed by word length, and taking a third preset number of content segmented words with the greatest word length as second content keywords;
taking the keywords that appear in both the first content keywords and the second content keywords as content extraction keywords;
and sorting the content extraction keywords according to a second preset order to obtain content sorted keywords, and taking the content sorted keywords as the content feature corresponding to each bidding data.
4. The method for generating a data deduplication mark code according to claim 1, wherein the step of obtaining the data code corresponding to each bidding data according to the title feature, the content feature, the number feature, the unit name feature and the stage type feature of each bidding data comprises:
splicing the title feature, the content feature, the number feature, the unit name feature and the stage type feature corresponding to each bidding data according to a first preset splicing order to obtain a first splicing feature;
and encoding and encrypting the first splicing feature to obtain the data code of each bidding data.
5. The method for generating a data deduplication mark code according to claim 1, wherein the step of obtaining the group code corresponding to each bidding data according to the title feature, the content feature, the number feature, and the unit name feature of each bidding data includes:
splicing the title feature, the content feature, the number feature and the unit name feature corresponding to each bidding data according to a second preset splicing order to obtain a second splicing feature;
and encoding and encrypting the second splicing feature to obtain the group code of each bidding data.
6. The method for generating a data deduplication mark code according to claim 1, wherein the step of obtaining the deduplication mark code corresponding to each bidding data according to the data code and the group code of each bidding data further comprises:
acquiring a deduplication requirement;
and performing deduplication processing on the bidding data according to the deduplication requirement and the deduplication mark code corresponding to each bidding data, to obtain deduplicated bidding data.
7. The method for generating a data deduplication mark code according to claim 6, wherein when the deduplication requirement is deduplication according to the data code in the deduplication mark code, the step of performing deduplication processing on the bidding data according to the deduplication requirement and the deduplication mark code corresponding to each bidding data to obtain the deduplicated bidding data comprises:
sorting the bidding data by code according to the data codes;
acquiring the collection and storage time of each bidding data;
sorting the bidding data having the same data code by time according to the collection and storage time;
and retaining, among the bidding data having the same data code, the bidding data with the earliest collection and storage time, and taking it as the deduplicated bidding data.
8. A data deduplication mark code generation system, comprising:
a first acquisition module, used for acquiring a bidding data set, wherein the bidding data set comprises a plurality of collected bidding data;
the first processing module is used for obtaining the bid title, the bid content, the bid number, the bid unit name and the bid stage type of each bid data according to the bid data set;
the second processing module is used for obtaining the title feature corresponding to each bidding data according to the bidding title of each bidding data;
the third processing module is used for obtaining the content feature corresponding to each bidding data according to the bidding content of each bidding data;
the fourth processing module is used for obtaining the number feature corresponding to each bidding data according to the bidding number of each bidding data;
the fifth processing module is used for obtaining the unit name feature corresponding to each bidding data according to the bidding unit name of each bidding data;
the sixth processing module is used for obtaining the stage type feature corresponding to each bidding data according to the bidding stage type of each bidding data;
the seventh processing module is used for obtaining a data code corresponding to each bidding data according to the title feature, the content feature, the number feature, the unit name feature and the stage type feature of each bidding data;
the eighth processing module is used for obtaining a group code corresponding to each bidding data according to the title feature, the content feature, the numbering feature and the unit name feature of each bidding data;
and the ninth processing module is used for obtaining the deduplication mark code corresponding to each bidding data according to the data code and the group code of each bidding data.
9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the data deduplication mark code generation method of any of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the data deduplication mark code generation method according to any one of claims 1 to 7.
CN202110996617.3A 2021-08-27 2021-08-27 Data deduplication marking code generation method, system, electronic equipment and storage medium Active CN113627132B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110996617.3A CN113627132B (en) 2021-08-27 2021-08-27 Data deduplication marking code generation method, system, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113627132A (en) 2021-11-09
CN113627132B (en) 2024-04-02

Family

ID=78388167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110996617.3A Active CN113627132B (en) 2021-08-27 2021-08-27 Data deduplication marking code generation method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113627132B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240305

Address after: Building 16, Big Data Industrial Park, No. 868 Qinghe Road, Luyang Economic and Technological Development Zone, Hefei City, Anhui Province, 230000

Applicant after: Smart Starlight (Anhui) Technology Co.,Ltd.

Country or region after: China

Address before: 100080 area a, 22 / F, block a, 8 Haidian Street, Haidian District, Beijing

Applicant before: BEIJING SMART STARLIGHT INFORMATION TECHNOLOGY CO.,LTD.

Country or region before: China

GR01 Patent grant