CN112115236A

CN112115236A - Method and device for constructing tobacco scientific and technical literature data deduplication model

Info

Publication number: CN112115236A
Application number: CN202011070240.0A
Authority: CN
Inventors: 闫爱华; 张胜华; 唐敏; 李琳; 黎根; 赵鹏
Original assignee: China Tobacco Hubei Industrial LLC
Current assignee: China Tobacco Hubei Industrial LLC
Priority date: 2020-10-09
Filing date: 2020-10-09
Publication date: 2020-12-22
Anticipated expiration: 2040-10-09
Also published as: CN112115236B

Abstract

An embodiment of the invention provides a method and a device for constructing a tobacco scientific and technical literature data deduplication model. The method comprises the following steps: acquiring literature data of tobacco science and technology, and deduplicating the literature data to obtain a deduplicated original data set; sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set according to the pairing result; sampling second literature data from the original data set, acquiring a synonym matching standard for the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a duplicate data set by pairing the second literature data with the third literature data; and performing model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model for the tobacco science and technology literature data. The deduplication model constructed by this method can deduplicate tobacco scientific and technical literature data more efficiently.

Description

Method and device for constructing tobacco scientific and technical literature data deduplication model

Technical Field

The invention relates to the technical field of tobacco scientific and technological data processing, in particular to a method and a device for constructing a tobacco scientific and technological literature data deduplication model.

Background

In recent years, as informatization in the tobacco science and technology field has deepened, a growing number of tobacco-related departments have accumulated a large amount of tobacco science and technology literature data. However, because the systems of the departments, enterprises and business units participating in tobacco informatization lack unified standards, the data quality of tobacco scientific and technical literature information faces a great challenge.

One of the main data quality problems for tobacco scientific and technical literature data is data duplication, for example the same literature record entered more than once, or the same literature organized according to different format specifications. To make better use of literature data in the tobacco science and technology field, an effective data deduplication method is needed.

Currently, conventional methods for deduplicating literature data records include duplicate determination by comparing literature record IDs, by comparing keyword lists and abstract information, and the like. However, these are general-purpose duplicate determination methods; when applied to tobacco science and technology literature they are not tailored to the field, and their duplicate determination efficiency is low.

Disclosure of Invention

Aiming at the problems in the prior art, the embodiment of the invention provides a method for constructing a tobacco science and technology literature data deduplication model.

The embodiment of the invention provides a method for constructing a tobacco scientific and technical literature data deduplication model, which comprises the following steps:

acquiring literature data of tobacco science and technology, and deduplicating the literature data to obtain a deduplicated original data set;

sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set according to the pairing result;

sampling second literature data from the original data set, acquiring a synonym matching standard for the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a duplicate data set by pairing the second literature data with the third literature data;

and performing model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model for the tobacco science and technology literature data.

In one embodiment, the method further comprises:

and acquiring the keywords in the second literature data, and replacing the keywords with synonyms to obtain the third literature data.

In one embodiment, the method further comprises:

detecting whether the data volumes of the deduplication data set and the duplicate data set reach a preset data volume standard;

and when the data volumes of the deduplication data set and the duplicate data set do not reach the preset data volume standard, repeating the steps of constructing the deduplication data set and the duplicate data set until the data volumes reach the preset data volume standard.

In one embodiment, the method further comprises:

and acquiring a preset data weight table, and distributing weights to the data in the original data set according to the data weight table to obtain the original data set after weight distribution, wherein the weights of the original data set are used for adjusting the sampling probability when the original data set is sampled.

In one embodiment, the method further comprises:

receiving data to be deduplicated, and randomly selecting data from the original data set to be paired with the data to be deduplicated to obtain paired data to be deduplicated;

inputting the paired data to be deduplicated into the deduplication model to obtain the repetition probability of the data to be deduplicated;

and acquiring a repetition judgment threshold, comparing the repetition probability with the repetition judgment threshold, and judging whether the data to be deduplicated are repeated or not according to a comparison result.

In one embodiment, the method further comprises:

marking the literature data in the deduplication data set as 0, and marking the literature data in the duplicate data set as 1;

wherein performing model training on the deduplication data set and the duplicate data set through a neural network model comprises:

performing model training on the literature data and marks of the deduplication data set and the literature data and marks of the duplicate data set through a neural network model.

The embodiment of the invention provides a device for constructing a tobacco scientific and technical literature data deduplication model, which comprises:

the acquisition module is used for acquiring literature data of tobacco science and technology, and deduplicating the literature data to obtain a deduplicated original data set;

the first sampling module is used for sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set according to the pairing result;

the second sampling module is used for sampling second literature data from the original data set, acquiring a synonym matching standard for the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a duplicate data set by pairing the second literature data with the third literature data;

and the training module is used for performing model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model for the tobacco science and technology literature data.

In one embodiment, the apparatus further comprises:

and the second acquisition module is used for acquiring the keywords in the second document data and replacing the keywords with synonyms to obtain the third document data.

The embodiment of the invention provides electronic equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the steps of the method for constructing the tobacco scientific and technical literature data deduplication model.

An embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for constructing the deduplication model of the tobacco scientific literature data.

According to the method and the device for constructing the tobacco science and technology literature data deduplication model provided by the embodiments of the invention, literature data of tobacco science and technology is acquired and deduplicated to obtain a deduplicated original data set; first literature data is sampled from the original data set and paired, and a deduplication data set is constructed according to the pairing result; second literature data is sampled from the original data set, a synonym matching standard for the literature data is acquired, the second literature data is converted into third literature data through the synonym matching standard, and a duplicate data set is constructed by pairing the second literature data with the third literature data; and model training is performed on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model for the tobacco science and technology literature data. A method for constructing a deduplication model for tobacco science and technology literature data is thus provided, so that subsequent deduplication of such data can be carried out more efficiently.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of a method for constructing a deduplication model of tobacco scientific literature data according to an embodiment of the present invention;

FIG. 2 is a block diagram of a device for constructing a de-duplication model of tobacco scientific and technical literature data according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an electronic device in an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic flow chart of a method for constructing a tobacco scientific and technical literature data deduplication model according to an embodiment of the present invention, and as shown in fig. 1, an embodiment of the present invention provides a method for constructing a tobacco scientific and technical literature data deduplication model, including:

Step S101, acquiring literature data of tobacco science and technology, and deduplicating the literature data to obtain a deduplicated original data set.

Specifically, literature data used for constructing the deduplication model is acquired from a database of tobacco science and technology literature data, and the acquired literature data is deduplicated, where deduplication may consist of deleting identical records, so that literature data without duplicate records is obtained and recorded as the original data set.
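As a non-limiting illustration, the exact-match deduplication of step S101 can be sketched as follows. The sketch assumes each literature record is a plain string; the function name `deduplicate_exact` and the sample records are hypothetical and are not taken from the embodiment.

```python
def deduplicate_exact(records):
    """Keep the first occurrence of each record and drop identical repeats."""
    seen = set()
    original_data_set = []
    for record in records:
        # Only byte-for-byte identical records are treated as duplicates here;
        # near-duplicates are left for the trained deduplication model.
        if record not in seen:
            seen.add(record)
            original_data_set.append(record)
    return original_data_set


# Hypothetical sample records, for illustration only.
records = [
    "Effect of curing temperature on flue-cured tobacco aroma",
    "Nicotine content analysis by gas chromatography",
    "Effect of curing temperature on flue-cured tobacco aroma",
]
original_data_set = deduplicate_exact(records)
```

Using a set for membership checks keeps the pass linear in the number of records while preserving the original record order.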

Step S102, sampling first literature data from the original data set, pairing the first literature data, and constructing a duplicate removal data set according to the pairing result.

Specifically, first literature data is randomly sampled from the original data set, and the sampled first literature data is paired with each other; for example, when two pieces of first literature data are sampled, the two pieces are paired. Because the original data set has already been deduplicated, each such pair is a non-duplicate pair, and the deduplication data set is constructed from the pairing results.
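The pairing of step S102 might be sketched as below, under the assumption that each pair is stored as a tuple ending in the label 0 (non-duplicate). The helper name `sample_non_duplicate_pair` and the placeholder records are hypothetical.

```python
import random


def sample_non_duplicate_pair(original_data_set, rng):
    """Draw two different records and return them as a pair labelled 0."""
    # rng.sample draws without replacement, so the same entry is never paired
    # with itself; the original data set is assumed to contain no repeats.
    first, second = rng.sample(original_data_set, 2)
    return (first, second, 0)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
original_data_set = ["record A", "record B", "record C", "record D"]
deduplication_data_set = [
    sample_non_duplicate_pair(original_data_set, rng) for _ in range(3)
]
```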

Step S103, sampling second literature data from the original data set, acquiring a synonymous matching standard of the literature data, converting the second literature data into third literature data through the synonymous matching standard, and establishing a repeated data set through pairing the second literature data and the third literature data.

Specifically, second literature data is obtained by random sampling from the original data set, the synonym matching standard is then applied to the second literature data to convert it into third literature data, and the second literature data and the third literature data, that is, the records before and after synonym conversion, are paired to obtain the duplicate data set.

In addition, the synonym matching standard may be as follows: keywords in the second literature data are obtained, and each keyword is replaced with a synonym from a tobacco science and technology synonym database to obtain the third literature data. The keywords may be selected from: a) the tobacco section of the Chinese Classified Thesaurus; b) the tobacco section of the Chinese Thesaurus; c) the tobacco section of the International Convention on the Harmonized Commodity Description and Coding System; d) the wording of official documents of the tobacco industry; e) national standards and industry standards related to the tobacco industry; f) terms publicly released by tobacco-related enterprises, research institutions and societies, and the like.
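By way of example only, the synonym conversion of step S103 can be sketched as a table lookup. The table `SYNONYMS` below is a hypothetical stand-in for the tobacco synonym database, and its two entries are illustrative rather than taken from any of the listed sources; the function names are likewise assumptions.

```python
import re

# Hypothetical synonym table standing in for the tobacco synonym database.
SYNONYMS = {
    "flue-cured tobacco": "Virginia tobacco",
    "cut tobacco": "shredded tobacco",
}


def to_third_document(second_document, synonyms):
    """Replace every known keyword in the record with its synonym."""
    if not synonyms:
        return second_document
    # Longest keywords first, so a longer term wins over a term it contains.
    keywords = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    return pattern.sub(lambda match: synonyms[match.group(0)], second_document)


def make_duplicate_pair(second_document, synonyms):
    """Pair a record with its synonym-converted version, labelled 1."""
    return (second_document, to_third_document(second_document, synonyms), 1)


pair = make_duplicate_pair(
    "Moisture control of cut tobacco in flue-cured tobacco processing", SYNONYMS
)
```

A single regular-expression pass is used so that text inserted by one replacement is never rewritten by a later one.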

Step S104, performing model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model for the tobacco science and technology literature data.

Specifically, the deduplication data set and the duplicate data set obtained in the above steps are used to train a neural network model. The specific training steps may be, for example: marking the literature data pairs in the deduplication data set as 0 and the literature data pairs in the duplicate data set as 1; and then constructing a twin (Siamese) neural network model for judging whether two literature data records are duplicate records, wherein the training input of the neural network model is the literature data record pairs labeled 0 and 1 respectively, and the output is a value in [0,1] representing the probability that the two input literature records are duplicate records.
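The embodiment does not fix the network architecture. The following is a deliberately simplified stand-in that keeps only the twin structure: one shared, fixed encoder (a hashed bag of character trigrams) is applied to both records, and a single trainable logistic layer over the element-wise absolute difference of the two encodings outputs a value in [0,1]. It is not a full neural network, and every name, the feature size, and the training settings are assumptions made for illustration.

```python
import math
import zlib

DIM = 64  # length of the hashed feature vector (illustrative choice)


def encode(text):
    """Shared encoder: hashed bag of character trigrams, L2-normalised."""
    vec = [0.0] * DIM
    padded = " " + text.lower() + " "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3].encode("utf-8")
        vec[zlib.crc32(trigram) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def pair_features(record_a, record_b):
    """Apply the same encoder to both records and compare the encodings."""
    enc_a, enc_b = encode(record_a), encode(record_b)
    return [abs(x - y) for x, y in zip(enc_a, enc_b)]


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, record_a, record_b):
    """Return a value in [0, 1] read as the duplicate probability."""
    feats = pair_features(record_a, record_b)
    return sigmoid(bias + sum(w * f for w, f in zip(weights, feats)))


def train(labelled_pairs, epochs=200, learning_rate=0.5):
    """Fit the logistic layer by stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * DIM, 0.0
    featurised = [(pair_features(a, b), label) for a, b, label in labelled_pairs]
    for _ in range(epochs):
        for feats, label in featurised:
            p = sigmoid(bias + sum(w * f for w, f in zip(weights, feats)))
            error = p - label  # gradient of the log loss w.r.t. the logit
            bias -= learning_rate * error
            weights = [w - learning_rate * error * f for w, f in zip(weights, feats)]
    return weights, bias


# Hypothetical training pairs: label 0 = non-duplicate, label 1 = duplicate.
training_pairs = [
    ("Nicotine analysis by gas chromatography", "Curing barn airflow design", 0),
    ("Cigarette paper permeability", "Leaf grading by machine vision", 0),
    ("Moisture of cut tobacco", "Moisture of shredded tobacco", 1),
    ("Aroma of flue-cured tobacco", "Aroma of Virginia tobacco", 1),
]
weights, bias = train(training_pairs)
probability = predict(weights, bias, "Moisture of cut tobacco", "Moisture of shredded tobacco")
```

In practice the shared encoder would itself be a trainable network; it is held fixed here only so that the sketch stays short and dependency-free.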

According to the method for constructing the tobacco science and technology literature data deduplication model provided by the embodiment of the invention, literature data of tobacco science and technology is acquired and deduplicated to obtain a deduplicated original data set; first literature data is sampled from the original data set and paired, and a deduplication data set is constructed according to the pairing result; second literature data is sampled from the original data set, a synonym matching standard for the literature data is acquired, the second literature data is converted into third literature data through the synonym matching standard, and a duplicate data set is constructed by pairing the second literature data with the third literature data; and model training is performed on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model for the tobacco science and technology literature data. A method for constructing a deduplication model for tobacco science and technology literature data is thus provided, so that subsequent deduplication of such data can be carried out more efficiently.

On the basis of the above embodiment, the method for constructing the tobacco science and technology literature data deduplication model further includes:

In the embodiment of the present invention, before model training is performed on the deduplication data set and the duplicate data set through the neural network model, it may further be detected whether the data volumes of the deduplication data set and the duplicate data set reach a preset data volume standard. For example, when it is detected that the data volumes are not sufficient to train a model with high reliability, the steps of constructing the deduplication data set and the duplicate data set are repeated until the data volumes of both data sets reach the preset data volume standard, that is, until a model with high reliability can be constructed, and model training is then performed through the neural network model.

According to the embodiment of the invention, whether a model with high reliability can be constructed is detected, and when the data volume reaches the standard, the model is constructed, so that the reliability of the model is ensured.
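The data volume check can be sketched as a loop that keeps constructing pairs until both data sets are large enough. The sketch assumes the preset data volume standard is a single minimum pair count per data set; the two pair builders are hypothetical placeholders, and the string replacement inside `make_duplicate_pair` merely stands in for the synonym conversion.

```python
import random


def make_non_duplicate_pair(original_data_set, rng):
    first, second = rng.sample(original_data_set, 2)
    return (first, second, 0)


def make_duplicate_pair(original_data_set, rng):
    second_document = rng.choice(original_data_set)
    # Placeholder conversion; a real system would apply the synonym table.
    third_document = second_document.replace("tobacco", "tobacco leaf")
    return (second_document, third_document, 1)


def build_data_sets(original_data_set, required_per_set, rng):
    """Repeat pair construction until both sets reach the required volume."""
    deduplication_data_set, duplicate_data_set = [], []
    while (len(deduplication_data_set) < required_per_set
           or len(duplicate_data_set) < required_per_set):
        if len(deduplication_data_set) < required_per_set:
            deduplication_data_set.append(make_non_duplicate_pair(original_data_set, rng))
        if len(duplicate_data_set) < required_per_set:
            duplicate_data_set.append(make_duplicate_pair(original_data_set, rng))
    return deduplication_data_set, duplicate_data_set


original_data_set = ["tobacco curing", "tobacco grading", "tobacco storage"]
dedup_set, dup_set = build_data_sets(original_data_set, 5, random.Random(0))
```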

In the embodiment of the present invention, a preset data weight table is obtained, where the data weight table may assign weights according to keywords in document data, for example, the weights of the keywords with high attention and high occurrence frequency are high, and the weights of the keywords with low attention and low occurrence frequency are low, and then assign the weights of the document data according to the weights of the keywords, where the weight of an original data set may affect a sampling probability when the document data is sampled from the original data set, for example, the greater the weight of the document data is, the higher the probability of being sampled is.

According to the embodiment of the invention, the sampling probability of the literature data is adjusted through the weights, so that literature data with larger weights is targeted more heavily in subsequent model construction, which makes duplicate checking by the model more efficient.
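One possible reading of the weighted sampling is sketched below. It assumes the data weight table maps keywords to numeric weights and that a record takes the largest weight among the keywords it contains; the embodiment does not specify this aggregation rule, and the table entries are hypothetical.

```python
import random

# Hypothetical data weight table: keyword -> weight.
KEYWORD_WEIGHTS = {"nicotine": 3.0, "curing": 2.0}
DEFAULT_WEIGHT = 1.0  # assumed weight for records matching no keyword


def document_weight(document, keyword_weights):
    """Assumed rule: use the largest weight among the keywords present."""
    text = document.lower()
    matched = [w for keyword, w in keyword_weights.items() if keyword in text]
    return max(matched, default=DEFAULT_WEIGHT)


def weighted_sample(original_data_set, keyword_weights, rng):
    """Sample one record with probability proportional to its weight."""
    weights = [document_weight(d, keyword_weights) for d in original_data_set]
    return rng.choices(original_data_set, weights=weights, k=1)[0]


original_data_set = ["Nicotine determination methods", "Packaging line maintenance"]
rng = random.Random(0)
draws = [weighted_sample(original_data_set, KEYWORD_WEIGHTS, rng) for _ in range(2000)]
```

With these weights the first record is three times as likely to be drawn as the second on each draw.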

In the embodiment of the invention, after the deduplication model is constructed, a piece of literature data is randomly selected from the deduplicated original data set and paired with the data to be deduplicated; the paired data to be deduplicated is input into the deduplication model to calculate its duplication probability; the duplication probability is compared with a duplication judgment threshold; and when the duplication probability is greater than the duplication judgment threshold, the data to be deduplicated is judged to be a duplicate.

In addition, when the literature data to be deduplicated is judged not to be a duplicate of any piece of literature data extracted from the deduplicated original data set, the literature data to be deduplicated is marked as non-duplicate and added to the deduplicated literature database.
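The threshold decision can be sketched as follows, assuming the candidate is compared against every record already in the data set and is added only if no pair exceeds the threshold. The scoring function `word_overlap_score` is a simple word-overlap stand-in for the trained deduplication model, and the default threshold of 0.5 is an arbitrary illustrative value.

```python
def word_overlap_score(record_a, record_b):
    """Stand-in for the trained model: Jaccard overlap of the word sets."""
    words_a = set(record_a.lower().split())
    words_b = set(record_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def is_duplicate(candidate, original_data_set, score_pair, threshold):
    """Judge the candidate a duplicate if any pairing exceeds the threshold."""
    return any(score_pair(candidate, existing) > threshold
               for existing in original_data_set)


def ingest(candidate, original_data_set, score_pair, threshold=0.5):
    """Add the candidate only when it is not judged a duplicate."""
    if is_duplicate(candidate, original_data_set, score_pair, threshold):
        return False
    original_data_set.append(candidate)
    return True


database = ["Nicotine content analysis", "Curing barn design"]
rejected = ingest("nicotine content analysis", database, word_overlap_score)
accepted = ingest("Cigarette paper permeability", database, word_overlap_score)
```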

In another embodiment, the method for constructing the tobacco science and technology literature data deduplication model may include the following steps:

Step (1.1): collecting and sorting the existing tobacco scientific and technical literature data and performing deduplication by a labeling method, so as to obtain a tobacco scientific and technical literature data set without duplicate data, which is recorded as the original data set.

Step (1.2): according to the original data set obtained in step (1.1), obtaining a deduplication training set in the following way:

(1.2.A) starting with an empty deduplication training set;

(1.2.B) randomly sampling two literature data records from the original data set, marking the record pair as non-duplicate (value 0), and adding the pair and its mark to the deduplication training set;

(1.2.C) randomly sampling one literature data record from the original data set, replacing the tobacco science and technology keywords contained in its content with synonyms or near-synonyms with a certain probability (less than 30%) to obtain a new literature data record, marking the pair formed by the two records as duplicate (value 1), and adding the pair and its mark to the deduplication training set;

(1.2.D) repeating (1.2.B) or (1.2.C) until the numbers of record pairs marked as duplicate and non-duplicate in the deduplication training set meet the model training requirement.

Step (1.3): constructing a twin (Siamese) neural network model for judging whether two literature data records are duplicate records, wherein the input of the neural network model is the two literature data records, and the output is a value in [0,1] representing the probability that the two input literature records are duplicate records.

Step (1.4): training the neural network model constructed in step (1.3) with the deduplication training set generated in step (1.2), and storing the trained model, which is recorded as the tobacco science and technology literature data deduplication model.
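The probabilistic replacement in step (1.2.C) might be sketched as below. The sketch reads "a certain probability (less than 30%)" as a per-keyword replacement probability and rejects values outside [0, 0.3); that reading, the function names, and the one-entry synonym table are all assumptions.

```python
import random
import re


def replace_with_probability(document, synonyms, probability, rng):
    """Replace each keyword occurrence independently with the given probability."""
    if not 0.0 <= probability < 0.3:
        raise ValueError("replacement probability must be in [0, 0.3)")
    if not synonyms:
        return document
    keywords = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keywords))

    def maybe_swap(match):
        keyword = match.group(0)
        # rng.random() is uniform on [0, 1), so this fires with `probability`.
        return synonyms[keyword] if rng.random() < probability else keyword

    return pattern.sub(maybe_swap, document)


def make_probabilistic_duplicate_pair(document, synonyms, probability, rng):
    """Return (original, perturbed, 1), as in step (1.2.C)."""
    new_document = replace_with_probability(document, synonyms, probability, rng)
    return (document, new_document, 1)


synonyms = {"cut tobacco": "shredded tobacco"}  # hypothetical entry
pair = make_probabilistic_duplicate_pair(
    "cut tobacco moisture", synonyms, 0.2, random.Random(0)
)
```

Because each occurrence is replaced only with low probability, some generated pairs may be textually identical; whether such pairs should be kept is left open here.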

Fig. 2 is a device for constructing a tobacco science and technology literature data deduplication model according to an embodiment of the present invention, including: an acquisition module 201, a first sampling module 202, a second sampling module 203, and a training module 204, wherein:

the acquisition module 201 is configured to acquire literature data of tobacco science and technology, and perform deduplication on the literature data to obtain a deduplicated original data set.

The first sampling module 202 is configured to sample first document data from an original data set, pair the first document data, and construct a deduplication data set according to a result of the pairing.

The second sampling module 203 is configured to sample second literature data from the original data set, acquire a synonym matching standard for the literature data, convert the second literature data into third literature data through the synonym matching standard, and construct a duplicate data set by pairing the second literature data with the third literature data.

And the training module 204 is configured to perform model training on the duplicate removal data set and the duplicate data set through a neural network model to obtain a duplicate removal model of the literature data of the tobacco science and technology.

In one embodiment, the apparatus may further comprise:

and the second acquisition module is used for acquiring the keywords in the second document data and replacing the keywords with synonyms to obtain third document data.

In one embodiment, the apparatus may further comprise:

the detection module is used for detecting whether the data volumes of the deduplication data set and the duplicate data set reach a preset data volume standard;

and the repeated sampling module is used for repeating, when the data volumes of the deduplication data set and the duplicate data set do not reach the preset data volume standard, the steps of constructing the deduplication data set and the duplicate data set until the data volumes reach the preset data volume standard.

In one embodiment, the apparatus may further comprise:

and the third acquisition module is used for acquiring a preset data weight table, distributing weights to the data in the original data set according to the data weight table to obtain the original data set after weight distribution, wherein the weights of the original data set are used for adjusting the sampling probability when the original data set is sampled.

In one embodiment, the apparatus may further comprise:

and the receiving module is used for receiving the data to be deduplicated and randomly selecting data from the original data set to be paired with the data to be deduplicated to obtain paired data to be deduplicated.

And the input module is used for inputting the paired data to be deduplicated into the deduplication model to obtain the repetition probability of the data to be deduplicated.

And the fourth acquisition module is used for acquiring the repeated judgment threshold, comparing the repeated probability with the repeated judgment threshold and judging whether the data to be deduplicated are repeated or not according to the comparison result.

In one embodiment, the apparatus may further comprise:

and the marking module is used for marking the literature data in the deduplication data set as 0 and marking the literature data in the duplicate data set as 1.

And the second training module is used for performing model training on the literature data and the marks of the de-duplicated data set and the literature data and the marks of the repeated data set through a neural network model.

For specific limitations of the device for constructing the tobacco science and technology literature data deduplication model, reference may be made to the above limitations on the method for constructing the tobacco science and technology literature data deduplication model, and details are not repeated here. All or part of each module in the device for constructing the tobacco science and technology literature data deduplication model can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

Fig. 3 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 3: a processor 301, a memory 302, a communication interface 303 and a communication bus 304, wherein the processor 301, the memory 302 and the communication interface 303 communicate with each other through the communication bus 304. The processor 301 may call logic instructions in the memory 302 to perform the following method: acquiring literature data of tobacco science and technology, and deduplicating the literature data to obtain a deduplicated original data set; sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set according to the pairing result; sampling second literature data from the original data set, acquiring a synonym matching standard for the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a duplicate data set by pairing the second literature data with the third literature data; and performing model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model for the tobacco science and technology literature data.

Furthermore, the logic instructions in the memory 302 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the method provided in the foregoing embodiments, the method including, for example: acquiring literature data of tobacco science and technology, and deduplicating the literature data to obtain a deduplicated original data set; sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set according to the pairing result; sampling second literature data from the original data set, acquiring a synonym matching standard for the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a duplicate data set by pairing the second literature data with the third literature data; and performing model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model for the tobacco science and technology literature data.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A construction method of a tobacco science and technology literature data deduplication model is characterized by comprising the following steps:

2. The method for constructing the tobacco science and technology literature data deduplication model according to claim 1, wherein the converting the second literature data into third literature data through the synonymous matching criterion comprises:

3. The method for constructing a tobacco science and technology literature data deduplication model according to claim 1, wherein before the model training of the deduplication data set and the duplication data set through a neural network model, the method further comprises:

4. The method for constructing a tobacco science and technology literature data deduplication model according to claim 1, wherein after obtaining the deduplicated original data set, the method further comprises:

5. The method for constructing the tobacco science and technology literature data deduplication model according to claim 1, wherein after obtaining the deduplication model of the tobacco science and technology literature data, the method further comprises:

6. The method for constructing the tobacco science and technology literature data deduplication model according to claim 1, wherein the method further comprises:

7. A tobacco science and technology literature data deduplication model construction device is characterized by comprising:

8. The apparatus for constructing a tobacco technology literature data deduplication model according to claim 7, wherein the apparatus further comprises:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for constructing a de-duplication model of tobacco technology literature data according to any one of claims 1 to 6 when executing the program.

10. A non-transitory computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for constructing a tobacco technology literature data deduplication model according to any one of claims 1 to 6.