
CN112115236B - Construction method and device of tobacco science and technology literature data deduplication model

Info

Publication number: CN112115236B
Application number: CN202011070240.0A
Authority: CN
Inventors: 闫爱华; 张胜华; 唐敏; 李琳; 黎根; 赵鹏
Original assignee: China Tobacco Hubei Industrial LLC
Current assignee: China Tobacco Hubei Industrial LLC
Priority date: 2020-10-09
Filing date: 2020-10-09
Publication date: 2024-02-02
Anticipated expiration: 2040-10-09
Also published as: CN112115236A

Abstract

The embodiment of the invention provides a construction method and a construction device of a tobacco science and technology literature data deduplication model, wherein the method comprises the following steps: obtaining literature data of tobacco science and technology, and performing duplication removal on the literature data to obtain an original data set after duplication removal; sampling first literature data from an original data set, pairing the first literature data, and constructing a deduplication data set through the paired result; sampling second literature data from the original data set, acquiring a synonym matching standard of the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a repeated data set through pairing the second literature data and the third literature data; and carrying out model training on the deduplication data set and the repeated data set through a neural network model to obtain a deduplication model of literature data of the tobacco science and technology. By adopting the method, a duplication elimination model which can eliminate duplication aiming at tobacco science and technology literature data can be constructed more efficiently.

Description

Construction method and device of tobacco science and technology literature data deduplication model

Technical Field

The invention relates to the technical field of tobacco science and technology data processing, in particular to a construction method and device of a tobacco science and technology literature data deduplication model.

Background

In recent years, with the continuous deepening of informatization in the tobacco science and technology field, tobacco-related departments have accumulated a large and steadily growing amount of tobacco science and technology literature data; however, because the systems of the departments, enterprises and business units participating in tobacco informatization lack unified standards, the data quality of tobacco science and technology literature information also faces great challenges.

One of the major data quality problems of tobacco science and technology literature data is data duplication (for example, the same literature data recorded repeatedly, or the same literature data organized according to different format specifications); to make better use of the data, an effective data deduplication method is required to process it.

Currently, there are some conventional methods for deduplicating literature data records, including duplicate judgment by comparing record IDs, keyword lists, abstract information, and the like. However, these conventional methods are general-purpose duplicate judgment methods; when applied to tobacco science and technology literature they are not tailored to the field, and their duplicate judgment efficiency is low.

Disclosure of Invention

Aiming at the problems existing in the prior art, the embodiment of the invention provides a construction method of a tobacco science and technology literature data deduplication model.

The embodiment of the invention provides a construction method of a tobacco science and technology literature data deduplication model, which comprises the following steps:

obtaining literature data of tobacco science and technology, and performing duplication removal on the literature data to obtain an original data set after duplication removal;

sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set through the pairing result;

sampling second literature data from the original data set, acquiring a synonym matching standard of the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a repeated data set through pairing the second literature data and the third literature data;

and carrying out model training on the deduplication data set and the repeated data set through a neural network model to obtain a deduplication model of the literature data of the tobacco science and technology.

In one embodiment, the method further comprises:

and obtaining keywords in the second literature data, and replacing the keywords with synonyms to obtain the third literature data.

In one embodiment, the method further comprises:

detecting whether the data volume of the deduplication data set and the repeated data set reaches a preset data volume standard;

and when the data volume of the de-duplication data set and the repeated data set does not reach the preset data volume standard, repeating the steps of constructing the de-duplication data set and constructing the repeated data set until the data volumes of the de-duplication data set and the repeated data set reach the preset data volume standard.

In one embodiment, the method further comprises:

and acquiring a preset data weight table, distributing weights for data in the original data set according to the data weight table, and obtaining an original data set with the distributed weights, wherein the weights of the original data set are used for adjusting sampling probability when the original data set is sampled.

In one embodiment, the method further comprises:

receiving data to be de-duplicated, and randomly selecting data from the original data set to be paired with the data to be de-duplicated to obtain paired data to be de-duplicated;

inputting the paired data to be deduplicated into the deduplication model to obtain the repetition probability of the paired data to be deduplicated;

and acquiring a repetition judgment threshold, comparing the repetition probability with the repetition judgment threshold, and judging whether the data to be de-duplicated is repeated or not according to a comparison result.

In one embodiment, the method further comprises:

marking the document data in the deduplication data set as 0 and the document data in the repeated data set as 1;

the model training of the deduplication dataset and the repetition dataset through a neural network model comprises:

performing model training, through a neural network model, on the literature data and marks of the deduplication data set and the literature data and marks of the repeated data set.

The embodiment of the invention provides a construction device of a tobacco science and technology literature data deduplication model, which comprises:

the acquisition module is used for acquiring literature data of tobacco science and technology, and performing deduplication on the literature data to obtain an original data set after deduplication;

the first sampling module is used for sampling first literature data from the original data set, pairing the first literature data and constructing a deduplication data set according to the pairing result;

the second sampling module is used for sampling second literature data from the original data set, acquiring a synonymous matching standard of the literature data, converting the second literature data into third literature data through the synonymous matching standard, and constructing a repeated data set through pairing the second literature data and the third literature data;

and the training module is used for carrying out model training on the deduplication data set and the repeated data set through a neural network model to obtain a deduplication model of the literature data of the tobacco science and technology.

In one embodiment, the apparatus further comprises:

and the second acquisition module is used for acquiring the keywords in the second document data and replacing the keywords with synonyms to obtain the third document data.

The embodiment of the invention provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the construction method of the tobacco science and technology literature data deduplication model when executing the program.

An embodiment of the present invention provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for constructing a tobacco science and technology literature data deduplication model described above.

According to the construction method and the construction device for the tobacco science and technology literature data deduplication model, literature data of tobacco science and technology are obtained, deduplication is carried out on the literature data, and an original data set after deduplication is obtained; sampling first literature data from an original data set, pairing the first literature data, and constructing a deduplication data set through the paired result; sampling second literature data from the original data set, acquiring a synonym matching standard of the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a repeated data set through pairing the second literature data and the third literature data; and carrying out model training on the deduplication data set and the repeated data set through a neural network model to obtain a deduplication model of literature data of the tobacco science and technology. The construction method for constructing the deduplication model aiming at the tobacco science and technology literature data is convenient for subsequent deduplication aiming at the tobacco science and technology literature data with higher efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a method for constructing a tobacco science and technology literature data deduplication model in an embodiment of the invention;

FIG. 2 is a block diagram of a construction device of a tobacco science and technology literature data deduplication model in an embodiment of the invention;

FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

FIG. 1 is a flowchart of a method for constructing a tobacco science and technology literature data deduplication model according to an embodiment of the invention. As shown in FIG. 1, the embodiment of the invention provides a method for constructing a tobacco science and technology literature data deduplication model, which includes:

Step S101, obtaining literature data of tobacco science and technology, and performing duplication removal on the literature data to obtain an original data set after duplication removal.

Specifically, the literature data used to construct the deduplication model is obtained from a database of tobacco science and technology literature data, and the obtained literature data is deduplicated, where deduplication may consist of deleting identical data in the literature data, so that literature data without duplicate data is obtained and recorded as the original data set.
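
The exact-match deduplication of step S101 can be sketched as follows. This is a minimal illustration, assuming each record is a flat dictionary of string fields; the `Record` type, the function name and the sample records are hypothetical and do not come from the patent.

```python
from typing import Dict, List

Record = Dict[str, str]


def remove_exact_duplicates(records: List[Record]) -> List[Record]:
    """Keep the first occurrence of each identical record, preserving order."""
    seen = set()
    original_data_set: List[Record] = []
    for record in records:
        # Two records count as identical only if every field matches exactly.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            original_data_set.append(record)
    return original_data_set


# Hypothetical sample records for illustration only.
raw_records = [
    {"title": "Aroma of flue-cured tobacco leaves", "year": "2018"},
    {"title": "Aroma of flue-cured tobacco leaves", "year": "2018"},
    {"title": "Drying of cut tobacco", "year": "2019"},
]
original_data_set = remove_exact_duplicates(raw_records)
```

Only byte-for-byte identical records are removed here; records that differ in wording or formatting remain, which is the case the trained model is meant to handle.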

Step S102, sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set through the pairing result.

Specifically, first document data is randomly sampled from the original data set and the sampled first document data are paired with each other; for example, when two pieces of first document data are sampled, the two pieces are paired, and the deduplication data set is constructed from the pairing results.
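
The pairing in step S102 might look like the sketch below. It assumes records are plain strings and that a pair is stored together with its mark of 0; the `Pair` alias, the function name and the sample records are hypothetical.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str, int]  # (record_a, record_b, label)


def sample_non_duplicate_pair(original_data_set: List[str], rng: random.Random) -> Pair:
    """Draw two different records and label the pair 0 (non-duplicate)."""
    # rng.sample picks two distinct positions; because the original data set
    # has already been deduplicated, the two records differ.
    record_a, record_b = rng.sample(original_data_set, 2)
    return record_a, record_b, 0


# Hypothetical records for illustration only.
original_data_set = [
    "Aroma of flue-cured tobacco leaves",
    "Drying of cut tobacco",
    "Nicotine content in cigarette smoke",
]
rng = random.Random(0)
deduplication_data_set = [sample_non_duplicate_pair(original_data_set, rng) for _ in range(4)]
```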

Step S103, sampling second literature data from the original data set, acquiring a synonym matching standard of the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a repeated data set through pairing the second literature data and the third literature data.

Specifically, second document data is randomly sampled from the original data set, the second document data is then converted into third document data by applying the synonym matching standard to it, and the second document data and the third document data (before and after synonym conversion) are paired to obtain the repeated data set.

In addition, the synonym matching standard may be as follows: keywords in the second document data are obtained and then replaced with synonyms taken from a tobacco science and technology database, so as to obtain the third document data, where the keywords may be selected from: a) the tobacco part of the Chinese classification subject vocabulary; b) the tobacco part of the Chinese subject vocabulary; c) the tobacco part of the International Convention on the Harmonized Commodity Description and Coding System; d) the subject vocabulary of tobacco industry official documents; e) national and industry standards related to the tobacco industry; f) terms publicly disclosed by enterprises, research institutions and associations related to the tobacco industry, and the like.
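
As a rough illustration of the synonym conversion in step S103, the sketch below replaces keywords through a lookup table and pairs the record before and after conversion with the mark 1. The `SYNONYMS` table and both function names are hypothetical; a real table would be drawn from the vocabularies listed above.

```python
from typing import Dict, Tuple

# Hypothetical synonym table for illustration only.
SYNONYMS: Dict[str, str] = {
    "flue-cured tobacco": "Virginia tobacco",
    "cut tobacco": "shredded tobacco",
}


def to_third_document(second_document: str, synonyms: Dict[str, str]) -> str:
    """Replace every keyword found in the record with its synonym."""
    third_document = second_document
    # Note: replacements are applied one after another, so this simple loop
    # assumes no synonym itself contains another keyword.
    for keyword, synonym in synonyms.items():
        third_document = third_document.replace(keyword, synonym)
    return third_document


def make_duplicate_pair(second_document: str, synonyms: Dict[str, str]) -> Tuple[str, str, int]:
    """Pair the record before and after conversion and label it 1 (duplicate)."""
    return second_document, to_third_document(second_document, synonyms), 1


duplicate_pair = make_duplicate_pair("Aroma of flue-cured tobacco leaves", SYNONYMS)
```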

Step S104, carrying out model training on the deduplication data set and the repeated data set through a neural network model to obtain a deduplication model of the literature data of the tobacco science and technology.

Specifically, the deduplication data set and the repeated data set obtained in the above steps are used to train a neural network model, and the specific training steps may be as follows: the literature data in the deduplication data set is marked as 0 and the literature data in the repeated data set is marked as 1; a twin (Siamese) neural network model is then constructed for judging whether two literature data records are duplicate records, wherein during training the input of the neural network model is a pair of literature data records together with its mark of 0 or 1, and the output is a value between 0 and 1 representing the probability that the two input literature records are duplicate records.
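
The patent does not specify the network architecture beyond the twin structure, so the following is only a toy stand-in that mirrors the layout: both records pass through the same encoder, and a trainable logistic output maps their similarity to a value between 0 and 1. The bag-of-words encoder is fixed rather than learned, and the `TwinScorer` class, its hyperparameters and the sample pairs are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str, int]  # (record_a, record_b, label)


def encode(record: str) -> Counter:
    """Shared branch: both records go through this same (non-learned) encoder."""
    return Counter(record.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(count * v[token] for token, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


class TwinScorer:
    """Toy twin-structured scorer: shared encoder plus a trainable logistic output."""

    def __init__(self) -> None:
        self.w = 0.0
        self.b = 0.0

    def predict(self, record_a: str, record_b: str) -> float:
        similarity = cosine(encode(record_a), encode(record_b))
        return 1.0 / (1.0 + math.exp(-(self.w * similarity + self.b)))

    def fit(self, pairs: List[Pair], epochs: int = 300, lr: float = 0.5) -> None:
        # Plain stochastic gradient descent on the binary cross-entropy loss.
        for _ in range(epochs):
            for record_a, record_b, label in pairs:
                similarity = cosine(encode(record_a), encode(record_b))
                error = self.predict(record_a, record_b) - label
                self.w -= lr * error * similarity
                self.b -= lr * error


# Hypothetical labeled pairs: 1 = repeated data set, 0 = deduplication data set.
training_pairs: List[Pair] = [
    ("aroma of flue-cured tobacco leaves", "aroma of virginia tobacco leaves", 1),
    ("aroma of flue-cured tobacco leaves", "drying of cut tobacco", 0),
    ("drying of cut tobacco", "drying of shredded tobacco", 1),
    ("drying of cut tobacco", "nicotine content in cigarette smoke", 0),
]
model = TwinScorer()
model.fit(training_pairs)
p_duplicate = model.predict("aroma of flue-cured tobacco leaves", "aroma of virginia tobacco leaves")
p_distinct = model.predict("aroma of flue-cured tobacco leaves", "drying of cut tobacco")
```

In a real system the shared encoder would be a trained neural network branch with weights shared between the two inputs; the fixed encoder here only keeps the sketch dependency-free.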

According to the construction method of the tobacco science and technology literature data deduplication model, literature data of tobacco science and technology are obtained, deduplication is carried out on the literature data, and an original data set after deduplication is obtained; sampling first literature data from an original data set, pairing the first literature data, and constructing a deduplication data set through the paired result; sampling second literature data from the original data set, acquiring a synonym matching standard of the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a repeated data set through pairing the second literature data and the third literature data; and carrying out model training on the deduplication data set and the repeated data set through a neural network model to obtain a deduplication model of literature data of the tobacco science and technology. The construction method for constructing the deduplication model aiming at the tobacco science and technology literature data is convenient for subsequent deduplication aiming at the tobacco science and technology literature data with higher efficiency.

On the basis of the above embodiment, the method for constructing the tobacco science and technology literature data deduplication model further includes:

In the embodiment of the invention, before model training is performed on the deduplication data set and the repeated data set through the neural network model, whether the data volume of the deduplication data set and the repeated data set reaches a preset data volume standard can be detected; for example, when the data volume is detected to be insufficient for training a highly reliable model, the steps of constructing the deduplication data set and constructing the repeated data set are repeated until the data volumes of the deduplication data set and the repeated data set reach the preset data volume standard, that is, until a highly reliable model can be constructed, and model training is then performed through the neural network model.

According to the embodiment of the invention, whether the data volume can construct a model with high reliability is detected, and when the data volume reaches the standard, the model is constructed, so that the reliability of the model is ensured.
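
The volume check can be sketched as a loop that keeps generating pairs until both sets reach the preset size. The sketch assumes the standard is a simple pair count per set; `build_until_enough`, the two pair builders and the sample records are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str, int]  # (record_a, record_b, label)


def build_until_enough(
    make_non_duplicate: Callable[[], Pair],
    make_duplicate: Callable[[], Pair],
    required_pairs: int,
) -> Tuple[List[Pair], List[Pair]]:
    deduplication_data_set: List[Pair] = []
    repeated_data_set: List[Pair] = []
    # Repeat the two construction steps until both sets reach the preset volume.
    while len(deduplication_data_set) < required_pairs or len(repeated_data_set) < required_pairs:
        if len(deduplication_data_set) < required_pairs:
            deduplication_data_set.append(make_non_duplicate())
        if len(repeated_data_set) < required_pairs:
            repeated_data_set.append(make_duplicate())
    return deduplication_data_set, repeated_data_set


# Hypothetical records and pair builders for illustration only.
records = ["Aroma of flue-cured tobacco leaves", "Drying of cut tobacco"]
rng = random.Random(0)


def non_duplicate() -> Pair:
    record_a, record_b = rng.sample(records, 2)
    return record_a, record_b, 0


def duplicate() -> Pair:
    record = rng.choice(records)
    variant = record.replace("flue-cured tobacco", "Virginia tobacco")
    variant = variant.replace("cut tobacco", "shredded tobacco")
    return record, variant, 1


dedup_set, repeated_set = build_until_enough(non_duplicate, duplicate, required_pairs=5)
```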

On the basis of the foregoing embodiment, the method for constructing the tobacco science and technology literature data deduplication model further includes:

In the embodiment of the invention, a preset data weight table is obtained, wherein the data weight table may assign weights according to the keywords in the document data; for example, keywords with high attention and high occurrence frequency receive high weights, and keywords with low attention and low occurrence frequency receive low weights. The weights of the document data are then assigned according to the weights of its keywords, and the weights in the original data set influence the sampling probability when document data is sampled from the original data set; for example, the larger the weight of a piece of document data, the higher its probability of being sampled.

According to the embodiment of the invention, the sampling probability of the document data is adjusted through the weights, so that subsequent model construction focuses more on document data with larger weights, which makes the model more efficient at duplicate checking.
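
Weight-adjusted sampling could be sketched as below. The patent does not say how keyword weights combine into a document weight, so summing the weights of the keywords present (with a default for documents that contain none) is an assumption, as are the `KEYWORD_WEIGHTS` table and the function names.

```python
import random
from typing import Dict, List

# Hypothetical data weight table: keyword -> weight.
KEYWORD_WEIGHTS: Dict[str, float] = {"flue-cured tobacco": 3.0, "nicotine": 2.0}
DEFAULT_WEIGHT = 1.0  # assumed weight for documents containing no listed keyword


def document_weight(document: str, keyword_weights: Dict[str, float]) -> float:
    """Assumed rule: a document's weight is the sum of its keywords' weights."""
    lowered = document.lower()
    total = sum(weight for keyword, weight in keyword_weights.items() if keyword in lowered)
    return total if total > 0 else DEFAULT_WEIGHT


def weighted_sample(documents: List[str], keyword_weights: Dict[str, float], rng: random.Random) -> str:
    """Sample one document; a larger weight gives a higher sampling probability."""
    weights = [document_weight(document, keyword_weights) for document in documents]
    return rng.choices(documents, weights=weights, k=1)[0]


documents = ["Nicotine in flue-cured tobacco", "Packaging design"]
sampled = weighted_sample(documents, KEYWORD_WEIGHTS, random.Random(0))
```

`random.choices` treats the weights as relative, so they do not need to sum to 1.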

In the embodiment of the invention, after the deduplication model is constructed, data to be deduplicated is received, a piece of document data is randomly selected from the deduplicated original data set and paired with the data to be deduplicated, the resulting pair is input into the deduplication model to calculate the repetition probability of the data to be deduplicated, and the repetition probability is compared with the repetition judgment threshold; when the repetition probability is larger than the repetition judgment threshold, the data to be deduplicated is judged to be a duplicate.

In addition, when the document data to be deduplicated is determined to be non-duplicate with respect to every piece of document data extracted from the deduplicated original data set, the document data to be deduplicated is marked as non-duplicate and added to the deduplicated document database.
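
The threshold comparison and the follow-up insertion can be sketched as follows. The text selects stored records at random; this sketch simply checks the candidate against every stored record, which is an assumption. The `word_overlap` scorer is a toy stand-in for the trained deduplication model, and the threshold of 0.5 is a hypothetical value.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def word_overlap(record_a: str, record_b: str) -> float:
    """Toy stand-in for the trained deduplication model (Jaccard word overlap)."""
    a, b = set(record_a.lower().split()), set(record_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def is_duplicate(candidate: str, original_data_set: List[str], score: Scorer, threshold: float) -> bool:
    """The candidate is a duplicate if any pairing exceeds the judgment threshold."""
    return any(score(candidate, existing) > threshold for existing in original_data_set)


def ingest(candidate: str, original_data_set: List[str], score: Scorer, threshold: float) -> bool:
    """Add the candidate only when it is judged non-duplicate; report whether it was added."""
    if is_duplicate(candidate, original_data_set, score, threshold):
        return False
    original_data_set.append(candidate)
    return True


# Hypothetical database and threshold for illustration only.
database = ["drying of cut tobacco"]
added_variant = ingest("drying of shredded tobacco", database, word_overlap, threshold=0.5)
added_new = ingest("nicotine content in cigarette smoke", database, word_overlap, threshold=0.5)
```

Here the variant shares three of five distinct words with the stored record (overlap 0.6), so it is rejected, while the unrelated record is added.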

In another embodiment, the steps of the construction method of the tobacco science and technology literature data deduplication model can be as follows:

Step (1.1), collecting and sorting the existing tobacco science and technology literature data and performing deduplication processing by a labeling method; after this step, a tobacco science and technology literature data set without duplicate data is obtained and recorded as the original data set.

Step (1.2), obtaining a deduplication training set from the original data set obtained in step (1.1) in the following way: (1.2.a) starting from an empty deduplication training set; (1.2.b) randomly sampling two document data records from the original data set, marking the record pair as non-duplicate (value 0), and adding the data pair and the marking result to the deduplication training set; (1.2.c) randomly sampling one document data record from the original data set, replacing keywords of the tobacco science and technology field contained in the content of the document data with synonyms or near-synonyms with a certain probability (less than 30%) to obtain new document data, marking the pair formed by the document data record before and after replacement as duplicate (value 1), and adding the data pair and the marking result to the deduplication training set; (1.2.d) repeating (1.2.b) or (1.2.c) until the numbers of records marked as duplicate and non-duplicate in the deduplication training set meet the model training requirements.

Step (1.3), constructing a twin neural network model for judging whether two document data records are repeated records, wherein the input of the neural network model is the two document data records, and the output result is a value between 0 and 1 and represents the probability of the two input document records being repeated data records;

Step (1.4), training the neural network model constructed in step (1.3) by using the deduplication training set generated in step (1.2), and saving the trained model, which is recorded as the deduplication model of tobacco science and technology literature data.
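
The probabilistic replacement described in step (1.2) can be sketched as follows, assuming the probability applies independently to each keyword occurrence (the text does not say whether it applies per keyword or per record). The function name, the `SYNONYMS` table and the probability of 0.25 are hypothetical.

```python
import random
import re
from typing import Dict, Tuple

# Hypothetical synonym table for illustration only.
SYNONYMS: Dict[str, str] = {
    "flue-cured tobacco": "Virginia tobacco",
    "cut tobacco": "shredded tobacco",
}


def replace_keywords_randomly(
    text: str, synonyms: Dict[str, str], probability: float, rng: random.Random
) -> str:
    """Replace each keyword occurrence with its synonym with the given probability."""
    if not synonyms:
        return text
    # Longest keywords first so that overlapping keywords match the longer phrase.
    ordered = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(keyword) for keyword in ordered))

    def swap(match: re.Match) -> str:
        keyword = match.group(0)
        return synonyms[keyword] if rng.random() < probability else keyword

    return pattern.sub(swap, text)


original = "Drying of cut tobacco and flue-cured tobacco"
variant = replace_keywords_randomly(original, SYNONYMS, probability=0.25, rng=random.Random(0))
training_pair: Tuple[str, str, int] = (original, variant, 1)  # 1 = duplicate
```

With a low probability the variant may come out identical to the original; such a pair is still a valid duplicate example.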

FIG. 2 is a block diagram of a construction device of a tobacco science and technology literature data deduplication model provided by an embodiment of the present invention, the device including: an acquisition module 201, a first sampling module 202, a second sampling module 203, and a training module 204, wherein:

The acquisition module 201 is configured to obtain literature data of tobacco science and technology, and perform deduplication on the literature data to obtain an original data set after deduplication.

A first sampling module 202, configured to sample the first document data from the original data set, pair the first document data, and construct a deduplication data set according to the paired result.

The second sampling module 203 is configured to sample the second document data from the original data set, acquire a synonym matching standard of the document data, convert the second document data into third document data according to the synonym matching standard, and construct a repeated data set by pairing the second document data and the third document data.

The training module 204 is configured to perform model training on the deduplication dataset and the repetition dataset through a neural network model, so as to obtain a deduplication model of literature data of tobacco science and technology.

In one embodiment, the apparatus may further include:

and the second acquisition module is used for acquiring keywords in the second document data and replacing the keywords with synonyms to obtain third document data.

In one embodiment, the apparatus may further include:

the detection module is used for detecting whether the data volume of the deduplication data set and the repeated data set reaches a preset data volume standard or not;

and the repeated sampling module is used for repeating the steps of constructing the deduplication data set and constructing the repeated data set until the data volume of the deduplication data set and the repeated data set reaches the preset data volume standard when the data volume of the deduplication data set and the repeated data set does not reach the preset data volume standard.

In one embodiment, the apparatus may further include:

the third acquisition module is used for acquiring a preset data weight table and distributing weights to the data in the original data set according to the data weight table to obtain an original data set with distributed weights, wherein the weights of the original data set are used for adjusting the sampling probability when the original data set is sampled.

In one embodiment, the apparatus may further include:

the receiving module is used for receiving the data to be de-duplicated, randomly selecting the data from the original data set and pairing the data to be de-duplicated to obtain paired data to be de-duplicated.

And the input module is used for inputting the paired data to be deduplicated into the deduplication model to obtain the repetition probability of the data to be deduplicated.

And the fourth acquisition module is used for acquiring a repetition judgment threshold value, comparing the repetition probability with the repetition judgment threshold value, and judging whether the data to be de-duplicated is repeated or not according to the comparison result.

In one embodiment, the apparatus may further include:

and the marking module is used for marking the document data in the deduplication data set as 0 and marking the document data in the repeated data set as 1.

And the second training module is used for training the model through the neural network model by using the literature data and the marks of the deduplication data set and the literature data and the marks of the repetition data set.

For specific limitations on the construction device of the tobacco science and technology literature data deduplication model, reference may be made to the above limitations on the construction method of the tobacco science and technology literature data deduplication model, which are not repeated here. The modules in the construction device of the tobacco science and technology literature data deduplication model can be fully or partially realized by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or be independent of, a processor in the computer device, or may be stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.

Fig. 3 illustrates a physical schematic diagram of an electronic device, as shown in fig. 3, where the electronic device may include: a processor (processor) 301, a memory (memory) 302, a communication interface (Communications Interface) 303 and a communication bus 304, wherein the processor 301, the memory 302 and the communication interface 303 perform communication with each other through the communication bus 304. The processor 301 may call logic instructions in the memory 302 to perform the following method: obtaining literature data of tobacco science and technology, and performing duplication removal on the literature data to obtain an original data set after duplication removal; sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set through the pairing result; sampling second literature data from the original data set, acquiring a synonym matching standard of the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a repeated data set through pairing the second literature data and the third literature data; and carrying out model training on the deduplication data set and the repeated data set through a neural network model to obtain a deduplication model of the literature data of the tobacco science and technology.

Further, the logic instructions in the memory 302 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.

In another aspect, embodiments of the present invention further provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method provided in the above embodiments, for example, including: obtaining literature data of tobacco science and technology, and performing duplication removal on the literature data to obtain an original data set after duplication removal; sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set through the pairing result; sampling second literature data from the original data set, acquiring a synonym matching standard of the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a repeated data set through pairing the second literature data and the third literature data; and carrying out model training on the deduplication data set and the repeated data set through a neural network model to obtain a deduplication model of the literature data of the tobacco science and technology.

The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without inventive effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. The construction method of the tobacco science and technology literature data deduplication model is characterized by comprising the following steps of:

performing model training on the deduplication data set and the repeated data set through a neural network model to obtain a deduplication model of the literature data of the tobacco science and technology;

the converting the second document data into third document data by the synonymous matching criterion includes:

obtaining keywords in the second literature data, and replacing the keywords with synonyms to obtain the third literature data;

the method further comprises the steps of:

2. The method for constructing a tobacco science and technology literature data deduplication model according to claim 1, wherein before model training is performed on the deduplication dataset and the repetition dataset through a neural network model, the method further comprises:

3. The method for constructing a tobacco science and technology literature data deduplication model according to claim 1, wherein after obtaining the deduplicated original dataset, further comprises:

4. The method for constructing a tobacco science and technology literature data deduplication model according to claim 1, wherein after the deduplication model of the tobacco science and technology literature data is obtained, further comprising:

5. A device for constructing a tobacco science and technology literature data deduplication model, which is characterized by comprising:

the training module is used for carrying out model training on the deduplication data set and the repeated data set through a neural network model to obtain a deduplication model of the literature data of the tobacco science and technology;

the second acquisition module is used for acquiring keywords in the second document data and replacing the keywords with synonyms to obtain the third document data;

a marking module for marking the document data in the deduplication data set as 0 and marking the document data in the repeated data set as 1;

6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the method of constructing a tobacco science and technology literature data deduplication model as defined in any one of claims 1 to 4.

7. A non-transitory computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of constructing a tobacco science and technology literature data deduplication model as defined in any one of claims 1 to 4.