CN112115236A - Method and device for constructing a tobacco scientific and technical literature data deduplication model

Info

Publication number
CN112115236A
Authority
CN
China
Prior art keywords
data
literature
data set
literature data
duplicate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011070240.0A
Other languages
Chinese (zh)
Other versions
CN112115236B (en)
Inventor
闫爱华
张胜华
唐敏
李琳
黎根
赵鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Tobacco Hubei Industrial LLC
Original Assignee
China Tobacco Hubei Industrial LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Tobacco Hubei Industrial LLC
Priority to CN202011070240.0A
Publication of CN112115236A
Application granted
Publication of CN112115236B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a method and a device for constructing a deduplication model for tobacco scientific and technical literature data. The method comprises: acquiring literature data of tobacco science and technology and deduplicating it to obtain a deduplicated original data set; sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set according to the pairing result; sampling second literature data from the original data set, acquiring a synonym matching standard of the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a duplicate data set by pairing the second literature data with the third literature data; and performing model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model of the literature data of tobacco science and technology. The deduplication model constructed by this method can deduplicate tobacco scientific and technical literature data more efficiently.

Description

Method and device for constructing tobacco scientific and technical literature data deduplication model
Technical Field
The invention relates to the technical field of tobacco scientific and technological data processing, in particular to a method and a device for constructing a tobacco scientific and technological literature data deduplication model.
Background
In recent years, as informatization in the tobacco science and technology field has deepened, a growing number of tobacco-related departments have accumulated a large amount of tobacco scientific and technical literature data. However, because the systems of the departments, enterprises and business units participating in tobacco informatization lack unified standards, the data quality of tobacco scientific and technical literature information faces a great challenge.
For tobacco scientific and technical literature data, one of the main data quality problems is data duplication (for example, literature records entered more than once, or the same literature organized according to different format specifications). To make better use of literature data in the tobacco science and technology field, an effective data deduplication method is needed to process it.
Currently, some conventional methods exist for deduplicating literature data records, including duplicate determination by comparing record IDs, keyword lists, abstract information, and the like. However, these are general-purpose duplicate determination methods: applied to tobacco scientific and technical literature they are not tailored to the field, and their duplicate determination efficiency is low.
Disclosure of Invention
Aiming at the problems in the prior art, the embodiment of the invention provides a method for constructing a tobacco science and technology literature data deduplication model.
The embodiment of the invention provides a method for constructing a tobacco scientific and technical literature data deduplication model, which comprises the following steps:
acquiring literature data of tobacco science and technology, and deduplicating the literature data to obtain a deduplicated original data set;
sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set according to the pairing result;
sampling second literature data from the original data set, acquiring a synonym matching standard of the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a duplicate data set by pairing the second literature data with the third literature data;
and performing model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model of the literature data of tobacco science and technology.
In one embodiment, the method further comprises:
and acquiring the keywords in the second literature data, and replacing the keywords with synonyms to obtain the third literature data.
In one embodiment, the method further comprises:
detecting whether the data volumes of the deduplication data set and the duplicate data set reach a preset data volume standard;
and when the data volumes of the deduplication data set and the duplicate data set do not reach the preset data volume standard, repeating the steps of constructing the deduplication data set and the duplicate data set until both data volumes reach the preset data volume standard.
In one embodiment, the method further comprises:
and acquiring a preset data weight table, and distributing weights to the data in the original data set according to the data weight table to obtain the original data set after weight distribution, wherein the weights of the original data set are used for adjusting the sampling probability when the original data set is sampled.
In one embodiment, the method further comprises:
receiving data to be deduplicated, and randomly selecting data from the original data set to pair with the data to be deduplicated, to obtain a data pair to be checked;
inputting the data pair to be checked into the deduplication model to obtain the duplication probability of the data to be deduplicated;
and acquiring a duplicate judgment threshold, comparing the duplication probability with the duplicate judgment threshold, and judging whether the data to be deduplicated is a duplicate according to the comparison result.
In one embodiment, the method further comprises:
marking the literature data in the deduplication data set as 0, and marking the literature data in the duplicate data set as 1;
wherein performing model training on the deduplication data set and the duplicate data set through a neural network model comprises:
performing model training, through the neural network model, on the literature data and marks of the deduplication data set and the literature data and marks of the duplicate data set.
The embodiment of the invention provides a device for constructing a tobacco scientific and technical literature data deduplication model, which comprises:
the acquisition module is used for acquiring literature data of tobacco science and technology, and deduplicating the literature data to obtain a deduplicated original data set;
the first sampling module is used for sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set according to the pairing result;
the second sampling module is used for sampling second literature data from the original data set, acquiring a synonym matching standard of the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a duplicate data set by pairing the second literature data with the third literature data;
and the training module is used for performing model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model of the literature data of tobacco science and technology.
In one embodiment, the apparatus further comprises:
and the second acquisition module is used for acquiring the keywords in the second document data and replacing the keywords with synonyms to obtain the third document data.
The embodiment of the invention provides electronic equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the steps of the method for constructing the tobacco scientific and technical literature data deduplication model.
An embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for constructing the deduplication model of the tobacco scientific literature data.
According to the method and the device for constructing the tobacco science and technology literature data deduplication model provided by the embodiment of the invention, literature data of tobacco science and technology is acquired and deduplicated to obtain a deduplicated original data set; first literature data is sampled from the original data set and paired, and a deduplication data set is constructed according to the pairing result; second literature data is sampled from the original data set, a synonym matching standard of the literature data is acquired, the second literature data is converted into third literature data through the synonym matching standard, and a duplicate data set is constructed by pairing the second literature data with the third literature data; and model training is performed on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model of the literature data of tobacco science and technology. This provides a method for constructing a deduplication model dedicated to tobacco science and technology literature data, so that subsequent deduplication of such data can be carried out more efficiently.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a method for constructing a deduplication model of tobacco scientific literature data according to an embodiment of the present invention;
FIG. 2 is a block diagram of a device for constructing a de-duplication model of tobacco scientific and technical literature data according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a method for constructing a tobacco scientific and technical literature data deduplication model according to an embodiment of the present invention, and as shown in fig. 1, an embodiment of the present invention provides a method for constructing a tobacco scientific and technical literature data deduplication model, including:
Step S101, acquiring literature data of tobacco science and technology, and deduplicating the literature data to obtain a deduplicated original data set.
Specifically, the literature data used to construct the deduplication model is acquired from a database of tobacco scientific and technical literature data and then deduplicated. Here deduplication may consist of deleting identical records from the literature data, which yields literature data without repeated records; this is recorded as the original data set.
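As a rough illustration of the exact-match deduplication described in step S101, the following minimal sketch keeps the first occurrence of each record. The whitespace and case normalization and the names `normalize` and `deduplicate_exact` are assumptions made for the example; the source only says that identical data is deleted.

```python
import hashlib


def normalize(record: str) -> str:
    # Assumption: collapse whitespace and lowercase so that trivially
    # different copies of the same record compare as identical.
    return " ".join(record.split()).lower()


def deduplicate_exact(records):
    """Keep the first occurrence of each record and drop later identical ones."""
    seen = set()
    original_data_set = []
    for record in records:
        key = hashlib.sha256(normalize(record).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            original_data_set.append(record)
    return original_data_set


# Illustrative records (invented for the example).
records = [
    "Effect of curing temperature on flue-cured tobacco aroma",
    "Effect of curing  temperature on flue-cured tobacco aroma",
    "Nicotine content determination by gas chromatography",
]
original_data_set = deduplicate_exact(records)
```

Hashing the normalized text keeps memory use bounded for long records; comparing the normalized strings directly would work equally well for small collections.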
Step S102, sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set according to the pairing result.
Specifically, first literature data is randomly sampled from the original data set and the sampled records are paired with each other; for example, when two pieces of first literature data are sampled, the two are paired. The deduplication data set is constructed from the pairing results. Because the original data set has already been deduplicated, each such pair is a non-duplicate pair.
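The pairing in step S102 can be sketched as drawing two distinct entries from the original data set and labelling the pair as non-duplicate. The function name `sample_non_duplicate_pair` and the tuple layout `(left, right, label)` are hypothetical choices for this example.

```python
import random


def sample_non_duplicate_pair(original_data_set, rng):
    # rng.sample draws without replacement, so the two records are
    # distinct entries of the already deduplicated original data set.
    first, second = rng.sample(original_data_set, 2)
    return (first, second, 0)  # label 0: the pair is not a duplicate


# Illustrative records (invented for the example).
original_data_set = [
    "Effect of curing temperature on flue-cured tobacco aroma",
    "Nicotine content determination by gas chromatography",
    "Moisture control in cut tobacco drying",
]
rng = random.Random(0)  # fixed seed only to make the example repeatable
deduplication_data_set = [sample_non_duplicate_pair(original_data_set, rng) for _ in range(4)]
```

Note that `rng.sample` raises `ValueError` when fewer than two records are available, which is the desired behaviour here since no pair can be formed.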
Step S103, sampling second literature data from the original data set, acquiring a synonym matching standard of the literature data, converting the second literature data into third literature data through the synonym matching standard, and constructing a duplicate data set by pairing the second literature data with the third literature data.
Specifically, second literature data is obtained by random sampling from the original data set and converted into third literature data according to the synonym matching standard; the second literature data and the third literature data, i.e. the records before and after synonym conversion, are then paired to obtain the duplicate data set.
In addition, the synonym matching standard may be as follows: keywords in the second literature data are obtained and replaced with their synonyms from a tobacco science and technology database, which yields the third literature data. The keywords may be selected from: a) the tobacco section of the "Chinese Classification topic word List"; b) the tobacco part of the "Chinese subject word list"; c) the tobacco part of the International Convention on the Harmonized Commodity Description and Coding System; d) wording used in official documents of the tobacco industry; e) relevant national standards and industry standards of the tobacco industry; f) terms publicly published by tobacco-industry enterprises, research institutions and societies; and the like.
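A minimal sketch of the synonym conversion in step S103 follows. The thesaurus entries are invented placeholders rather than entries from the vocabularies listed above, and `to_third_record` and `make_duplicate_pair` are hypothetical names. A single regular-expression pass is used so that a replacement is never itself replaced again.

```python
import re

# Hypothetical thesaurus; a real one would be compiled from the sources listed above.
THESAURUS = {
    "flue-cured tobacco": "Virginia tobacco",
    "cut tobacco": "tobacco shred",
}


def to_third_record(second_record: str, thesaurus: dict) -> str:
    """Replace every thesaurus keyword in the record with its synonym."""
    if not thesaurus:
        return second_record
    # Longer keywords first, so a short keyword cannot split a longer one.
    keywords = sorted(thesaurus, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    return pattern.sub(lambda m: thesaurus[m.group(0)], second_record)


def make_duplicate_pair(second_record: str, thesaurus: dict):
    third_record = to_third_record(second_record, thesaurus)
    return (second_record, third_record, 1)  # label 1: the pair is a duplicate


pair = make_duplicate_pair("Aroma of flue-cured tobacco", THESAURUS)
```

The match is case-sensitive in this sketch; whether keyword matching should ignore case or word boundaries is left open by the source.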
Step S104, performing model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model of the literature data of tobacco science and technology.
Specifically, a neural network model is trained on the deduplication data set and the duplicate data set obtained in the above steps. The training may proceed, for example, as follows: the literature data in the deduplication data set is marked as 0 and the literature data in the duplicate data set is marked as 1; a twin (Siamese) neural network model is then constructed to judge whether two literature data records are duplicate records. The input of the neural network model is the pairs of literature data records labelled 0 and 1, and the output is a value in [0,1] representing the probability that the two input records are duplicate data records.
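The source does not specify the architecture of the twin network, so the following is only a deliberately tiny stand-in for the idea: one shared encoder is applied to both records, and a learned logistic output maps their cosine similarity to a value in [0,1]. The character-bigram encoder has no trainable weights here (a real twin network would learn its encoder), and the names `encode`, `predict` and `train`, the learning rate and the epoch count are all assumptions.

```python
import math
from collections import Counter


def encode(record: str) -> Counter:
    # Shared encoder: both inputs pass through this same function,
    # which plays the role of the two "twin" branches.
    text = record.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, left: str, right: str) -> float:
    """Return the estimated probability that the two records are duplicates."""
    w, b = weights
    return sigmoid(w * cosine(encode(left), encode(right)) + b)


def train(pairs, epochs=1000, lr=1.0):
    """Fit the two output parameters by stochastic gradient descent on log loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for left, right, label in pairs:
            s = cosine(encode(left), encode(right))
            grad = sigmoid(w * s + b) - label  # derivative of log loss w.r.t. the logit
            w -= lr * grad * s
            b -= lr * grad
    return w, b


# Illustrative labelled pairs: label 0 = non-duplicate, label 1 = duplicate.
training_pairs = [
    ("Aroma components of flue-cured tobacco leaves",
     "Determination of nicotine by gas chromatography", 0),
    ("Aroma components of flue-cured tobacco leaves",
     "Aroma components of Virginia tobacco leaves", 1),
]
weights = train(training_pairs)
```

Because both records go through the same `encode` function, the score is symmetric in its two inputs, which is the property a twin architecture is meant to provide.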
According to the method for constructing the tobacco science and technology literature data deduplication model provided by the embodiment of the invention, literature data of tobacco science and technology is acquired and deduplicated to obtain a deduplicated original data set; first literature data is sampled from the original data set and paired, and a deduplication data set is constructed according to the pairing result; second literature data is sampled from the original data set, a synonym matching standard of the literature data is acquired, the second literature data is converted into third literature data through the synonym matching standard, and a duplicate data set is constructed by pairing the second literature data with the third literature data; and model training is performed on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model of the literature data of tobacco science and technology. This provides a method for constructing a deduplication model dedicated to tobacco science and technology literature data, so that subsequent deduplication of such data can be carried out more efficiently.
On the basis of the above embodiment, the method for constructing the tobacco science and technology literature data deduplication model further includes:
detecting whether the data volumes of the deduplication data set and the duplicate data set reach a preset data volume standard;
and when the data volumes of the deduplication data set and the duplicate data set do not reach the preset data volume standard, repeating the steps of constructing the deduplication data set and the duplicate data set until both data volumes reach the preset data volume standard.
In the embodiment of the present invention, before model training is performed on the deduplication data set and the duplicate data set through the neural network model, it may further be detected whether the data volumes of the two data sets reach a preset data volume standard. For example, when the data volume is detected to be insufficient to train a model with high reliability, the steps of constructing the deduplication data set and the duplicate data set are repeated until both data volumes reach the preset data volume standard, that is, until a model with high reliability can be constructed; model training is then performed through the neural network model.
According to the embodiment of the invention, whether a model with high reliability can be constructed is detected, and when the data volume reaches the standard, the model is constructed, so that the reliability of the model is ensured.
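The volume check can be sketched as a loop that keeps generating labelled pairs until each label reaches a minimum count. The per-label threshold `min_per_label` is an assumed form of the "preset data volume standard", and the two generator callables are placeholders for the pair construction steps described earlier.

```python
from collections import Counter


def meets_volume_standard(training_pairs, min_per_label: int) -> bool:
    # Assumption: the standard is a minimum number of pairs for each label.
    counts = Counter(label for _, _, label in training_pairs)
    return counts[0] >= min_per_label and counts[1] >= min_per_label


def build_until_standard(make_non_duplicate, make_duplicate, min_per_label: int):
    """Repeat both construction steps until the volume standard is met."""
    training_pairs = []
    while not meets_volume_standard(training_pairs, min_per_label):
        training_pairs.append(make_non_duplicate())
        training_pairs.append(make_duplicate())
    return training_pairs


# Placeholder generators standing in for the real sampling steps.
pairs = build_until_standard(
    make_non_duplicate=lambda: ("record A", "record B", 0),
    make_duplicate=lambda: ("record A", "record A'", 1),
    min_per_label=3,
)
```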
On the basis of the above embodiment, the method for constructing the tobacco science and technology literature data deduplication model further includes:
and acquiring a preset data weight table, and distributing weights to the data in the original data set according to the data weight table to obtain the original data set after weight distribution, wherein the weights of the original data set are used for adjusting the sampling probability when the original data set is sampled.
In the embodiment of the present invention, a preset data weight table is acquired. The data weight table may assign weights according to the keywords in the literature data: for example, keywords with high attention and high occurrence frequency receive high weights, and keywords with low attention and low occurrence frequency receive low weights. The weight of each piece of literature data is then assigned according to the weights of its keywords. These weights affect the sampling probability when literature data is sampled from the original data set: for example, the greater the weight of a piece of literature data, the higher its probability of being sampled.
According to the embodiment of the invention, the sampling probability of the literature data is adjusted through the weights, so that subsequent model construction focuses more on literature data with larger weights, making the model more efficient at duplicate checking.
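Weighted sampling of this kind might look like the sketch below. The weight table entries, the default weight, and the rule that a record takes the highest weight among the keywords it contains are all assumptions for illustration; the source only states that record weights derive from keyword weights.

```python
import random

# Hypothetical data weight table: keyword -> weight.
KEYWORD_WEIGHTS = {"heated tobacco": 3.0, "nicotine": 2.0}
DEFAULT_WEIGHT = 1.0


def record_weight(record: str, table: dict) -> float:
    # Assumption: a record takes the highest weight among the keywords it contains.
    text = record.lower()
    hits = [weight for keyword, weight in table.items() if keyword in text]
    return max(hits, default=DEFAULT_WEIGHT)


def weighted_sample(records, table, k, rng):
    """Draw k records, with probability proportional to each record's weight."""
    weights = [record_weight(r, table) for r in records]
    # random.choices samples with replacement.
    return rng.choices(records, weights=weights, k=k)


records = ["Heated tobacco aerosol study", "Leaf moisture study"]
rng = random.Random(42)  # fixed seed only to make the example repeatable
draws = weighted_sample(records, KEYWORD_WEIGHTS, 2000, rng)
```

With these weights the first record is expected to be drawn about three times as often as the second.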
On the basis of the above embodiment, the method for constructing the tobacco science and technology literature data deduplication model further includes:
receiving data to be deduplicated, and randomly selecting data from the original data set to pair with the data to be deduplicated, to obtain a data pair to be checked;
inputting the data pair to be checked into the deduplication model to obtain the duplication probability of the data to be deduplicated;
and acquiring a duplicate judgment threshold, comparing the duplication probability with the duplicate judgment threshold, and judging whether the data to be deduplicated is a duplicate according to the comparison result.
In the embodiment of the invention, after the deduplication model is constructed, any piece of literature data is randomly selected from the deduplicated original data set and paired with the data to be deduplicated. The pair is input into the deduplication model to calculate the duplication probability, which is compared with the duplicate judgment threshold; when the duplication probability is greater than the threshold, the data to be deduplicated is judged to be a duplicate.
In addition, when the literature data to be deduplicated is judged not to be a duplicate of any piece of literature data extracted from the deduplicated original data set, it is marked as non-duplicate and added to the deduplicated literature database.
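The threshold decision can be sketched as follows. The `score` argument stands for the trained deduplication model; the word-overlap `toy_score` used here is only a stand-in so that the example runs on its own, and the threshold value 0.5 is an assumption. Scanning every stored record, rather than one randomly selected record, is one reading of the paragraph above.

```python
def toy_score(left: str, right: str) -> float:
    # Stand-in for the trained model: Jaccard overlap of lowercase words.
    a, b = set(left.lower().split()), set(right.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def is_duplicate(candidate, original_data_set, score, threshold=0.5) -> bool:
    """True if the candidate exceeds the threshold against any stored record."""
    return any(score(candidate, existing) > threshold for existing in original_data_set)


def ingest(candidate, original_data_set, score, threshold=0.5) -> bool:
    """Add the candidate to the data set unless it is judged a duplicate."""
    if is_duplicate(candidate, original_data_set, score, threshold):
        return False
    original_data_set.append(candidate)
    return True


database = ["Aroma of flue-cured tobacco leaves"]
rejected = ingest("Aroma of flue-cured tobacco", database, toy_score)
accepted = ingest("Nicotine determination by gas chromatography", database, toy_score)
```

Passing the scorer as a parameter keeps the decision logic independent of how the model is implemented.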
In another embodiment, the method for constructing the tobacco science and technology literature data deduplication model may include the following steps:
Step (1.1), collecting and organizing the existing tobacco scientific and technical literature data and performing deduplication processing by a labeling method; after this step a tobacco scientific and technical literature data set without repeated data is obtained, which is recorded as the original data set.
Step (1.2), according to the original data set obtained in step (1.1), obtaining a deduplication training set in the following way:
(1.2.A) starting with an empty deduplication training set;
(1.2.B) randomly sampling two literature data records from the original data set, marking the record pair as non-duplicate (value 0), and adding the pair and its mark to the deduplication training set;
(1.2.C) randomly sampling one literature data record from the original data set, replacing the tobacco science and technology keywords contained in its content with synonyms or near-synonyms with a certain probability (less than 30%) to obtain a new literature data record, marking the pair formed by the two records as duplicate (value 1), and adding the pair and its mark to the deduplication training set;
(1.2.D) repeating (1.2.B) or (1.2.C) until the numbers of records marked as duplicate and non-duplicate in the deduplication training set meet the model training requirement.
Step (1.3), constructing a twin (Siamese) neural network model for judging whether two literature data records are duplicate records; the input of the neural network model is two literature data records, and the output is a value in [0,1] representing the probability that the two input records are duplicate data records;
Step (1.4), training the neural network model constructed in step (1.3) with the deduplication training set generated in step (1.2), and saving the trained model, which is recorded as the tobacco science and technology literature data deduplication model.
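The probabilistic keyword replacement of step (1.2.C) could be sketched as below, where each keyword occurrence is replaced independently with probability `p`. The default `p = 0.25` is merely one value under the stated 30% bound, and the thesaurus entries and the name `replace_keywords` are hypothetical.

```python
import random
import re

# Hypothetical thesaurus of tobacco keywords and their synonyms.
THESAURUS = {
    "flue-cured tobacco": "Virginia tobacco",
    "cut tobacco": "tobacco shred",
}


def replace_keywords(record: str, thesaurus: dict, p: float, rng) -> str:
    """Replace each keyword occurrence with its synonym with probability p."""
    if not thesaurus:
        return record
    keywords = sorted(thesaurus, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keywords))

    def maybe_swap(match):
        keyword = match.group(0)
        # rng.random() lies in [0, 1), so p = 1.0 always swaps and p = 0.0 never does.
        return thesaurus[keyword] if rng.random() < p else keyword

    return pattern.sub(maybe_swap, record)


rng = random.Random(0)
record = "Blending flue-cured tobacco into cut tobacco"
new_record = replace_keywords(record, THESAURUS, 0.25, rng)
duplicate_pair = (record, new_record, 1)  # value 1 marks the pair as duplicate
```

With a low probability the new record may come out identical to the original; under the labelling of step (1.2.C) such a pair still counts as a duplicate.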
Fig. 2 shows a device for constructing a tobacco science and technology literature data deduplication model according to an embodiment of the present invention, including: an acquisition module 201, a first sampling module 202, a second sampling module 203, and a training module 204, wherein:
The acquisition module 201 is configured to acquire literature data of tobacco science and technology, and deduplicate the literature data to obtain a deduplicated original data set.
The first sampling module 202 is configured to sample first literature data from the original data set, pair the first literature data, and construct a deduplication data set according to the pairing result.
The second sampling module 203 is configured to sample second literature data from the original data set, acquire a synonym matching standard of the literature data, convert the second literature data into third literature data through the synonym matching standard, and construct a duplicate data set by pairing the second literature data with the third literature data.
The training module 204 is configured to perform model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model of the literature data of tobacco science and technology.
In one embodiment, the apparatus may further comprise:
and the second acquisition module is used for acquiring the keywords in the second document data and replacing the keywords with synonyms to obtain third document data.
In one embodiment, the apparatus may further comprise:
the detection module is used for detecting whether the data volumes of the deduplication data set and the duplicate data set reach a preset data volume standard;
and the repeated sampling module is used for, when the data volumes of the deduplication data set and the duplicate data set do not reach the preset data volume standard, repeating the steps of constructing the deduplication data set and the duplicate data set until both data volumes reach the preset data volume standard.
In one embodiment, the apparatus may further comprise:
and the third acquisition module is used for acquiring a preset data weight table, distributing weights to the data in the original data set according to the data weight table to obtain the original data set after weight distribution, wherein the weights of the original data set are used for adjusting the sampling probability when the original data set is sampled.
In one embodiment, the apparatus may further comprise:
the receiving module is used for receiving data to be deduplicated and randomly selecting data from the original data set to pair with the data to be deduplicated, to obtain a data pair to be checked.
The input module is used for inputting the data pair to be checked into the deduplication model to obtain the duplication probability of the data to be deduplicated.
The fourth acquisition module is used for acquiring a duplicate judgment threshold, comparing the duplication probability with the duplicate judgment threshold, and judging whether the data to be deduplicated is a duplicate according to the comparison result.
In one embodiment, the apparatus may further comprise:
and the marking module is used for marking the document data in the deduplication data set as 0 and marking the document data in the duplicate data set as 1.
And the second training module is used for performing model training, through a neural network model, on the literature data and marks of the deduplication data set and the literature data and marks of the duplicate data set.
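As a non-limiting illustration, marking and training can be sketched with a single sigmoid unit, used here as the smallest possible stand-in for the neural network model. The network architecture, the single overlap feature computed by `features`, and the names `label_pairs`, `train` and `predict` are all assumptions of this sketch and are not taken from the embodiment.

```python
import math

def features(pair):
    """One hypothetical input feature: Jaccard overlap of the two word sets."""
    a, b = (set(text.lower().split()) for text in pair)
    union = a | b
    return [len(a & b) / len(union) if union else 0.0]

def label_pairs(dedup_set, dup_set):
    """Mark deduplication-set pairs as 0 and duplicate-set pairs as 1."""
    return [(pair, 0) for pair in dedup_set] + [(pair, 1) for pair in dup_set]

def train(labelled, epochs=500, lr=1.0):
    """Fit one sigmoid unit by stochastic gradient descent on the log loss."""
    w, b = [0.0], 0.0
    for _ in range(epochs):
        for pair, y in labelled:
            x = features(pair)
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # derivative of the log loss with respect to z
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def predict(model, pair):
    """Return the repetition probability of a pair under the fitted unit."""
    w, b = model
    z = sum(wi * xi for wi, xi in zip(w, features(pair))) + b
    return 1.0 / (1.0 + math.exp(-z))

dedup_set = [("tobacco leaf moisture", "cigarette paper porosity"),
             ("smoke tar yield", "leaf curing barn")]
dup_set = [("tobacco leaf moisture", "tobacco leaf water content"),
           ("smoke tar yield", "smoke tar output")]
model = train(label_pairs(dedup_set, dup_set))
```

A practical model would operate on richer text representations than a single overlap score; the sketch only shows how the 0 and 1 marks enter training as targets.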
For specific limitations of the device for constructing the tobacco science and technology literature data deduplication model, reference may be made to the above limitations on the method for constructing the tobacco science and technology literature data deduplication model, and details are not repeated here. All or part of each module in the device for constructing the tobacco science and technology literature data deduplication model can be realized by software, hardware, or a combination thereof. The modules can be embedded in, or be independent of, a processor in the computer device in hardware form, and can also be stored in a memory in the computer device in software form, so that the processor can call and execute operations corresponding to the modules.
Fig. 3 illustrates the physical structure of an electronic device. As shown in Fig. 3, the electronic device may include a processor 301, a memory 302, a communications interface 303 and a communication bus 304, wherein the processor 301, the memory 302 and the communications interface 303 communicate with one another through the communication bus 304. The processor 301 may call logic instructions in the memory 302 to perform the following method: acquiring literature data of tobacco science and technology, and deduplicating the literature data to obtain a deduplicated original data set; sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set according to the pairing result; sampling second literature data from the original data set, acquiring a synonymous matching standard of the literature data, converting the second literature data into third literature data through the synonymous matching standard, and establishing a duplicate data set by pairing the second literature data with the third literature data; and performing model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model of the literature data of tobacco science and technology.
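As a non-limiting illustration, the first step of the method, deduplicating the acquired literature data to obtain the original data set, can be sketched as follows. Exact matching on a normalised title is an assumption of this sketch; the embodiment does not specify how the initial deduplication is performed, and the names `normalise` and `deduplicate` are hypothetical.

```python
def normalise(title):
    """Lower-case the title and collapse runs of whitespace."""
    return " ".join(title.lower().split())

def deduplicate(records):
    """Keep the first occurrence of each normalised title, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = normalise(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

raw = ["Tobacco  Leaf Moisture", "tobacco leaf moisture", "Smoke Analysis"]
original_data_set = deduplicate(raw)
```

Rule-based matching of this kind only removes literal repeats; the trained deduplication model is what is intended to catch reworded duplicates.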
Furthermore, the logic instructions in the memory 302 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and various other media capable of storing program code.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the method provided in the foregoing embodiments, the method including, for example: acquiring literature data of tobacco science and technology, and deduplicating the literature data to obtain a deduplicated original data set; sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set according to the pairing result; sampling second literature data from the original data set, acquiring a synonymous matching standard of the literature data, converting the second literature data into third literature data through the synonymous matching standard, and establishing a duplicate data set by pairing the second literature data with the third literature data; and performing model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model of the literature data of tobacco science and technology.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A construction method of a tobacco science and technology literature data deduplication model is characterized by comprising the following steps:
acquiring literature data of tobacco science and technology, and deduplicating the literature data to obtain a deduplicated original data set;
sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set according to the pairing result;
sampling second literature data from the original data set, acquiring a synonymous matching standard of the literature data, converting the second literature data into third literature data through the synonymous matching standard, and establishing a duplicate data set by pairing the second literature data with the third literature data;
and performing model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model of the literature data of tobacco science and technology.
2. The method for constructing the tobacco science and technology literature data deduplication model according to claim 1, wherein the converting the second literature data into third literature data through the synonymous matching criterion comprises:
and acquiring the keywords in the second literature data, and replacing the keywords with synonyms to obtain the third literature data.
3. The method for constructing a tobacco science and technology literature data deduplication model according to claim 1, wherein before the model training of the deduplication data set and the duplicate data set through a neural network model, the method further comprises:
detecting whether the data volumes of the deduplication data set and the duplicate data set reach a preset data volume standard;
and when the data volumes of the deduplication data set and the duplicate data set do not reach the preset data volume standard, repeating the steps of constructing the deduplication data set and the duplicate data set until the data volumes of the deduplication data set and the duplicate data set reach the preset data volume standard.
4. The method for constructing a tobacco science and technology literature data deduplication model according to claim 1, wherein after obtaining the deduplicated original data set, the method further comprises:
and acquiring a preset data weight table, and distributing weights to the data in the original data set according to the data weight table to obtain the original data set after weight distribution, wherein the weights of the original data set are used for adjusting the sampling probability when the original data set is sampled.
5. The method for constructing the tobacco science and technology literature data deduplication model according to claim 1, wherein after obtaining the deduplication model of the tobacco science and technology literature data, the method further comprises:
receiving data to be deduplicated, and randomly selecting data from the original data set to pair with the data to be deduplicated, to obtain a data pair to be deduplicated;
inputting the data pair to be deduplicated into the deduplication model to obtain the repetition probability of the data pair;
and acquiring a repetition judgment threshold, comparing the repetition probability with the repetition judgment threshold, and judging whether the data to be deduplicated is a duplicate according to the comparison result.
6. The method for constructing the tobacco science and technology literature data deduplication model according to claim 1, wherein the method further comprises:
marking the document data in the deduplication data set as 0, and marking the document data in the duplicate data set as 1;
wherein the performing model training on the deduplication data set and the duplicate data set through a neural network model comprises:
performing model training, through the neural network model, on the literature data and marks of the deduplication data set and the literature data and marks of the duplicate data set.
7. A tobacco science and technology literature data deduplication model construction device is characterized by comprising:
the acquisition module is used for acquiring literature data of tobacco science and technology, and removing the duplicate of the literature data to obtain an original data set after the duplicate is removed;
the first sampling module is used for sampling first literature data from the original data set, pairing the first literature data, and constructing a deduplication data set according to the pairing result;
the second sampling module is used for sampling second literature data from the original data set, acquiring a synonymous matching standard of the literature data, converting the second literature data into third literature data through the synonymous matching standard, and establishing a duplicate data set by pairing the second literature data with the third literature data;
and the training module is used for performing model training on the deduplication data set and the duplicate data set through a neural network model to obtain a deduplication model of the literature data of tobacco science and technology.
8. The apparatus for constructing a tobacco technology literature data deduplication model according to claim 7, wherein the apparatus further comprises:
and the second acquisition module is used for acquiring the keywords in the second document data and replacing the keywords with synonyms to obtain the third document data.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for constructing a de-duplication model of tobacco technology literature data according to any one of claims 1 to 6 when executing the program.
10. A non-transitory computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for constructing a tobacco technology literature data deduplication model according to any one of claims 1 to 6.
CN202011070240.0A 2020-10-09 2020-10-09 Construction method and device of tobacco science and technology literature data deduplication model Active CN112115236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011070240.0A CN112115236B (en) 2020-10-09 2020-10-09 Construction method and device of tobacco science and technology literature data deduplication model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011070240.0A CN112115236B (en) 2020-10-09 2020-10-09 Construction method and device of tobacco science and technology literature data deduplication model

Publications (2)

Publication Number Publication Date
CN112115236A true CN112115236A (en) 2020-12-22
CN112115236B CN112115236B (en) 2024-02-02

Family

ID=73796859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011070240.0A Active CN112115236B (en) 2020-10-09 2020-10-09 Construction method and device of tobacco science and technology literature data deduplication model

Country Status (1)

Country Link
CN (1) CN112115236B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170235820A1 (en) * 2016-01-29 2017-08-17 Jack G. Conrad System and engine for seeded clustering of news events
CN110941598A (en) * 2019-12-02 2020-03-31 北京锐安科技有限公司 Data deduplication method, device, terminal and storage medium
CN111274364A (en) * 2020-02-14 2020-06-12 江苏润桐数据服务有限公司 Automatic denoising method and device based on keyword retrieval data
CN111355725A (en) * 2020-02-26 2020-06-30 北京邮电大学 Method and device for detecting network intrusion data
CN111414906A (en) * 2020-03-05 2020-07-14 北京交通大学 Data synthesis and text recognition method for paper bill picture
WO2020147238A1 (en) * 2019-01-18 2020-07-23 平安科技(深圳)有限公司 Keyword determination method, automatic scoring method, apparatus and device, and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
倪娜 等: "科技文献关键词自动标注算法研究", 计算机科学, vol. 39, no. 09 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113210264A (en) * 2021-05-19 2021-08-06 江苏鑫源烟草薄片有限公司 Method and device for removing tobacco impurities
CN113210264B (en) * 2021-05-19 2023-09-05 江苏鑫源烟草薄片有限公司 Tobacco sundry removing method and device

Also Published As

Publication number Publication date
CN112115236B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN105005594B (en) Abnormal microblog users recognition methods
CN109656999B (en) Method, device, storage medium and apparatus for synchronizing large data volume data
CN108334489B (en) Text core word recognition method and device
CN105095223A (en) Method for classifying texts and server
CN101630312A (en) Clustering method for question sentences in question-and-answer platform and system thereof
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
CN103313248A (en) Method and device for identifying junk information
CN105389389A (en) Network public opinion transmission situation media linked analysis method
CN106874448B (en) Method and device for mining earthquake subject term from microblog
CN104035941A (en) Information screening method and device
CN110767211B (en) Voice synthesis broadcasting system based on text content data cleaning
CN112115236B (en) Construction method and device of tobacco science and technology literature data deduplication model
CN111782970B (en) Data analysis method and device
CN112989791A (en) Duplication eliminating method, system and medium based on text information extraction result
CN108475265B (en) Method and device for acquiring unknown words
CN109858017B (en) Data processing method and electronic equipment
CN112115237B (en) Construction method and device of tobacco science and technology literature data recommendation model
CN108038124B (en) PDF document acquisition and processing method, system and device based on big data
CN107329956B (en) Project information standardization method and device
CN115794997A (en) Enterprise matching degree processing method and device based on enterprise labels
CN104484330A (en) Pre-selecting method and device of spam comments based on grading keyword threshold combination evaluation
CN114547233A (en) Data duplicate checking method and device and electronic equipment
CN111026835B (en) Chat subject detection method, device and storage medium
CN106933797B (en) Target information generation method and device
CN112905797A (en) Scenic spot multi-dimensional vulnerability assessment method based on MNLP

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant