WO2021217930A1 - Classification-model-based paper classification method and apparatus, electronic device, and medium - Google Patents

Classification-model-based paper classification method and apparatus, electronic device, and medium Download PDF

Info

Publication number
WO2021217930A1
Authority
WO
WIPO (PCT)
Prior art keywords
paper
sample
target
samples
document
Prior art date
Application number
PCT/CN2020/105627
Other languages
English (en)
French (fr)
Inventor
刘玉
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021217930A1 publication Critical patent/WO2021217930A1/zh

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 - Indexing; Data structures therefor; Storage structures
    • G06F16/316 - Indexing structures
    • G06F16/325 - Hash tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • This application relates to the field of data processing technology, and in particular to a method, device, electronic device and medium for categorizing papers based on a classification model.
  • the inventor realized that existing classification-model-based paper classification schemes have the following problems: first, the accuracy of classification-model-based paper classification is low; second, such schemes can only process data in a fixed form of expression.
  • the papers to be classified can be predicted through multiple classification models, and accurate target results can be obtained.
  • the first aspect of this application provides a paper classification method based on a classification model.
  • the paper classification method based on the classification model includes:
  • structured processing is performed on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure;
  • the most frequent prediction result is determined as the target result of the paper to be classified.
  • a second aspect of the present application provides an electronic device including a processor and a memory, and the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:
  • structured processing is performed on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure;
  • the most frequent prediction result is determined as the target result of the paper to be classified.
  • a third aspect of the present application provides a computer-readable storage medium having at least one computer-readable instruction stored thereon, and the at least one computer-readable instruction is executed by a processor to implement the following steps:
  • structured processing is performed on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure;
  • the most frequent prediction result is determined as the target result of the paper to be classified.
  • the fourth aspect of the present application provides a paper classification device based on a classification model.
  • the paper classification device based on the classification model includes:
  • the acquisition unit is used to obtain a paper sample set;
  • the processing unit is used to perform structured processing on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure;
  • the construction unit is used to construct a training data set corresponding to each structure based on the multiple structures of each paper sample and the region information corresponding to each structure;
  • the training unit is used to train the training samples in each training data set separately to obtain the classification model corresponding to each structure;
  • An extraction unit configured to obtain a paper to be classified, and extract text information from the paper to be classified according to the multiple structures to obtain text information corresponding to each structure;
  • the preprocessing unit is used to preprocess the text information corresponding to each structure to obtain the input information of each structure;
  • the input unit is used to input the input information of each structure into the corresponding classification model to obtain the prediction result of each classification model for the paper to be classified;
  • the determining unit is used to determine the most frequent prediction result as the target result of the paper to be classified.
  • this application uses multiple classification models to predict the papers to be classified and can obtain accurate target results.
  • Fig. 1 is a flowchart of a preferred embodiment of a paper classification method based on a classification model disclosed in the present application.
  • Fig. 2 is a functional block diagram of a preferred embodiment of a paper classification device based on a classification model disclosed in the present application.
  • FIG. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the present application for implementing a paper classification method based on a classification model.
  • FIG. 1 it is a flowchart of a preferred embodiment of the paper classification method based on the classification model of the present application. According to different needs, the order of the steps in the flowchart can be changed, and some steps can be omitted.
  • the paper classification method based on the classification model is applied to one or more electronic devices.
  • the electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, etc.
  • the electronic device may be any electronic product capable of human-computer interaction with a user, such as a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol television (Internet Protocol Television, IPTV), smart wearable devices, etc.
  • the electronic device may also include a network device and/or user equipment.
  • the network device includes, but is not limited to, a single network electronic device, an electronic device group composed of multiple network electronic devices, or a cloud composed of a large number of hosts or network electronic devices based on cloud computing.
  • the network where the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), etc.
  • the paper sample set contains multiple paper samples, and each paper sample includes a paper text and the paper category corresponding to that text. Further, each paper sample has been cleaned of garbled text and deduplicated.
  • the electronic device acquiring a paper sample collection includes:
  • the electronic device scans and recognizes paper-version thesis books based on Optical Character Recognition (OCR) technology to obtain scanned electronic thesis documents, and crawls electronic thesis documents from preset websites based on web crawler technology to obtain crawled electronic thesis documents; further, the electronic device cleans the scanned and crawled electronic thesis documents to obtain paper samples, and assembles the paper samples into the paper sample set.
  • the paper-version thesis books include papers and the paper categories corresponding to the papers; further, the information on the preset websites includes papers and the paper categories corresponding to the papers.
  • all electronic thesis documents include the scanned electronic thesis documents and the crawled electronic thesis documents.
  • the electronic device cleaning the scanned and crawled electronic thesis documents to obtain the paper samples includes:
  • the electronic device traverses the text information in all electronic thesis documents and, when the traversed text information is garbled, deletes the electronic thesis document whose text information is garbled and takes the retained electronic thesis documents as the target thesis documents,
  • the electronic device calculates the hash value of each target thesis document from its title, extracts preset features from each target thesis document, and builds a feature index,
  • the electronic device uses the cosine distance formula to calculate, from the hash values of every two target thesis documents, their similarity distance, obtaining the similarity distance of each document pair, where each document pair includes any two target thesis documents; it searches the feature index for document pairs whose similarity distance is greater than a preset value and determines each such pair as a similar document pair; further, the electronic device judges whether the preset features in the similar document pair are the same, and when they are the same, deletes either document of the pair and determines the retained documents as the paper samples.
  • electronic thesis documents with garbled text information can be deleted, preventing them from affecting subsequent model training.
  • duplicate thesis documents can be deleted, which not only reduces the electronic device's memory usage but also reduces the threads occupied by processing duplicate documents.
  • the above-mentioned paper sample collection can also be stored in a node of a blockchain.
  • S11: Structured processing is performed on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure.
  • the multiple structures include the title, abstract, introduction, related work, main body, experimental results, conclusion, and references.
  • region information refers to the text information of the paper sample under each structure.
  • the electronic device performing structured processing on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure includes:
  • for each paper sample, the electronic device searches the document information paragraph by paragraph for the tags corresponding to the multiple structures, the tags being title, abstract, introduction, related work, main body, experimental results, conclusion, and references,
  • when a tag is found, the electronic device confirms the found tag as a structure and extracts the information corresponding to the found tag as the region information.
  • the electronic device constructing a training data set corresponding to each structure based on the multiple structures of each paper sample and the region information corresponding to each structure includes:
  • the electronic device determines each structure, the region information corresponding to each structure, and the paper category corresponding to each piece of region information as one structure sample, and merges structure samples with the same structure into the same set to obtain multiple first sets; for each first set, the electronic device calculates the hash value of each structure sample from its region information; further, the electronic device calculates, from the hash values, the similarity of any two structure samples in the first set and determines any two structure samples whose similarity is 1 as a target sample pair,
  • the electronic device compares whether the paper categories of the target sample pair are the same, and if they are different, deletes the target sample pair from the first set to obtain a second set; further, the electronic device counts the number of structure samples of each paper category in the second set and compares that number with a preset threshold, and when the number is less than the preset threshold, the electronic device increases the number of structure samples of the corresponding paper category by a perturbation method until the number is greater than or equal to the preset threshold, obtaining the training data set.
  • the electronic device separately training the training samples in each training data set to obtain a classification model corresponding to each structure includes:
  • for each training data set, the electronic device randomly selects training samples, test samples, and validation samples from the training data set, and segments the sample information in the training samples to obtain the phrases of each piece of sample information; further, the electronic device one-hot encodes each phrase to obtain its encoding vector, and generates the phrase's position vector from its position number in the sample information; further, the electronic device concatenates the phrase's encoding vector and position vector to obtain the feature vector of each phrase of each piece of sample information, and models the feature vectors of each piece of sample information based on RoBERTa to obtain a learner; the electronic device inputs the test samples into the learner and calculates the proportion of test samples that pass the test, and when that proportion is less than a target value, the electronic device adjusts the learner according to the validation samples to obtain the classification model.
  • the electronic device adjusting the learner according to the validation samples to obtain the classification model includes:
  • the electronic device uses a hyperparameter grid search to obtain the optimal hyperparameter point from the validation samples; further, the electronic device adjusts the learner with the optimal hyperparameter point to obtain the classification model.
  • specifically, the electronic device splits the validation samples by a fixed step size to obtain a target subset, traverses the parameters at the two endpoints of the target subset, validates the learner with the endpoint parameters to obtain the learning rate of each parameter, determines the parameter with the best learning rate as the first hyperparameter point, and, within the neighborhood of the first hyperparameter point, shrinks the step size and continues traversing until the step size equals a preset step size; the hyperparameter point obtained at that point is the optimal hyperparameter point; further, the electronic device adjusts the learner according to the optimal hyperparameter point to obtain the classification model.
  • this application places no restriction on the preset step size.
  • S14: Obtain a paper to be classified, and extract text information from the paper to be classified according to the multiple structures to obtain the text information corresponding to each structure.
  • the paper to be classified can be obtained from a request triggered by a user.
  • the method for the electronic device to extract text information from the paper to be classified according to the multiple structures is the same as the method for extracting the region information, which will not be repeated in this application.
  • the electronic device preprocessing the text information corresponding to each structure to obtain the input information of each structure includes:
  • for the text information corresponding to each structure, the electronic device segments the text information according to the preset phrases in a preset dictionary to obtain segmentation positions, and builds at least one directed acyclic graph from the segmentation positions; the electronic device calculates the probability of each directed acyclic graph according to the weights in the preset dictionary and determines the segmentation positions of the most probable graph as the target segmentation positions; the electronic device determines the first word segmentation according to the target segmentation positions.
  • further, the electronic device filters the stop words in the first word segmentation with a stop-word list to obtain the second word segmentation; the electronic device calculates the proportion of each second-segmentation word in the training data set and deletes the words whose proportion exceeds a configured value to obtain the third word segmentation; the electronic device calculates the frequency of the third-segmentation words in the text information and sorts them from high to low frequency to obtain a queue.
  • the electronic device selects the first N characters from the queue as the input information, where N is a positive integer greater than 0.
  • S16: Input the input information of each structure into the corresponding classification model to obtain each classification model's prediction result for the paper to be classified.
  • after the input information of each structure is input into the corresponding classification model, the electronic device runs the classification model on the input information and takes the computed result as the prediction result.
  • the method further includes:
  • when every classification model's prediction result for the paper to be classified is the same, the electronic device records the paper to be classified and the target result into the paper sample set.
  • when multiple prediction results tie for the largest count, the electronic device determines the tied prediction results as results to be determined; further, the electronic device obtains target samples, tests each classification model with the target samples, and calculates each model's target proportion, i.e. the proportion of target samples that pass the test; the electronic device takes each classification model's target proportion as that model's weight, performs a weighted sum for each result to be determined according to the weights to obtain each result's prediction score, and determines the result to be determined with the highest prediction score as the target result.
  • for example, the prediction result of the title classification model is result A
  • the prediction result of the abstract classification model is result B
  • the prediction result of the introduction classification model is result C
  • the prediction result of the related-work classification model is result A
  • the prediction result of the main-body classification model is result B
  • the prediction result of the experimental-results classification model is result C
  • the prediction result of the conclusion classification model is result A
  • the prediction result of the references classification model is result B.
  • result A is predicted 3 times, result B 3 times, and result C 2 times, so two prediction results tie for the largest count and the results to be determined are A and B.
  • the target proportion of the title classification model is 0.8, that of the abstract classification model is 0.6, that of the introduction classification model is 0.5, and that of the related-work classification model is 0.8
  • the target proportion of the main-body classification model is 0.4
  • the target proportion of the experimental-results classification model is 0.7
  • the target proportion of the conclusion classification model is 0.8
  • the target proportion of the references classification model is 0.9.
  • each target proportion is used as a weight.
  • the prediction score of each result to be determined is obtained: result A scores 0.8 + 0.8 + 0.8 = 2.4 and result B scores 0.6 + 0.4 + 0.9 = 1.9.
  • the prediction score of result A is the highest, and result A is determined as the target result.
  • this application uses multiple classification models to predict the papers to be classified and can obtain accurate target results.
  • FIG. 2 it is a functional block diagram of a preferred embodiment of the paper classification device based on the classification model of the present application.
  • the article classification device 11 based on the classification model includes an acquisition unit 110, a processing unit 111, a construction unit 112, a training unit 113, an extraction unit 114, a preprocessing unit 115, an input unit 116, a determination unit 117, and an entry unit 118.
  • a module/unit referred to in this application is a series of computer program segments that can be executed by the processor 13 and complete fixed functions, and that are stored in the memory 12. In this embodiment, the functions of each module/unit will be detailed in subsequent embodiments.
  • the obtaining unit 110 obtains a paper sample collection.
  • the above-mentioned paper sample collection can also be stored in a node of a blockchain.
  • the paper sample set contains multiple paper samples, and each paper sample includes a paper text and the paper category corresponding to that text. Further, each paper sample has been cleaned of garbled text and deduplicated.
  • the acquiring unit 110 acquiring a paper sample collection includes:
  • the acquisition unit 110 scans and recognizes paper-version thesis books based on optical character recognition (Optical Character Recognition, OCR) technology to obtain scanned electronic thesis documents, and crawls electronic thesis documents from preset websites based on web crawler technology to obtain crawled electronic thesis documents; further, the acquisition unit 110 cleans the scanned and crawled electronic thesis documents to obtain paper samples, and assembles the paper samples into the paper sample set.
  • the paper-version thesis books include papers and the paper categories corresponding to the papers; further, the information on the preset websites includes papers and the paper categories corresponding to the papers.
  • all electronic thesis documents include the scanned electronic thesis documents and the crawled electronic thesis documents.
  • the acquisition unit 110 cleaning the scanned and crawled electronic thesis documents to obtain the paper samples includes:
  • the acquisition unit 110 traverses the text information in all electronic thesis documents and, when the traversed text information is garbled, deletes the electronic thesis document whose text information is garbled and takes the retained electronic thesis documents as the target thesis documents,
  • the acquisition unit 110 calculates the hash value of each target thesis document from its title, extracts preset features from each target thesis document, and builds a feature index,
  • the acquisition unit 110 uses the cosine distance formula to calculate, from the hash values of every two target thesis documents, their similarity distance, obtaining the similarity distance of each document pair, where each document pair includes any two target thesis documents; it searches the feature index for document pairs whose similarity distance is greater than a preset value and determines each such pair as a similar document pair.
  • the acquisition unit 110 judges whether the preset features in the similar document pair are the same, and when they are the same, deletes either document of the pair and determines the retained documents as the paper samples.
  • electronic thesis documents with garbled text information can be deleted, preventing them from affecting subsequent model training.
  • duplicate thesis documents can be deleted, which not only reduces the electronic device's memory usage but also reduces the threads occupied by processing duplicate documents.
  • the processing unit 111 performs structured processing on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and regional information corresponding to each structure.
  • the multiple structures include the title, abstract, introduction, related work, main body, experimental results, conclusion, and references.
  • region information refers to the text information of the paper sample under each structure.
  • the processing unit 111 performing structured processing on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure includes:
  • for each paper sample, the processing unit 111 searches the document information paragraph by paragraph for the tags corresponding to the multiple structures,
  • the tags being title, abstract, introduction, related work, main body, experimental results, conclusion, and references,
  • when a tag is found, the processing unit 111 confirms the found tag as a structure and extracts the information corresponding to the found tag as the region information.
  • the construction unit 112 constructs a training data set corresponding to each structure based on the multiple structures of each paper sample and the region information corresponding to each structure.
  • the construction unit 112 constructing a training data set corresponding to each structure based on the multiple structures of each paper sample and the region information corresponding to each structure includes:
  • the construction unit 112 determines each structure, the region information corresponding to each structure, and the paper category corresponding to each piece of region information as one structure sample, and merges structure samples with the same structure into the same set to obtain multiple first sets; for each first set, the construction unit 112 calculates the hash value of each structure sample from its region information; further, the construction unit 112 calculates, from the hash values, the similarity of any two structure samples in the first set and determines any two structure samples whose similarity is 1 as a target sample pair,
  • the construction unit 112 compares whether the paper categories of the target sample pair are the same, and if they are different, deletes the target sample pair from the first set to obtain a second set; further, the construction unit 112 counts the number of structure samples of each paper category in the second set and compares that number with a preset threshold, and when the number is less than the preset threshold, the construction unit 112 increases the number of structure samples of the corresponding paper category by a perturbation method until the number is greater than or equal to the preset threshold, obtaining the training data set.
  • the training unit 113 separately trains the training samples in each training data set to obtain a classification model corresponding to each structure.
  • the training unit 113 separately training the training samples in each training data set to obtain a classification model corresponding to each structure includes:
  • for each training data set, the training unit 113 randomly selects training samples, test samples, and validation samples from the training data set, and segments the sample information in the training samples to obtain the phrases of each piece of sample information; further, the training unit 113 one-hot encodes each phrase to obtain its encoding vector, and generates the phrase's position vector from its position number in the sample information; further, the training unit 113 concatenates the phrase's encoding vector and position vector to obtain the feature vector of each phrase of each piece of sample information, and models the feature vectors of each piece of sample information based on RoBERTa to obtain a learner; the training unit 113 inputs the test samples into the learner and calculates the proportion of test samples that pass the test, and when that proportion is less than a target value, the training unit 113 adjusts the learner according to the validation samples to obtain the classification model.
  • the training unit 113 adjusting the learner according to the validation samples to obtain the classification model includes:
  • the training unit 113 uses a hyperparameter grid search to obtain the optimal hyperparameter point from the validation samples; further, the training unit 113 adjusts the learner with the optimal hyperparameter point to obtain the classification model.
  • specifically, the training unit 113 splits the validation samples by a fixed step size to obtain a target subset, traverses the parameters at the two endpoints of the target subset, validates the learner with the endpoint parameters to obtain the learning rate of each parameter, determines the parameter with the best learning rate as the first hyperparameter point, and, within the neighborhood of the first hyperparameter point, shrinks the step size and continues traversing until the step size equals a preset step size; the hyperparameter point obtained at that point is the optimal hyperparameter point; further, the training unit 113 adjusts the learner according to the optimal hyperparameter point to obtain the classification model.
  • this application places no restriction on the preset step size.
  • the extracting unit 114 obtains the paper to be classified, and extracts text information from the paper to be classified according to the multiple structures to obtain text information corresponding to each structure.
  • the paper to be classified can be obtained from a request triggered by a user.
  • the extraction unit 114 extracts text information from the paper to be classified according to the multiple structures in the same manner as extracting the region information, which will not be repeated in this application.
  • the preprocessing unit 115 preprocesses the text information corresponding to each structure to obtain the input information of each structure.
  • the preprocessing unit 115 preprocessing the text information corresponding to each structure to obtain the input information of each structure includes:
  • for the text information corresponding to each structure, the preprocessing unit 115 segments the text information according to the preset phrases in a preset dictionary to obtain segmentation positions, and builds at least one directed acyclic graph from the segmentation positions; the preprocessing unit 115 calculates the probability of each directed acyclic graph according to the weights in the preset dictionary and determines the segmentation positions of the most probable graph as the target segmentation positions; the preprocessing unit 115 determines the first word segmentation according to the target segmentation positions; further, the preprocessing unit 115 filters the stop words in the first word segmentation with a stop-word list to obtain the second word segmentation.
  • the preprocessing unit 115 calculates the proportion of each second-segmentation word in the training data set and deletes the words whose proportion exceeds a configured value to obtain the third word segmentation.
  • the preprocessing unit 115 calculates the frequency of the third-segmentation words in the text information and sorts them from high to low frequency to obtain a queue.
  • the preprocessing unit 115 selects the first N characters from the queue as the input information, where N is a positive integer greater than 0.
  • the input unit 116 inputs the input information of each structure into the corresponding classification model to obtain the prediction result of each classification model for the paper to be classified.
  • after the input information of each structure is input into the corresponding classification model, the input unit 116 runs the classification model on the input information and takes the computed result as the prediction result.
  • the determining unit 117 determines the most frequent prediction result as the target result of the paper to be classified.
  • when every classification model's prediction result for the paper to be classified is the same, the entry unit 118 records the paper to be classified and the target result into the paper sample set.
  • when multiple prediction results tie for the largest count, the determination unit 117 determines the tied prediction results as results to be determined; further, the determination unit 117 obtains target samples, tests each classification model with the target samples, and calculates each model's target proportion, i.e. the proportion of target samples that pass the test; the determination unit 117 takes each classification model's target proportion as that model's weight, performs a weighted sum for each result to be determined according to the weights to obtain each result's prediction score, and determines the result to be determined with the highest prediction score as the target result.
  • for example, the prediction result of the title classification model is result A
  • the prediction result of the abstract classification model is result B
  • the prediction result of the introduction classification model is result C
  • the prediction result of the related-work classification model is result A
  • the prediction result of the main-body classification model is result B
  • the prediction result of the experimental-results classification model is result C
  • the prediction result of the conclusion classification model is result A
  • the prediction result of the references classification model is result B.
  • result A is predicted 3 times, result B 3 times, and result C 2 times, so two prediction results tie for the largest count and the results to be determined are A and B.
  • the target proportion of the title classification model is 0.8, that of the abstract classification model is 0.6, that of the introduction classification model is 0.5, and that of the related-work classification model is 0.8
  • the target proportion of the main-body classification model is 0.4
  • the target proportion of the experimental-results classification model is 0.7
  • the target proportion of the conclusion classification model is 0.8
  • the target proportion of the references classification model is 0.9.
  • each target proportion is used as a weight.
  • the prediction score of each result to be determined is obtained: result A scores 0.8 + 0.8 + 0.8 = 2.4 and result B scores 0.6 + 0.4 + 0.9 = 1.9.
  • the prediction score of result A is the highest, and result A is determined as the target result.
  • this application uses multiple classification models to predict the papers to be classified and can obtain accurate target results.
  • FIG. 3 it is a schematic diagram of the structure of an electronic device in a preferred embodiment of the present application for implementing a paper classification method based on a classification model.
  • the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and a computer program stored in the memory 12 and runnable on the processor 13, such as a classification-model-based paper classification program.
  • the schematic diagram is only an example of the electronic device 1 and does not constitute a limitation on the electronic device 1; the electronic device 1 may include more or fewer components than shown, or combine certain components, or have different components, and may also include input/output devices, network access devices, buses, and the like.
  • the processor 13 may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or any conventional processor; the processor 13 is the computing core and control center of the electronic device 1, connecting all parts of the electronic device 1 with various interfaces and lines, and executing the operating system of the electronic device 1 and the installed applications, program code, and so on.
  • the processor 13 executes the applications to implement the steps in each of the above embodiments of the classification-model-based paper classification method, for example the steps shown in FIG. 1.
  • the computer program may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 12 and executed by the processor 13 to complete this application.
  • the one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program in the electronic device 1.
  • the computer program can be divided into an acquisition unit 110, a processing unit 111, a construction unit 112, a training unit 113, an extraction unit 114, a preprocessing unit 115, an input unit 116, a determination unit 117, and an entry unit 118.
  • the memory 12 may be used to store the computer program and/or modules; the processor 13 realizes the various functions of the electronic device 1 by running or executing the computer program and/or modules stored in the memory 12 and by calling the data stored in the memory 12.
  • the memory 12 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and the applications required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the electronic device.
  • the memory 12 may include non-volatile memory, such as a hard disk, memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory in a physical form, such as a memory stick, a TF card (Trans-flash Card), and so on.
  • if the modules/units integrated in the electronic device 1 are implemented as software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium, which may be a non-volatile storage medium or a volatile storage medium.
  • the computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form.
  • the computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM, Read-Only Memory).
  • the memory 12 in the electronic device 1 stores multiple instructions to implement a classification-model-based paper classification method, and the processor 13 can execute the multiple instructions to: obtain a paper sample set;
  • perform structured processing on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure; based on the multiple structures of each paper sample and the region information corresponding to each structure, construct a training data set corresponding to each structure; separately train the training samples in each training data set to obtain a classification model corresponding to each structure; obtain a paper to be classified and extract text information from it according to the multiple structures to obtain the text information corresponding to each structure; preprocess the text information corresponding to each structure to obtain the input information of each structure; input the input information of each structure into the corresponding classification model
  • to obtain each classification model's prediction result for the paper to be classified; and determine the most frequent prediction result as the target result of the paper to be classified.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A classification-model-based paper classification method, relating to artificial intelligence. The method obtains a paper sample set; performs structured processing on the document information of all paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure; constructs a training data set corresponding to each structure; separately trains the training samples in each training data set to obtain a classification model corresponding to each structure; obtains a paper to be classified and extracts text information from it to obtain the text information corresponding to each structure; preprocesses the text information corresponding to each structure to obtain the input information of each structure; inputs the input information of each structure into the corresponding classification model to obtain each classification model's prediction result for the paper to be classified; and determines the most frequent prediction result as the target result of the paper to be classified. The method also relates to blockchain technology: the paper sample set can be stored in a blockchain.

Description

Classification-model-based paper classification method and apparatus, electronic device, and medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on April 30, 2020, with application number 202010368034.1 and the invention title "Classification-model-based paper classification method and apparatus, electronic device, and medium", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of data processing technology, and in particular to a classification-model-based paper classification method and apparatus, electronic device, and medium.
Background
At present, as the number of academic researchers grows, research output such as papers emerges in an endless stream, and the number of academic papers is growing explosively. Because different users study different fields, users consult the literature of their own research field when doing scientific research; classifying and tagging papers can therefore improve paper retrieval efficiency and speed up research. However, papers are generally long, and manual reading reduces retrieval efficiency, so methods for automatically classifying papers have emerged.
The inventor realized that existing classification-model-based paper classification schemes have the following problems: first, the accuracy of classification-model-based paper classification is low; second, such schemes can only process data in a fixed form of expression.
Summary
In view of the above, it is necessary to provide a classification-model-based paper classification method and apparatus, electronic device, and medium that predict the paper to be classified with multiple classification models and can thereby obtain an accurate target result.
The first aspect of this application provides a classification-model-based paper classification method, which includes:
obtaining a paper sample set;
performing structured processing on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure;
constructing a training data set corresponding to each structure based on the multiple structures of each paper sample and the region information corresponding to each structure;
separately training the training samples in each training data set to obtain a classification model corresponding to each structure;
obtaining a paper to be classified, and extracting text information from the paper to be classified according to the multiple structures to obtain the text information corresponding to each structure;
preprocessing the text information corresponding to each structure to obtain the input information of each structure;
inputting the input information of each structure into the corresponding classification model to obtain each classification model's prediction result for the paper to be classified;
determining the most frequent prediction result as the target result of the paper to be classified.
The second aspect of this application provides an electronic device, which includes a processor and a memory, the processor being configured to execute the computer-readable instructions stored in the memory to implement the following steps:
obtaining a paper sample set;
performing structured processing on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure;
constructing a training data set corresponding to each structure based on the multiple structures of each paper sample and the region information corresponding to each structure;
separately training the training samples in each training data set to obtain a classification model corresponding to each structure;
obtaining a paper to be classified, and extracting text information from the paper to be classified according to the multiple structures to obtain the text information corresponding to each structure;
preprocessing the text information corresponding to each structure to obtain the input information of each structure;
inputting the input information of each structure into the corresponding classification model to obtain each classification model's prediction result for the paper to be classified;
determining the most frequent prediction result as the target result of the paper to be classified.
The third aspect of this application provides a computer-readable storage medium storing at least one computer-readable instruction, the at least one computer-readable instruction being executed by a processor to implement the following steps:
obtaining a paper sample set;
performing structured processing on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure;
constructing a training data set corresponding to each structure based on the multiple structures of each paper sample and the region information corresponding to each structure;
separately training the training samples in each training data set to obtain a classification model corresponding to each structure;
obtaining a paper to be classified, and extracting text information from the paper to be classified according to the multiple structures to obtain the text information corresponding to each structure;
preprocessing the text information corresponding to each structure to obtain the input information of each structure;
inputting the input information of each structure into the corresponding classification model to obtain each classification model's prediction result for the paper to be classified;
determining the most frequent prediction result as the target result of the paper to be classified.
The fourth aspect of this application provides a classification-model-based paper classification apparatus, which includes:
an acquisition unit, used to obtain a paper sample set;
a processing unit, used to perform structured processing on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure;
a construction unit, used to construct a training data set corresponding to each structure based on the multiple structures of each paper sample and the region information corresponding to each structure;
a training unit, used to separately train the training samples in each training data set to obtain a classification model corresponding to each structure;
an extraction unit, used to obtain a paper to be classified and extract text information from the paper to be classified according to the multiple structures to obtain the text information corresponding to each structure;
a preprocessing unit, used to preprocess the text information corresponding to each structure to obtain the input information of each structure;
an input unit, used to input the input information of each structure into the corresponding classification model to obtain each classification model's prediction result for the paper to be classified;
a determination unit, used to determine the most frequent prediction result as the target result of the paper to be classified.
As can be seen from the above technical solutions, this application predicts the paper to be classified with multiple classification models and can thereby obtain an accurate target result.
Brief description of the drawings
Fig. 1 is a flowchart of a preferred embodiment of the classification-model-based paper classification method disclosed in this application.
Fig. 2 is a functional module diagram of a preferred embodiment of the classification-model-based paper classification apparatus disclosed in this application.
Fig. 3 is a schematic structural diagram of an electronic device of a preferred embodiment of this application for implementing the classification-model-based paper classification method.
Detailed description
To make the objectives, technical solutions, and advantages of this application clearer, this application is described in detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, it is a flowchart of a preferred embodiment of the classification-model-based paper classification method of this application. According to different needs, the order of the steps in the flowchart may be changed and some steps may be omitted.
This application relates to artificial intelligence. The classification-model-based paper classification method is applied to one or more electronic devices. An electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, and the like.
The electronic device may be any electronic product capable of human-computer interaction with a user, for example a personal computer, a tablet computer, a smartphone, a personal digital assistant (Personal Digital Assistant, PDA), a game console, an Internet Protocol television (Internet Protocol Television, IPTV), a smart wearable device, and the like.
The electronic device may also include a network device and/or user equipment. The network device includes, but is not limited to, a single network electronic device, an electronic device group composed of multiple network electronic devices, or a cloud composed of a large number of hosts or network electronic devices based on cloud computing (Cloud Computing).
The network where the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.
S10: Obtain a paper sample set.
In at least one embodiment of this application, the paper sample set contains multiple paper samples, and each paper sample includes a paper text and the paper category corresponding to that text. Further, each paper sample has been cleaned of garbled text and deduplicated.
In at least one embodiment of this application, the electronic device obtaining the paper sample set includes:
the electronic device scans and recognizes paper-version thesis books based on optical character recognition (Optical Character Recognition, OCR) technology to obtain scanned electronic thesis documents, and crawls electronic thesis documents from preset websites based on web crawler technology to obtain crawled electronic thesis documents; further, the electronic device cleans the scanned and crawled electronic thesis documents to obtain paper samples, and assembles the paper samples into the paper sample set.
The paper-version thesis books include papers and the paper categories corresponding to the papers; further, the information on the preset websites includes papers and the paper categories corresponding to the papers.
Through the above embodiment, multiple paper samples can be obtained comprehensively without being limited to books or website information.
In at least one embodiment of this application, all electronic thesis documents include the scanned electronic thesis documents and the crawled electronic thesis documents.
In at least one embodiment of this application, the electronic device cleaning the scanned and crawled electronic thesis documents to obtain the paper samples includes:
the electronic device traverses the text information in all electronic thesis documents and, when the traversed text information is garbled, deletes the document whose text is garbled and takes the retained documents as the target thesis documents; the electronic device calculates a hash value for each target thesis document from its title, extracts preset features from each target thesis document, and builds a feature index; further, the electronic device uses the cosine distance formula to calculate, from the hash values of every two target thesis documents, their similarity distance, obtaining the similarity distance of each document pair, where each document pair includes any two target thesis documents; it searches the feature index for document pairs whose similarity distance is greater than a preset value and determines each such pair as a similar document pair; still further, the electronic device judges whether the preset features in the similar document pair are the same, and when they are the same, deletes either document of the pair and determines the retained documents as the paper samples.
Through the above embodiment, electronic thesis documents with garbled text can be deleted, preventing them from affecting subsequent model training; in addition, duplicate thesis documents can be deleted, which not only reduces the electronic device's memory usage but also reduces the threads occupied by processing duplicate documents.
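The following is a minimal, illustrative Python sketch of this cleaning step; it is not part of the application. The garbled-text heuristic, the character-frequency vector standing in for the per-document hash, and the similarity threshold are all assumptions made for the example.

```python
# Illustrative sketch of the cleaning step: garbled-text removal followed by
# near-duplicate removal via a title-derived vector and cosine similarity.
from collections import Counter
import math

def is_garbled(text: str, threshold: float = 0.3) -> bool:
    """Treat a document as garbled when too many characters are
    non-printable or Unicode replacement characters (assumed heuristic)."""
    if not text:
        return True
    bad = sum(1 for ch in text if ch == "\ufffd" or not ch.isprintable())
    return bad / len(text) > threshold

def title_vector(title: str) -> Counter:
    """Stand-in for the per-document 'hash': a character-frequency
    vector computed from the title."""
    return Counter(title)

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def clean_samples(docs: list[dict], sim_threshold: float = 0.95) -> list[dict]:
    """docs: [{'title': ..., 'text': ..., 'category': ...}, ...]"""
    kept = [d for d in docs if not is_garbled(d["text"])]
    vectors = [title_vector(d["title"]) for d in kept]
    samples, dropped = [], set()
    for i in range(len(kept)):
        if i in dropped:
            continue
        for j in range(i + 1, len(kept)):
            if j in dropped:
                continue
            if cosine_similarity(vectors[i], vectors[j]) > sim_threshold:
                dropped.add(j)  # keep one document of each similar pair
        samples.append(kept[i])
    return samples
```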
It should be emphasized that, to further ensure the privacy and security of the paper sample set, the paper sample set may also be stored in a node of a blockchain.
S11: Perform structured processing on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure.
In at least one embodiment of this application, the multiple structures include the title, abstract, introduction, related work, main body, experimental results, conclusion, and references. Further, the region information refers to the text information of the paper sample under each structure.
In at least one embodiment of this application, the electronic device performing structured processing on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure includes:
for each paper sample, the electronic device searches the document information paragraph by paragraph for the tags corresponding to the multiple structures, the tags being title, abstract, introduction, related work, main body, experimental results, conclusion, and references; when a tag is found, the electronic device confirms the found tag as a structure and extracts the information corresponding to the found tag as the region information.
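As an illustration only, a simple Python sketch of this structuring step might look as follows; the tag spellings and the exact-match rule on paragraph text are assumptions.

```python
# Illustrative sketch of the structuring step: scan a paper paragraph by
# paragraph for the eight section tags and collect the text under each.
STRUCTURE_TAGS = ["title", "abstract", "introduction", "related work",
                  "main body", "experimental results", "conclusion",
                  "references"]

def structure_document(document: str) -> dict[str, str]:
    """Return {structure: region information} for each tag found."""
    regions: dict[str, list[str]] = {}
    current = None
    for paragraph in document.split("\n"):
        stripped = paragraph.strip().lower()
        if stripped in STRUCTURE_TAGS:        # paragraph is a section tag
            current = stripped
            regions[current] = []
        elif current is not None:             # body text under the last tag
            regions[current].append(paragraph)
    return {tag: "\n".join(text) for tag, text in regions.items()}
```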
S12: Based on the multiple structures of each paper sample and the region information corresponding to each structure, construct a training data set corresponding to each structure.
In at least one embodiment of this application, the electronic device constructing a training data set corresponding to each structure based on the multiple structures of each paper sample and the region information corresponding to each structure includes:
the electronic device determines each structure, the region information corresponding to it, and the paper category corresponding to each piece of region information as one structure sample, and merges structure samples with the same structure into the same set to obtain multiple first sets; for each first set, the electronic device calculates the hash value of each structure sample from its region information; further, the electronic device calculates, from the hash values, the similarity of any two structure samples in the first set and determines any two structure samples whose similarity is 1 as a target sample pair; the electronic device compares whether the paper categories of the target sample pair are the same, and if they are different, deletes the target sample pair from the first set to obtain a second set; still further, the electronic device counts the number of structure samples of each paper category in the second set and compares that number with a preset threshold, and when the number is less than the preset threshold, the electronic device increases the number of structure samples of the corresponding paper category by a perturbation method until the number is greater than or equal to the preset threshold, obtaining the training data set.
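A minimal Python sketch of this construction step is shown below, purely as an illustration. The MD5 hash used to detect identical region text and the swap-two-adjacent-words perturbation are assumptions; the application does not specify a concrete hash or perturbation.

```python
# Illustrative sketch of the training-set construction for one structure:
# identical region texts with conflicting paper categories are dropped, and
# under-represented categories are augmented by a perturbation.
import hashlib
import random
from collections import defaultdict

def text_hash(region: str) -> str:
    return hashlib.md5(region.encode("utf-8")).hexdigest()

def build_training_set(samples: list[tuple[str, str]],
                       min_per_category: int = 100):
    """samples: [(region_text, paper_category), ...] for one structure."""
    categories_per_hash = defaultdict(set)
    for region, category in samples:
        categories_per_hash[text_hash(region)].add(category)
    # drop target sample pairs: identical text, different categories
    kept = [(r, c) for r, c in samples
            if len(categories_per_hash[text_hash(r)]) == 1]
    by_category = defaultdict(list)
    for region, category in kept:
        by_category[category].append(region)
    for category, regions in by_category.items():
        while regions and len(regions) < min_per_category:
            words = random.choice(regions).split()  # perturb a sample
            if len(words) > 1:
                i = random.randrange(len(words) - 1)
                words[i], words[i + 1] = words[i + 1], words[i]
            regions.append(" ".join(words))
    return [(r, c) for c, rs in by_category.items() for r in rs]
```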
S13: Separately train the training samples in each training data set to obtain a classification model corresponding to each structure.
In at least one embodiment of this application, the electronic device separately training the training samples in each training data set to obtain a classification model corresponding to each structure includes:
for each training data set, the electronic device randomly selects training samples, test samples, and validation samples from the training data set, and segments the sample information in the training samples to obtain the phrases of each piece of sample information; further, the electronic device one-hot encodes each phrase to obtain its encoding vector, and generates the phrase's position vector from its position number in the sample information; still further, the electronic device concatenates the phrase's encoding vector and position vector to obtain the feature vector of each phrase of each piece of sample information, and models the feature vectors of each piece of sample information based on RoBERTa to obtain a learner; the electronic device inputs the test samples into the learner and calculates the proportion of test samples that pass the test, and when that proportion is less than a target value, the electronic device adjusts the learner according to the validation samples to obtain the classification model.
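For illustration, the feature construction described above (a one-hot encoding vector concatenated with a position vector) could be sketched as follows; the vocabulary, the maximum length, and the one-hot form of the position vector are assumptions, and a real system would feed such features into a RoBERTa-style encoder rather than use them directly.

```python
# Illustrative sketch of the feature construction: for each phrase, a
# one-hot encoding vector is concatenated with a position vector.
import numpy as np

def featurize(tokens: list[str], vocab: dict[str, int], max_len: int = 16):
    """Return one feature vector per token: [one-hot | one-hot position]."""
    features = []
    for pos, tok in enumerate(tokens[:max_len]):
        one_hot = np.zeros(len(vocab))
        if tok in vocab:
            one_hot[vocab[tok]] = 1.0
        position = np.zeros(max_len)
        position[pos] = 1.0               # position number as a one-hot slot
        features.append(np.concatenate([one_hot, position]))
    return np.stack(features)

vocab = {"paper": 0, "classification": 1, "model": 2}
print(featurize(["paper", "classification"], vocab).shape)  # (2, 19)
```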
In at least one embodiment of this application, the electronic device adjusting the learner according to the validation samples to obtain the classification model includes:
the electronic device uses a hyperparameter grid search to obtain the optimal hyperparameter point from the validation samples; further, the electronic device adjusts the learner with the optimal hyperparameter point to obtain the classification model.
Specifically, the electronic device splits the validation samples by a fixed step size to obtain a target subset, traverses the parameters at the two endpoints of the target subset, validates the learner with the endpoint parameters to obtain the learning rate of each parameter, determines the parameter with the best learning rate as the first hyperparameter point, and, within the neighborhood of the first hyperparameter point, shrinks the step size and continues traversing until the step size equals a preset step size; the hyperparameter point obtained at that point is the optimal hyperparameter point; still further, the electronic device adjusts the learner according to the optimal hyperparameter point to obtain the classification model.
This application places no restriction on the preset step size.
Through the above embodiment, a more accurate classification model can be obtained.
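A minimal sketch of such a coarse-to-fine search is given below, assuming an `evaluate` callback that returns a validation score for one hyperparameter value; the halving schedule is an assumption, since the application only requires that the step shrink until it reaches the preset step.

```python
# Illustrative coarse-to-fine hyperparameter search: evaluate the interval
# endpoints, keep the better one, shrink the step in its neighbourhood, and
# stop once the step reaches the preset step.
def coarse_to_fine_search(low, high, step, preset_step, evaluate):
    best = max((low, high), key=evaluate)         # best endpoint so far
    while step > preset_step:
        step /= 2.0                               # shrink the step size
        candidates = [best - step, best, best + step]
        best = max((c for c in candidates if low <= c <= high),
                   key=evaluate)
    return best

# e.g. searching a learning rate in [1e-5, 1e-2] with a toy score function
best_lr = coarse_to_fine_search(1e-5, 1e-2, 5e-3, 1e-5,
                                evaluate=lambda lr: -abs(lr - 3e-4))
print(best_lr)
```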
S14: Obtain a paper to be classified, and extract text information from the paper to be classified according to the multiple structures to obtain the text information corresponding to each structure.
In at least one embodiment of this application, the paper to be classified can be obtained from a request triggered by a user.
Further, the way the electronic device extracts text information from the paper to be classified according to the multiple structures is the same as the way the region information is extracted, and is not repeated here.
S15: Preprocess the text information corresponding to each structure to obtain the input information of each structure.
In at least one embodiment of this application, the electronic device preprocessing the text information corresponding to each structure to obtain the input information of each structure includes:
for the text information corresponding to each structure, the electronic device segments the text information according to the preset phrases in a preset dictionary to obtain segmentation positions, and builds at least one directed acyclic graph from the segmentation positions; the electronic device calculates the probability of each directed acyclic graph according to the weights in the preset dictionary and determines the segmentation positions of the most probable graph as the target segmentation positions; the electronic device determines the first word segmentation according to the target segmentation positions; further, the electronic device filters the stop words in the first word segmentation with a stop-word list to obtain the second word segmentation; the electronic device calculates the proportion of each second-segmentation word in the training data set and deletes the words whose proportion exceeds a configured value to obtain the third word segmentation; the electronic device calculates the frequency of the third-segmentation words in the text information and sorts them from high to low frequency to obtain a queue; the electronic device selects the first N characters from the queue as the input information, where N is a positive integer greater than 0.
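As an illustration of this preprocessing step: dictionary-based segmentation that builds directed acyclic graphs over segmentation positions and picks the most probable path is what the jieba library implements, so the sketch below uses it for the segmentation step; the stop-word list, the ratio cutoff, and N are assumptions.

```python
# Illustrative sketch of the preprocessing pipeline: segmentation, stop-word
# filtering, corpus-proportion filtering, frequency sorting, first N chars.
from collections import Counter
import jieba

STOP_WORDS = {"the", "of", "的", "了"}            # assumed stop-word list

def build_input(text: str, corpus_ratio: dict[str, float],
                max_ratio: float = 0.5, n: int = 512) -> str:
    words = [w for w in jieba.cut(text) if w.strip()]         # segmentation
    words = [w for w in words if w not in STOP_WORDS]         # stop words
    words = [w for w in words if corpus_ratio.get(w, 0.0) <= max_ratio]
    freq = Counter(words)                                     # word frequency
    queue = sorted(freq, key=freq.get, reverse=True)          # high to low
    return "".join(queue)[:n]                                 # first N chars
```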
S16: Input the input information of each structure into the corresponding classification model to obtain each classification model's prediction result for the paper to be classified.
In at least one embodiment of this application, after the input information of each structure is input into the corresponding classification model, the electronic device runs the classification model on the input information and takes the computed result as the prediction result.
S17: Determine the most frequent prediction result as the target result of the paper to be classified.
In at least one embodiment of this application, the method further includes:
(1) when every classification model's prediction result for the paper to be classified is the same, the electronic device records the paper to be classified and the target result into the paper sample set.
(2) when multiple prediction results tie for the largest count, the electronic device determines the tied prediction results as results to be determined; further, the electronic device obtains target samples, tests each classification model with the target samples, and calculates each model's target proportion, i.e. the proportion of target samples that pass the test; the electronic device takes each classification model's target proportion as that model's weight, performs a weighted sum for each result to be determined according to the weights to obtain each result's prediction score, and determines the result to be determined with the highest prediction score as the target result.
For example: the title model predicts result A, the abstract model result B, the introduction model result C, the related-work model result A, the main-body model result B, the experimental-results model result C, the conclusion model result A, and the references model result B. Result A is predicted 3 times, result B 3 times, and result C 2 times, so two prediction results tie for the largest count and the results to be determined are A and B. Target samples are obtained and used to test each classification model, giving target proportions of 0.8 for the title model, 0.6 for the abstract model, 0.5 for the introduction model, 0.8 for the related-work model, 0.4 for the main-body model, 0.7 for the experimental-results model, 0.8 for the conclusion model, and 0.9 for the references model. Taking each target proportion as a weight, the prediction score of result A is 0.8 + 0.8 + 0.8 = 2.4 and that of result B is 0.6 + 0.4 + 0.9 = 1.9, so result A has the highest prediction score and is determined as the target result.
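The decision rule above can be sketched as follows; this is only an illustration, with the structure names and weights taken from the worked example.

```python
# Illustrative decision rule: majority vote over the per-structure
# predictions, with the models' test pass rates used to break a tie.
from collections import Counter

def decide(predictions: dict[str, str], weights: dict[str, float]) -> str:
    """predictions: {structure: result}; weights: {structure: pass rate}."""
    counts = Counter(predictions.values())
    top = counts.most_common(1)[0][1]
    tied = [r for r, c in counts.items() if c == top]
    if len(tied) == 1:
        return tied[0]                            # unique majority result
    scores = {r: sum(w for s, w in weights.items() if predictions[s] == r)
              for r in tied}                      # weighted sum per result
    return max(scores, key=scores.get)

predictions = {"title": "A", "abstract": "B", "introduction": "C",
               "related work": "A", "main body": "B",
               "experimental results": "C", "conclusion": "A",
               "references": "B"}
weights = {"title": 0.8, "abstract": 0.6, "introduction": 0.5,
           "related work": 0.8, "main body": 0.4,
           "experimental results": 0.7, "conclusion": 0.8,
           "references": 0.9}
print(decide(predictions, weights))   # A (2.4) beats B (1.9)
```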
As can be seen from the above technical solutions, this application predicts the paper to be classified with multiple classification models and can thereby obtain an accurate target result.
The above are only specific embodiments of this application, but the protection scope of this application is not limited thereto; those of ordinary skill in the art may make improvements without departing from the inventive concept of this application, and these all fall within the protection scope of this application.
As shown in Fig. 2, it is a functional module diagram of a preferred embodiment of the classification-model-based paper classification apparatus of this application. The classification-model-based paper classification apparatus 11 includes an acquisition unit 110, a processing unit 111, a construction unit 112, a training unit 113, an extraction unit 114, a preprocessing unit 115, an input unit 116, a determination unit 117, and an entry unit 118. A module/unit referred to in this application is a series of computer program segments that can be executed by the processor 13 and complete fixed functions, stored in the memory 12. In this embodiment, the functions of each module/unit are detailed in the following embodiments.
The acquisition unit 110 obtains a paper sample set.
It should be emphasized that, to further ensure the privacy and security of the paper sample set, the paper sample set may also be stored in a node of a blockchain.
In at least one embodiment of this application, the paper sample set contains multiple paper samples, and each paper sample includes a paper text and the paper category corresponding to that text. Further, each paper sample has been cleaned of garbled text and deduplicated.
In at least one embodiment of this application, the acquisition unit 110 obtaining the paper sample set includes:
the acquisition unit 110 scans and recognizes paper-version thesis books based on optical character recognition (Optical Character Recognition, OCR) technology to obtain scanned electronic thesis documents, and crawls electronic thesis documents from preset websites based on web crawler technology to obtain crawled electronic thesis documents; further, the acquisition unit 110 cleans the scanned and crawled electronic thesis documents to obtain paper samples, and assembles the paper samples into the paper sample set.
The paper-version thesis books include papers and the paper categories corresponding to the papers; further, the information on the preset websites includes papers and the paper categories corresponding to the papers.
Through the above embodiment, multiple paper samples can be obtained comprehensively without being limited to books or website information.
In at least one embodiment of this application, all electronic thesis documents include the scanned electronic thesis documents and the crawled electronic thesis documents.
In at least one embodiment of this application, the acquisition unit 110 cleaning the scanned and crawled electronic thesis documents to obtain the paper samples includes:
the acquisition unit 110 traverses the text information in all electronic thesis documents and, when the traversed text information is garbled, deletes the document whose text is garbled and takes the retained documents as the target thesis documents; the acquisition unit 110 calculates a hash value for each target thesis document from its title, extracts preset features from each target thesis document, and builds a feature index; further, the acquisition unit 110 uses the cosine distance formula to calculate, from the hash values of every two target thesis documents, their similarity distance, obtaining the similarity distance of each document pair, where each document pair includes any two target thesis documents; it searches the feature index for document pairs whose similarity distance is greater than a preset value and determines each such pair as a similar document pair; still further, the acquisition unit 110 judges whether the preset features in the similar document pair are the same, and when they are the same, deletes either document of the pair and determines the retained documents as the paper samples.
Through the above embodiment, electronic thesis documents with garbled text can be deleted, preventing them from affecting subsequent model training; in addition, duplicate thesis documents can be deleted, which not only reduces the electronic device's memory usage but also reduces the threads occupied by processing duplicate documents.
The processing unit 111 performs structured processing on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure.
In at least one embodiment of this application, the multiple structures include the title, abstract, introduction, related work, main body, experimental results, conclusion, and references. Further, the region information refers to the text information of the paper sample under each structure.
In at least one embodiment of this application, the processing unit 111 performing structured processing on the document information of all the paper samples in the paper sample set to obtain multiple structures of each paper sample and the region information corresponding to each structure includes:
for each paper sample, the processing unit 111 searches the document information paragraph by paragraph for the tags corresponding to the multiple structures, the tags being title, abstract, introduction, related work, main body, experimental results, conclusion, and references; when a tag is found, the processing unit 111 confirms the found tag as a structure and extracts the information corresponding to the found tag as the region information.
The construction unit 112 constructs a training data set corresponding to each structure based on the multiple structures of each paper sample and the region information corresponding to each structure.
In at least one embodiment of this application, the construction unit 112 constructing a training data set corresponding to each structure based on the multiple structures of each paper sample and the region information corresponding to each structure includes:
the construction unit 112 determines each structure, the region information corresponding to it, and the paper category corresponding to each piece of region information as one structure sample, and merges structure samples with the same structure into the same set to obtain multiple first sets; for each first set, the construction unit 112 calculates the hash value of each structure sample from its region information; further, the construction unit 112 calculates, from the hash values, the similarity of any two structure samples in the first set and determines any two structure samples whose similarity is 1 as a target sample pair; the construction unit 112 compares whether the paper categories of the target sample pair are the same, and if they are different, deletes the target sample pair from the first set to obtain a second set; still further, the construction unit 112 counts the number of structure samples of each paper category in the second set and compares that number with a preset threshold, and when the number is less than the preset threshold, the construction unit 112 increases the number of structure samples of the corresponding paper category by a perturbation method until the number is greater than or equal to the preset threshold, obtaining the training data set.
The training unit 113 separately trains the training samples in each training data set to obtain a classification model corresponding to each structure.
In at least one embodiment of this application, the training unit 113 separately training the training samples in each training data set to obtain a classification model corresponding to each structure includes:
for each training data set, the training unit 113 randomly selects training samples, test samples, and validation samples from the training data set, and segments the sample information in the training samples to obtain the phrases of each piece of sample information; further, the training unit 113 one-hot encodes each phrase to obtain its encoding vector, and generates the phrase's position vector from its position number in the sample information; still further, the training unit 113 concatenates the phrase's encoding vector and position vector to obtain the feature vector of each phrase of each piece of sample information, and models the feature vectors of each piece of sample information based on RoBERTa to obtain a learner; the training unit 113 inputs the test samples into the learner and calculates the proportion of test samples that pass the test, and when that proportion is less than a target value, the training unit 113 adjusts the learner according to the validation samples to obtain the classification model.
In at least one embodiment of this application, the training unit 113 adjusting the learner according to the validation samples to obtain the classification model includes:
the training unit 113 uses a hyperparameter grid search to obtain the optimal hyperparameter point from the validation samples; further, the training unit 113 adjusts the learner with the optimal hyperparameter point to obtain the classification model.
Specifically, the training unit 113 splits the validation samples by a fixed step size to obtain a target subset, traverses the parameters at the two endpoints of the target subset, validates the learner with the endpoint parameters to obtain the learning rate of each parameter, determines the parameter with the best learning rate as the first hyperparameter point, and, within the neighborhood of the first hyperparameter point, shrinks the step size and continues traversing until the step size equals a preset step size; the hyperparameter point obtained at that point is the optimal hyperparameter point; still further, the training unit 113 adjusts the learner according to the optimal hyperparameter point to obtain the classification model.
This application places no restriction on the preset step size.
Through the above embodiment, a more accurate classification model can be obtained.
提取单元114获取待分类论文,并根据所述多个结构从所述待分类论文中提取文本信息,得到每个结构对应的文本信息。
在本申请的至少一个实施例中,所述待分类论文可以从用户触发的请求中获取。
进一步地,所述提取单元114根据所述多个结构从所述待分类论文中提取文本信息的方式与提取所述区域信息的方式相同,本申请对此不再赘述。
The preprocessing unit 115 preprocesses the text information corresponding to each structure to obtain the input information of each structure.
In at least one embodiment of this application, the preprocessing unit 115 preprocessing the text information corresponding to each structure to obtain the input information of each structure includes:
For the text information corresponding to each structure, the preprocessing unit 115 segments the text information according to the preset word groups in a preset dictionary to obtain segmentation positions, and constructs at least one directed acyclic graph from the segmentation positions. The preprocessing unit 115 computes the probability of each directed acyclic graph according to the weights in the preset dictionary, determines the segmentation positions corresponding to the directed acyclic graph with the highest probability as target segmentation positions, and determines first tokens according to the target segmentation positions. Further, the preprocessing unit 115 filters stop words out of the first tokens according to a stop word list to obtain second tokens; it computes the proportion of each second token in the training data set and deletes the second tokens whose proportion is greater than a configured value, obtaining third tokens. The preprocessing unit 115 computes the term frequency of the third tokens in the text information, sorts the third tokens from high to low by term frequency to obtain a queue, and selects the first N characters from the queue as the input information, N being a positive integer.
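The dictionary-plus-DAG segmentation described here is the approach implemented by segmenters such as jieba; assuming that library, a minimal sketch of the whole preprocessing chain follows, with the stop word list, the document-frequency table, and the thresholds as illustrative assumptions:

    from collections import Counter

    import jieba  # dictionary + DAG + max-probability-path segmenter

    STOP_WORDS = {"的", "了", "在", "是"}  # assumed stop word list

    def preprocess(text: str, train_doc_freq: dict[str, float],
                   max_ratio: float = 0.5, top_n: int = 128) -> str:
        first = list(jieba.cut(text))                        # first tokens
        second = [w for w in first if w not in STOP_WORDS]   # second tokens
        third = [w for w in second
                 if train_doc_freq.get(w, 0.0) <= max_ratio]  # third tokens
        queue = [w for w, _ in Counter(third).most_common()]  # sort by TF
        return "".join(queue)[:top_n]                         # first N chars

    # Hypothetical usage; train_doc_freq would come from the training data set.
    info = preprocess("基于分类模型的论文分类方法", train_doc_freq={}, top_n=64)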
The input unit 116 inputs the input information of each structure into the corresponding classification model, obtaining each classification model's prediction result for the paper to be classified.
In at least one embodiment of this application, after the input information of each structure is input into the corresponding classification model, the input unit 116 performs computation with the classification model based on the input information and takes the computed result as the prediction result.
The determination unit 117 determines the prediction result with the highest count as the target result of the paper to be classified.
In at least one embodiment of this application, when the prediction results of all classification models for the paper to be classified are consistent, the entry unit 118 enters the paper to be classified and the target result into the paper sample set.
In at least one embodiment of this application, when there are multiple prediction results tied for the highest count, the determination unit 117 determines those prediction results as results to be determined. Further, the determination unit 117 acquires target samples, tests each classification model with the target samples, and computes the target proportion of target samples that pass the test for each model. The determination unit 117 takes the target proportion of each classification model as the weight of that classification model and performs a weighted-sum operation on each result to be determined according to the weights, obtaining the prediction score of each result to be determined. Further, the determination unit 117 determines the result to be determined with the highest prediction score as the target result.
For example: the prediction result of the title classification model is result A, the prediction result of the abstract classification model is result B, the prediction result of the introduction classification model is result C, the prediction result of the related-work classification model is result A, the prediction result of the body-text classification model is result B, the prediction result of the experimental-results classification model is result C, the prediction result of the conclusion classification model is result A, and the prediction result of the references classification model is result B. After counting, there are 3 predictions of result A, 3 predictions of result B, and 2 predictions of result C, so there are 2 prediction results tied for the highest count; that is, the results to be determined are result A and result B. Target samples are acquired and used to test each classification model, yielding a target proportion of 0.8 for the title classification model, 0.6 for the abstract classification model, 0.5 for the introduction classification model, 0.8 for the related-work classification model, 0.4 for the body-text classification model, 0.7 for the experimental-results classification model, 0.8 for the conclusion classification model, and 0.9 for the references classification model. Taking each target proportion as a weight, the prediction score of each result to be determined is computed: the prediction score of result A is 0.8 + 0.8 + 0.8 = 2.4, and the prediction score of result B is 0.6 + 0.4 + 0.9 = 1.9. Result A has the highest prediction score, so result A is determined as the target result.
As can be seen from the above technical solution, this application predicts the paper to be classified through multiple classification models and can thereby obtain an accurate target result.
As shown in FIG. 3, it is a schematic structural diagram of an electronic device implementing a preferred embodiment of the classification-model-based paper classification method of this application.
In an embodiment of this application, the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and a computer program stored in the memory 12 and runnable on the processor 13, for example a classification-model-based paper classification program.
A person skilled in the art can understand that the schematic diagram is only an example of the electronic device 1 and does not constitute a limitation on the electronic device 1; it may include more or fewer components than shown, a combination of certain components, or different components. For example, the electronic device 1 may also include input/output devices, network access devices, buses, and the like.
The processor 13 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor 13 is the computing core and control center of the electronic device 1; it connects the parts of the entire electronic device 1 using various interfaces and lines, and executes the operating system of the electronic device 1 as well as the installed applications, program code, and so on.
The processor 13 executes the operating system of the electronic device 1 and the installed applications. The processor 13 executes the applications to implement the steps in the above embodiments of the classification-model-based paper classification method, for example the steps shown in FIG. 1.
Exemplarily, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to complete this application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution process of the computer program in the electronic device 1. For example, the computer program may be divided into the acquisition unit 110, the processing unit 111, the construction unit 112, the training unit 113, the extraction unit 114, the preprocessing unit 115, the input unit 116, the determination unit 117, and the entry unit 118.
The memory 12 may be used to store the computer program and/or modules; the processor 13 implements the various functions of the electronic device 1 by running or executing the computer programs and/or modules stored in the memory 12 and invoking the data stored in the memory 12. The memory 12 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and the applications required for at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to the use of the electronic device. In addition, the memory 12 may include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
The memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory in physical form, such as a memory stick, a TF card (Trans-flash Card), and the like.
If the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium, which may be non-volatile or volatile. Based on this understanding, this application may implement all or part of the processes in the methods of the above embodiments by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, the computer program can implement the steps of each of the above method embodiments.
The computer program includes computer program code, which may be in source code form, object code form, an executable file, certain intermediate forms, and so on. The computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM).
With reference to FIG. 1, the memory 12 in the electronic device 1 stores multiple instructions to implement a classification-model-based paper classification method, and the processor 13 can execute the multiple instructions to implement: acquiring a paper sample set; performing structuring processing on the document information of all paper samples in the paper sample set to obtain the multiple structures of each paper sample and the region information corresponding to each structure; constructing, based on the multiple structures of each paper sample and the region information corresponding to each structure, a training data set corresponding to each structure; separately training the training samples in each training data set to obtain the classification model corresponding to each structure; acquiring a paper to be classified and extracting text information from the paper to be classified according to the multiple structures, obtaining the text information corresponding to each structure; preprocessing the text information corresponding to each structure to obtain the input information of each structure; inputting the input information of each structure into the corresponding classification model, obtaining each classification model's prediction result for the paper to be classified; and determining the prediction result with the highest count as the target result of the paper to be classified.
Specifically, for the specific implementation method of the above instructions by the processor 13, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, a string of data blocks generated in association using cryptographic methods; each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative; for example, the division of the modules is only a logical function division, and there may be other ways of division in actual implementation.
The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, or each unit may physically exist separately, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
Therefore, from whatever point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than by the above description, and it is therefore intended that all changes falling within the meaning and scope of equivalent elements of the claims be included in this application. No reference sign in the claims should be regarded as limiting the claim concerned.
Furthermore, it is clear that the word "comprise" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or apparatuses recited in the system claims may also be implemented by one unit or apparatus through software or hardware. Words such as "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solution of this application. Although this application has been described in detail with reference to preferred embodiments, a person of ordinary skill in the art should understand that the technical solution of this application can be modified or equivalently replaced without departing from the spirit and scope of the technical solution of this application.

Claims (20)

  1. A classification-model-based paper classification method, wherein the classification-model-based paper classification method comprises:
    acquiring a paper sample set;
    performing structuring processing on document information of all paper samples in the paper sample set to obtain multiple structures of each paper sample and region information corresponding to each structure;
    constructing, based on the multiple structures of each paper sample and the region information corresponding to each structure, a training data set corresponding to each structure;
    separately training training samples in each training data set to obtain a classification model corresponding to each structure;
    acquiring a paper to be classified, and extracting text information from the paper to be classified according to the multiple structures to obtain text information corresponding to each structure;
    preprocessing the text information corresponding to each structure to obtain input information of each structure;
    inputting the input information of each structure into the corresponding classification model to obtain each classification model's prediction result for the paper to be classified; and
    determining the prediction result with the highest count as a target result of the paper to be classified.
  2. The classification-model-based paper classification method according to claim 1, wherein the acquiring a paper sample set comprises:
    scanning and recognizing paper-format paper books based on optical character recognition technology to obtain scanned electronic paper documents;
    crawling electronic paper documents from preset websites based on web crawler technology to obtain crawled electronic paper documents;
    performing cleaning processing on the scanned electronic paper documents and the crawled electronic paper documents to obtain paper samples; and
    combining the paper samples into the paper sample set.
  3. The classification-model-based paper classification method according to claim 2, wherein the performing cleaning processing on the scanned electronic paper documents and the crawled electronic paper documents to obtain paper samples comprises:
    traversing text information in all electronic paper documents;
    when the traversed text information is garbled, deleting the electronic paper document whose text information is garbled, and taking the retained electronic paper documents as target paper documents;
    computing a hash value of each target paper document according to a title of each paper document among the target paper documents;
    extracting preset features from each target paper document and building a feature index;
    computing, according to the hash values of every two target paper documents, a similarity distance of the two target paper documents using a cosine distance formula to obtain a similarity distance of each paper document pair, wherein each paper document pair comprises any two target paper documents;
    searching out, through the feature index, paper document pairs whose similarity distance is greater than a preset value, and determining the paper document pairs as similar paper document pairs;
    judging whether the preset features in a similar paper document pair are the same; and
    when the preset features in the similar paper document pair are the same, deleting either one of the paper documents in the similar paper document pair, and determining the retained paper documents as the paper samples.
  4. The classification-model-based paper classification method according to claim 1, wherein the constructing, based on the multiple structures of each paper sample and the region information corresponding to each structure, a training data set corresponding to each structure comprises:
    determining each structure, the region information corresponding to that structure, and a paper category corresponding to that region information as one structure sample;
    integrating structure samples with the same structure into the same set to obtain multiple first sets;
    for each first set, computing a hash value of each structure sample based on the region information;
    computing, according to the hash values, a similarity of any two structure samples in the first set, and determining any two structure samples whose similarity is 1 as a target sample pair;
    comparing whether paper categories of the target sample pair are the same, and if the paper categories of the target sample pair are different, deleting the target sample pair from the first set to obtain a second set;
    counting a number of structure samples of each paper category in the second set, and comparing whether the number is less than a preset threshold; and
    when the number is less than the preset threshold, increasing the number of structure samples of the paper category corresponding to the number through a perturbation method until the number of structure samples is greater than or equal to the preset threshold, obtaining the training data set.
  5. The classification-model-based paper classification method according to claim 1, wherein the separately training training samples in each training data set to obtain a classification model corresponding to each structure comprises:
    for each training data set, randomly selecting training samples, test samples, and validation samples from the training data set;
    performing word segmentation on sample information in the training samples to obtain word groups of each piece of sample information;
    performing one-hot encoding on the word groups to obtain encoding vectors of the word groups;
    generating position vectors of the word groups according to position numbers of the word groups in the sample information;
    concatenating the encoding vectors of the word groups and the position vectors of the word groups to obtain feature vectors of the word groups of each piece of sample information;
    modeling the feature vectors of each piece of sample information based on the RoBERTa technique to obtain a learner;
    inputting the test samples into the learner, and computing a test proportion of test samples that pass the test; and
    when the test proportion is less than a target value, adjusting the learner according to the validation samples to obtain the classification model.
  6. The classification-model-based paper classification method according to claim 1, wherein the preprocessing the text information corresponding to each structure to obtain input information of each structure comprises:
    for the text information corresponding to each structure, segmenting the text information according to preset word groups in a preset dictionary to obtain segmentation positions;
    constructing at least one directed acyclic graph from the segmentation positions;
    computing a probability of each directed acyclic graph according to weights in the preset dictionary;
    determining the segmentation positions corresponding to the directed acyclic graph with the highest probability as target segmentation positions;
    determining first tokens according to the target segmentation positions;
    filtering stop words out of the first tokens according to a stop word list to obtain second tokens;
    computing a proportion of each second token in the training data set;
    deleting second tokens whose proportion is greater than a configured value to obtain third tokens;
    computing a term frequency of the third tokens in the text information, and sorting the third tokens from high to low by term frequency to obtain a queue; and
    selecting the first N characters from the queue as the input information, N being a positive integer.
  7. The classification-model-based paper classification method according to claim 1, wherein the paper sample set is stored in a blockchain, and the method further comprises:
    when the prediction results of all classification models for the paper to be classified are consistent, entering the paper to be classified and the target result into the paper sample set; or
    when there are multiple prediction results tied for the highest count, determining those prediction results as results to be determined, acquiring target samples, testing each classification model with the target samples, computing a target proportion of target samples that pass the test for each model, taking the target proportion of each classification model as a weight of that classification model, performing a weighted-sum operation on each result to be determined according to the weights to obtain a prediction score of each result to be determined, and determining the result to be determined with the highest prediction score as the target result.
  8. An electronic device, wherein the electronic device comprises a processor and a memory, and the processor is configured to execute at least one computer-readable instruction stored in the memory to implement the following steps:
    acquiring a paper sample set;
    performing structuring processing on document information of all paper samples in the paper sample set to obtain multiple structures of each paper sample and region information corresponding to each structure;
    constructing, based on the multiple structures of each paper sample and the region information corresponding to each structure, a training data set corresponding to each structure;
    separately training training samples in each training data set to obtain a classification model corresponding to each structure;
    acquiring a paper to be classified, and extracting text information from the paper to be classified according to the multiple structures to obtain text information corresponding to each structure;
    preprocessing the text information corresponding to each structure to obtain input information of each structure;
    inputting the input information of each structure into the corresponding classification model to obtain each classification model's prediction result for the paper to be classified; and
    determining the prediction result with the highest count as a target result of the paper to be classified.
  9. The electronic device according to claim 8, wherein, in the acquiring a paper sample set, the processor executes the at least one computer-readable instruction to implement the following steps:
    scanning and recognizing paper-format paper books based on optical character recognition technology to obtain scanned electronic paper documents;
    crawling electronic paper documents from preset websites based on web crawler technology to obtain crawled electronic paper documents;
    performing cleaning processing on the scanned electronic paper documents and the crawled electronic paper documents to obtain paper samples; and
    combining the paper samples into the paper sample set.
  10. The electronic device according to claim 9, wherein, in the performing cleaning processing on the scanned electronic paper documents and the crawled electronic paper documents to obtain paper samples, the processor executes the at least one computer-readable instruction to implement the following steps:
    traversing text information in all electronic paper documents;
    when the traversed text information is garbled, deleting the electronic paper document whose text information is garbled, and taking the retained electronic paper documents as target paper documents;
    computing a hash value of each target paper document according to a title of each paper document among the target paper documents;
    extracting preset features from each target paper document and building a feature index;
    computing, according to the hash values of every two target paper documents, a similarity distance of the two target paper documents using a cosine distance formula to obtain a similarity distance of each paper document pair, wherein each paper document pair comprises any two target paper documents;
    searching out, through the feature index, paper document pairs whose similarity distance is greater than a preset value, and determining the paper document pairs as similar paper document pairs;
    judging whether the preset features in a similar paper document pair are the same; and
    when the preset features in the similar paper document pair are the same, deleting either one of the paper documents in the similar paper document pair, and determining the retained paper documents as the paper samples.
  11. The electronic device according to claim 8, wherein, in the constructing, based on the multiple structures of each paper sample and the region information corresponding to each structure, a training data set corresponding to each structure, the processor executes the at least one computer-readable instruction further to implement the following steps:
    determining each structure, the region information corresponding to that structure, and a paper category corresponding to that region information as one structure sample;
    integrating structure samples with the same structure into the same set to obtain multiple first sets;
    for each first set, computing a hash value of each structure sample based on the region information;
    computing, according to the hash values, a similarity of any two structure samples in the first set, and determining any two structure samples whose similarity is 1 as a target sample pair;
    comparing whether paper categories of the target sample pair are the same, and if the paper categories of the target sample pair are different, deleting the target sample pair from the first set to obtain a second set;
    counting a number of structure samples of each paper category in the second set, and comparing whether the number is less than a preset threshold; and
    when the number is less than the preset threshold, increasing the number of structure samples of the paper category corresponding to the number through a perturbation method until the number of structure samples is greater than or equal to the preset threshold, obtaining the training data set.
  12. The electronic device according to claim 8, wherein, in the separately training training samples in each training data set to obtain a classification model corresponding to each structure, the processor executes the at least one computer-readable instruction to implement the following steps:
    for each training data set, randomly selecting training samples, test samples, and validation samples from the training data set;
    performing word segmentation on sample information in the training samples to obtain word groups of each piece of sample information;
    performing one-hot encoding on the word groups to obtain encoding vectors of the word groups;
    generating position vectors of the word groups according to position numbers of the word groups in the sample information;
    concatenating the encoding vectors of the word groups and the position vectors of the word groups to obtain feature vectors of the word groups of each piece of sample information;
    modeling the feature vectors of each piece of sample information based on the RoBERTa technique to obtain a learner;
    inputting the test samples into the learner, and computing a test proportion of test samples that pass the test; and
    when the test proportion is less than a target value, adjusting the learner according to the validation samples to obtain the classification model.
  13. The electronic device according to claim 8, wherein, in the preprocessing the text information corresponding to each structure to obtain input information of each structure, the processor executes the at least one computer-readable instruction to implement the following steps:
    for the text information corresponding to each structure, segmenting the text information according to preset word groups in a preset dictionary to obtain segmentation positions;
    constructing at least one directed acyclic graph from the segmentation positions;
    computing a probability of each directed acyclic graph according to weights in the preset dictionary;
    determining the segmentation positions corresponding to the directed acyclic graph with the highest probability as target segmentation positions;
    determining first tokens according to the target segmentation positions;
    filtering stop words out of the first tokens according to a stop word list to obtain second tokens;
    computing a proportion of each second token in the training data set;
    deleting second tokens whose proportion is greater than a configured value to obtain third tokens;
    computing a term frequency of the third tokens in the text information, and sorting the third tokens from high to low by term frequency to obtain a queue; and
    selecting the first N characters from the queue as the input information, N being a positive integer.
  14. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one computer-readable instruction, and the at least one computer-readable instruction, when executed by a processor, implements the following steps:
    acquiring a paper sample set;
    performing structuring processing on document information of all paper samples in the paper sample set to obtain multiple structures of each paper sample and region information corresponding to each structure;
    constructing, based on the multiple structures of each paper sample and the region information corresponding to each structure, a training data set corresponding to each structure;
    separately training training samples in each training data set to obtain a classification model corresponding to each structure;
    acquiring a paper to be classified, and extracting text information from the paper to be classified according to the multiple structures to obtain text information corresponding to each structure;
    preprocessing the text information corresponding to each structure to obtain input information of each structure;
    inputting the input information of each structure into the corresponding classification model to obtain each classification model's prediction result for the paper to be classified; and
    determining the prediction result with the highest count as a target result of the paper to be classified.
  15. The storage medium according to claim 14, wherein, in the acquiring a paper sample set, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    scanning and recognizing paper-format paper books based on optical character recognition technology to obtain scanned electronic paper documents;
    crawling electronic paper documents from preset websites based on web crawler technology to obtain crawled electronic paper documents;
    performing cleaning processing on the scanned electronic paper documents and the crawled electronic paper documents to obtain paper samples; and
    combining the paper samples into the paper sample set.
  16. The storage medium according to claim 15, wherein, in the performing cleaning processing on the scanned electronic paper documents and the crawled electronic paper documents to obtain paper samples, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    traversing text information in all electronic paper documents;
    when the traversed text information is garbled, deleting the electronic paper document whose text information is garbled, and taking the retained electronic paper documents as target paper documents;
    computing a hash value of each target paper document according to a title of each paper document among the target paper documents;
    extracting preset features from each target paper document and building a feature index;
    computing, according to the hash values of every two target paper documents, a similarity distance of the two target paper documents using a cosine distance formula to obtain a similarity distance of each paper document pair, wherein each paper document pair comprises any two target paper documents;
    searching out, through the feature index, paper document pairs whose similarity distance is greater than a preset value, and determining the paper document pairs as similar paper document pairs;
    judging whether the preset features in a similar paper document pair are the same; and
    when the preset features in the similar paper document pair are the same, deleting either one of the paper documents in the similar paper document pair, and determining the retained paper documents as the paper samples.
  17. The storage medium according to claim 14, wherein, in the constructing, based on the multiple structures of each paper sample and the region information corresponding to each structure, a training data set corresponding to each structure, the at least one computer-readable instruction is executed by the processor further to implement the following steps:
    determining each structure, the region information corresponding to that structure, and a paper category corresponding to that region information as one structure sample;
    integrating structure samples with the same structure into the same set to obtain multiple first sets;
    for each first set, computing a hash value of each structure sample based on the region information;
    computing, according to the hash values, a similarity of any two structure samples in the first set, and determining any two structure samples whose similarity is 1 as a target sample pair;
    comparing whether paper categories of the target sample pair are the same, and if the paper categories of the target sample pair are different, deleting the target sample pair from the first set to obtain a second set;
    counting a number of structure samples of each paper category in the second set, and comparing whether the number is less than a preset threshold; and
    when the number is less than the preset threshold, increasing the number of structure samples of the paper category corresponding to the number through a perturbation method until the number of structure samples is greater than or equal to the preset threshold, obtaining the training data set.
  18. The storage medium according to claim 14, wherein, in the separately training training samples in each training data set to obtain a classification model corresponding to each structure, the at least one computer-readable instruction is executed by the processor further to implement the following steps:
    for each training data set, randomly selecting training samples, test samples, and validation samples from the training data set;
    performing word segmentation on sample information in the training samples to obtain word groups of each piece of sample information;
    performing one-hot encoding on the word groups to obtain encoding vectors of the word groups;
    generating position vectors of the word groups according to position numbers of the word groups in the sample information;
    concatenating the encoding vectors of the word groups and the position vectors of the word groups to obtain feature vectors of the word groups of each piece of sample information;
    modeling the feature vectors of each piece of sample information based on the RoBERTa technique to obtain a learner;
    inputting the test samples into the learner, and computing a test proportion of test samples that pass the test; and
    when the test proportion is less than a target value, adjusting the learner according to the validation samples to obtain the classification model.
  19. The storage medium according to claim 14, wherein, in the preprocessing the text information corresponding to each structure to obtain input information of each structure, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    for the text information corresponding to each structure, segmenting the text information according to preset word groups in a preset dictionary to obtain segmentation positions;
    constructing at least one directed acyclic graph from the segmentation positions;
    computing a probability of each directed acyclic graph according to weights in the preset dictionary;
    determining the segmentation positions corresponding to the directed acyclic graph with the highest probability as target segmentation positions;
    determining first tokens according to the target segmentation positions;
    filtering stop words out of the first tokens according to a stop word list to obtain second tokens;
    computing a proportion of each second token in the training data set;
    deleting second tokens whose proportion is greater than a configured value to obtain third tokens;
    computing a term frequency of the third tokens in the text information, and sorting the third tokens from high to low by term frequency to obtain a queue; and
    selecting the first N characters from the queue as the input information, N being a positive integer.
  20. A classification-model-based paper classification apparatus, wherein the classification-model-based paper classification apparatus comprises:
    an acquisition unit, configured to acquire a paper sample set;
    a processing unit, configured to perform structuring processing on document information of all paper samples in the paper sample set to obtain multiple structures of each paper sample and region information corresponding to each structure;
    a construction unit, configured to construct, based on the multiple structures of each paper sample and the region information corresponding to each structure, a training data set corresponding to each structure;
    a training unit, configured to separately train training samples in each training data set to obtain a classification model corresponding to each structure;
    an extraction unit, configured to acquire a paper to be classified and extract text information from the paper to be classified according to the multiple structures to obtain text information corresponding to each structure;
    a preprocessing unit, configured to preprocess the text information corresponding to each structure to obtain input information of each structure;
    an input unit, configured to input the input information of each structure into the corresponding classification model to obtain each classification model's prediction result for the paper to be classified; and
    a determination unit, configured to determine the prediction result with the highest count as a target result of the paper to be classified.
PCT/CN2020/105627 2020-04-30 2020-07-29 Classification-model-based paper classification method and apparatus, electronic device, and medium WO2021217930A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010368034.1A 2020-04-30 2020-04-30 Classification-model-based paper classification method and apparatus, electronic device, and medium
CN202010368034.1 2020-04-30

Publications (1)

Publication Number Publication Date
WO2021217930A1 (zh)

Family

ID=72330926

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105627 WO2021217930A1 (zh) 2020-04-30 2020-07-29 Classification-model-based paper classification method and apparatus, electronic device, and medium

Country Status (2)

Country Link
CN (1) CN111639181A (zh)
WO (1) WO2021217930A1 (zh)

Families Citing this family (7)

Publication number Priority date Publication date Assignee Title
CN112214515B (zh) 2020-10-16 2024-07-05 深圳赛安特技术服务有限公司: Automatic data matching method and apparatus, electronic device, and storage medium
CN112417147A (zh) 2020-11-05 2021-02-26 腾讯科技(深圳)有限公司: Method and apparatus for selecting training samples
CN112099739B (zh) 2020-11-10 2021-02-23 大象慧云信息技术有限公司: Method and system for classified batch printing of paper invoices
CN112613555A (zh) 2020-12-21 2021-04-06 深圳壹账通智能科技有限公司: Meta-learning-based target classification method, apparatus, device, and storage medium
CN113064973A (zh) 2021-04-12 2021-07-02 平安国际智慧城市科技股份有限公司: Text classification method, apparatus, device, and storage medium
CN117520754B (zh) 2024-01-05 2024-04-12 北京睿企信息科技有限公司: Preprocessing system for model training data
CN118366175B (zh) 2024-06-19 2024-09-24 湖北微模式科技发展有限公司: Character-frequency-based document image classification method

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN104063472B (zh) 2014-06-30 2017-02-15 电子科技大学: KNN text classification method based on an optimized training sample set
US10832003B2 2018-08-26 2020-11-10 CloudMinds Technology, Inc.: Method and system for intent classification
CN109471937A (zh) 2018-10-11 2019-03-15 平安科技(深圳)有限公司: Machine-learning-based text classification method and terminal device

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
US20070239638A1 2006-03-20 2007-10-11 Microsoft Corporation: Text classification by weighted proximal support vector machine
CN105740329A (zh) 2016-01-21 2016-07-06 浙江万里学院: Content semantic mining method for unstructured big data streams
CN109815335A (zh) 2019-01-26 2019-05-28 福州大学: Paper field classification method suitable for literature networks
CN110162797A (zh) 2019-06-21 2019-08-23 北京百度网讯科技有限公司: Article quality detection method and apparatus

Cited By (9)

Publication number Priority date Publication date Assignee Title
CN114254622A (zh) 2021-12-10 2022-03-29 马上消费金融股份有限公司: Intent recognition method and apparatus
CN114548261A (zh) 2022-02-18 2022-05-27 北京百度网讯科技有限公司: Data processing method and apparatus, electronic device, and storage medium
CN114238644A (zh) 2022-02-22 2022-03-25 北京澜舟科技有限公司: Method, system, and storage medium for reducing the computational cost of semantic recognition
CN114969725A (zh) 2022-04-18 2022-08-30 中移互联网有限公司: Target command recognition method and apparatus, electronic device, and readable storage medium
CN114691875A (zh) 2022-04-22 2022-07-01 光大科技有限公司: Data classification and grading processing method and apparatus
CN115203357A (zh) 2022-07-27 2022-10-18 海南绿境高科环保有限公司: Information retrieval and information index updating method, apparatus, device, and medium
CN115562979A (zh) 2022-09-27 2023-01-03 上海艾柯检测科技有限公司: Artificial-intelligence-based method for automatically generating test reports
CN115562979B (zh) 2022-09-27 2023-04-25 上海艾柯检测科技有限公司: Artificial-intelligence-based method for automatically generating test reports
CN117991689A (zh) 2024-01-09 2024-05-07 北京国联视讯信息技术股份有限公司: Digital-twin-based industrial big data simulation method and system

Also Published As

Publication number Publication date
CN111639181A (zh) 2020-09-08

Legal Events

121: EP - the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20933543; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)
32PN: EP - public notification in the EP bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.03.2023))
122: EP - PCT application non-entry in European phase (Ref document number: 20933543; Country of ref document: EP; Kind code of ref document: A1)