CN117114103A - Corpus reconstruction method and device - Google Patents
- Publication number
- CN117114103A (application number CN202311360974.6A)
- Authority
- CN
- China
- Prior art keywords
- corpus
- knowledge base
- sample data
- data
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
Abstract
The invention relates to the technical field of artificial intelligence and in particular provides a corpus reconstruction method and device, the method comprising the following steps: predicting sample data in a corpus with a pre-trained prediction model to obtain prediction results; determining a confusion matrix corresponding to the corpus based on the prediction results; determining the degree of confusion between knowledge base names in the corpus based on the confusion matrix; and merging the sample data corresponding to the knowledge base names in the corpus based on the degree of confusion between those names. The technical solution provided by the invention can reconstruct and optimize the knowledge base through an automatic discrimination technique, ensuring the reliability of the corpus; it can also discriminate the knowledge in previously unseen corpora and expand the existing knowledge base.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for reconstructing a corpus.
Background
In the era of artificial intelligence, mainstream algorithm models are trained on data. The quality of the corpus used to train a deep learning model is therefore critical to the model's performance and reliability. The value of high-quality data lies mainly in ensuring the accuracy and generalization of the model: 1) Deep learning models require a large amount of diverse and representative data to perform well. A high-quality corpus enables the model to learn complex patterns and make accurate predictions. Conversely, if the training data is of low quality, containing errors, bias, or too little variation, the model may produce inaccurate or biased results. 2) Generalization is embodied in the model learning knowledge from training data and applying it to unseen examples. Exposing the model to diverse, high-quality data lets it better understand and adapt to new, unknown examples. Effective generalization ability is critical to the model's performance in real-world scenarios.
At present, the construction and maintenance of corpora mainly depend on manual labeling, which not only increases the cost of data maintenance but also degrades corpus quality, since labeling is affected by the annotators' expertise and condition, thereby reducing corpus reliability.
Some existing automatic corpus expansion technologies mainly expand the corpus by masked word replacement, and the quality of the corpus expanded this way tends to be insufficient. Others use a pre-trained model to label newly introduced knowledge and merge it into the existing knowledge system; however, guaranteeing the accuracy of that knowledge requires a large model, which in turn demands enough data and more labeling resources, and in practice such a model often has no understanding of the knowledge in the various domains involved, so its adaptability is poor.
Disclosure of Invention
In order to overcome the defects, the invention provides a corpus reconstruction method and device.
In a first aspect, a method for reconstructing a corpus is provided, where the method for reconstructing the corpus includes:
predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result;
determining a confusion matrix corresponding to the corpus based on the prediction result;
determining the confusion degree between knowledge base names in the corpus based on the confusion matrix;
and merging sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
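As an illustrative sketch only, the four steps of the first aspect can be strung together as follows; every function and data structure here is an assumption for demonstration, not an API prescribed by the invention, and the merge rule uses the preset range [0.1, 0.5]:

```python
# End-to-end sketch of steps one to four; every function and name
# here is an illustrative assumption, not an API from the invention.
def reconstruct_corpus(corpus, predict):
    names = sorted(corpus)
    idx = {n: i for i, n in enumerate(names)}
    # Steps 1-2: predict every sample, accumulate the confusion matrix
    B = [[0] * len(names) for _ in names]
    for name, samples in corpus.items():
        for text in samples:
            B[idx[name]][idx[predict(text)]] += 1
    # Step 3: confusion degree = row-normalised counts (score_ij)
    score = [[b / max(sum(row), 1) for b in row] for row in B]
    # Step 4: fold base i into base j when confusion is in [0.1, 0.5]
    merged = {n: list(s) for n, s in corpus.items()}
    for i, ni in enumerate(names):
        for j, nj in enumerate(names):
            if i != j and 0.1 <= score[i][j] <= 0.5 and ni in merged:
                merged.setdefault(nj, []).extend(merged.pop(ni))
    return merged

corpus = {"cat": ["meow", "purr"], "dog": ["woof"]}
predict = lambda t: {"meow": "cat", "purr": "dog", "woof": "dog"}[t]
print(sorted(reconstruct_corpus(corpus, predict)))  # ['dog']
```

With half of the "cat" samples predicted as "dog", the confusion degree 0.5 falls inside the preset range, so "cat" is folded into "dog".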
Preferably, the training process of the pre-trained prediction model includes:
splitting sample data in a corpus into a plurality of sample data sets;
the initial predictive model is trained using the plurality of sample data sets.
Further, training the initial prediction model using the plurality of sample data sets includes:
step 1, initializing k=1;
step 2, training the initial prediction model by taking the kth sample data set as test data and the remaining sample data sets among the plurality of sample data sets as training data;
step 3, judging whether k is equal to K; if so, outputting the prediction model; otherwise, setting k=k+1 and returning to step 2;
where K is the total number of sample data sets in the plurality of sample data sets.
Preferably, the element value in the ith row and jth column of the confusion matrix is B_ij, where B_ij is the number of sample data corresponding to the ith knowledge base name that the prediction model predicts as belonging to the jth knowledge base name, i, j ∈ [1, N], and N is the total number of knowledge base names in the corpus.
Further, the confusion degree between knowledge base names in the corpus is score_ij = B_ij / Σ_{j∈[1,N]} B_ij, where score_ij is the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus.
Preferably, merging the sample data corresponding to each knowledge base name in the corpus based on the confusion degree between the knowledge base names in the corpus includes:
and when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus belongs to a preset range, merging sample data corresponding to the ith knowledge base name into sample data of the jth knowledge base name.
Further, the preset range is [0.1,0.5].
Preferably, merging the sample data corresponding to each knowledge base name in the corpus based on the confusion degree between the knowledge base names in the corpus includes:
when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus exceeds a preset value, merging the sample data corresponding to the ith knowledge base name with the sample data of the jth knowledge base name, and assigning a new knowledge base name to the resulting new sample data set.
Further, the preset value is 0.5.
Preferably, the method further comprises:
dividing sample data in a corpus into training data and test data according to a preset proportion;
training the initial prediction model by using the training data and the test data to obtain a data classification model;
and taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
Further, the initial prediction model is a TextCNN model, a DPCNN model or a HAN model.
In a second aspect, a device for reconstructing a corpus is provided, where the device for reconstructing a corpus includes:
the prediction module is used for predicting sample data in the corpus by utilizing a pre-trained prediction model to obtain a prediction result;
the first determining module is used for determining a confusion matrix corresponding to the corpus based on the prediction result;
the second determining module is used for determining the confusion degree among knowledge base names in the corpus based on the confusion matrix;
and the merging module is used for merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
Preferably, the method further comprises:
the segmentation module is used for segmenting sample data in the corpus into training data and test data according to a preset proportion;
the training module is used for training the initial prediction model by utilizing the training data and the test data to obtain a data classification model;
and the amplification module is used for taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
In a third aspect, a computer device is provided, comprising: one or more processors; and a memory for storing one or more programs;
wherein the method for reconstructing a corpus is implemented when the one or more programs are executed by the one or more processors.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed, implements the method for reconstructing a corpus.
The technical scheme provided by the invention has at least one or more of the following beneficial effects:
the invention provides a method and a device for reconstructing a corpus, comprising the following steps: predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result; determining a confusion matrix corresponding to the corpus based on the prediction result; determining the confusion degree between knowledge base names in the corpus based on the confusion matrix; and merging sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus. The technical scheme provided by the invention can reconstruct and optimize the knowledge base by an automatic discrimination technology, ensures the reliability of the corpus, can discriminate the knowledge of the unknown corpus and expands the existing knowledge base.
Drawings
FIG. 1 is a flow chart illustrating main steps of a corpus reconstruction method according to an embodiment of the present invention;
fig. 2 is a main block diagram of a corpus reconstruction device according to an embodiment of the present invention.
Detailed Description
The following describes the embodiments of the present invention in further detail with reference to the drawings.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As disclosed in the background, in the era of artificial intelligence, mainstream algorithm models are trained on data. The quality of the corpus used to train a deep learning model is therefore critical to the model's performance and reliability. The value of high-quality data lies mainly in ensuring the accuracy and generalization of the model: 1) Deep learning models require a large amount of diverse and representative data to perform well. A high-quality corpus enables the model to learn complex patterns and make accurate predictions. Conversely, if the training data is of low quality, containing errors, bias, or too little variation, the model may produce inaccurate or biased results. 2) Generalization is embodied in the model learning knowledge from training data and applying it to unseen examples. Exposing the model to diverse, high-quality data lets it better understand and adapt to new, unknown examples. Effective generalization ability is critical to the model's performance in real-world scenarios.
At present, the construction and maintenance of corpora mainly depend on manual labeling, which not only increases the cost of data maintenance but also degrades corpus quality, since labeling is affected by the annotators' expertise and condition, thereby reducing corpus reliability.
Some existing automatic corpus expansion technologies mainly expand the corpus by masked word replacement, and the quality of the corpus expanded this way tends to be insufficient. Others use a pre-trained model to label newly introduced knowledge and merge it into the existing knowledge system; however, guaranteeing the accuracy of that knowledge requires a large model, which in turn demands enough data and more labeling resources, and in practice such a model often has no understanding of the knowledge in the various domains involved, so its adaptability is poor.
To address the above problems, the present invention provides a corpus reconstruction method and device, comprising: predicting sample data in a corpus with a pre-trained prediction model to obtain prediction results; determining a confusion matrix corresponding to the corpus based on the prediction results; determining the degree of confusion between knowledge base names in the corpus based on the confusion matrix; and merging the sample data corresponding to the knowledge base names in the corpus based on the degree of confusion between those names. The technical solution provided by the invention can reconstruct and optimize the knowledge base through an automatic discrimination technique, ensuring the reliability of the corpus; it can also discriminate the knowledge in previously unseen corpora and expand the existing knowledge base.
The above-described scheme is explained in detail below.
Example 1
Referring to fig. 1, fig. 1 is a schematic flow chart of main steps of a corpus reconstruction method according to an embodiment of the present invention. As shown in fig. 1, the method for reconstructing a corpus in the embodiment of the present invention mainly includes the following steps:
step S101: predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result;
step S102: determining a confusion matrix corresponding to the corpus based on the prediction result;
step S103: determining the confusion degree between knowledge base names in the corpus based on the confusion matrix;
step S104: and merging sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
In this embodiment, the training process of the pre-trained prediction model includes:
splitting sample data in a corpus into a plurality of sample data sets;
the initial predictive model is trained using the plurality of sample data sets.
In one embodiment, the training of the initial predictive model using the plurality of sample data sets includes:
step 1, initializing k=1;
step 2, training the initial prediction model by taking the kth sample data set as test data and the remaining sample data sets among the plurality of sample data sets as training data;
step 3, judging whether k is equal to K; if so, outputting the prediction model; otherwise, setting k=k+1 and returning to step 2;
where K is the total number of sample data sets in the plurality of sample data sets.
For example: assuming k=3, the data is divided equally into three parts a, B, C, first a lightweight model is trained using the a+b data to obtain model (a+b), C is predicted using model (a+b), and since C does not appear in the training corpus of model (a+b), model (a+b) is reliable and objective for C prediction. By analogy, data A can be predicted from model (B+C). The whole knowledge base can be covered and predicted by repeating the method K times. In practice, the data size of K can be flexibly adjusted according to the size of the corpus and training resources, and any scene can be adapted.
In this embodiment, the element value in the ith row and jth column of the confusion matrix is B_ij, where B_ij is the number of sample data corresponding to the ith knowledge base name that the prediction model predicts as belonging to the jth knowledge base name, i, j ∈ [1, N], and N is the total number of knowledge base names in the corpus.
For example: the confusion matrix for the three corpus of cat, dog, pig is as follows:
However, after the confusion matrix is obtained, judging whether overlap exists between different knowledge bases is still a problem, because there is no relevant judgment criterion. To solve this, the present solution measures the degree of confusion with a confusion normalization technique. Specifically, the confusion degree between knowledge base names in the corpus is score_ij = B_ij / Σ_{j∈[1,N]} B_ij, where score_ij is the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus.
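A minimal sketch of this confusion normalization, assuming illustrative cat/dog/pig counts rather than the values from the example above:

```python
# Confusion degree via row normalisation: score_ij = B_ij / sum_j B_ij.
# The cat/dog/pig counts are illustrative assumptions, not the values
# from the patent's example.
names = ["cat", "dog", "pig"]
B = [
    [80, 15, 5],   # true "cat" samples predicted as cat / dog / pig
    [12, 70, 18],  # true "dog" samples
    [3, 10, 87],   # true "pig" samples
]

def confusion_degree(B):
    # Divide each count by its row sum, so each row sums to 1.
    return [[b / sum(row) for b in row] for row in B]

score = confusion_degree(B)
print(round(score[0][1], 2))  # 0.15: confusion between "cat" and "dog"
```

Because each row is normalised to sum to 1, score_ij can be compared directly against the preset range and preset value of the merging rules regardless of how many samples each knowledge base holds.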
In this embodiment, merging the sample data corresponding to the knowledge base names in the corpus based on the confusion between those names includes:
and when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus belongs to a preset range, merging sample data corresponding to the ith knowledge base name into sample data of the jth knowledge base name.
In one embodiment, the predetermined range is [0.1,0.5].
In this embodiment, merging the sample data corresponding to the knowledge base names in the corpus based on the confusion between those names includes:
when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus exceeds a preset value, merging the sample data corresponding to the ith knowledge base name with the sample data of the jth knowledge base name, and assigning a new knowledge base name to the resulting new sample data set.
In one embodiment, the preset value is 0.5.
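The two merging rules above (preset range [0.1, 0.5] and preset value 0.5) can be sketched as follows; the corpus structure and the naming scheme for a merged base are illustrative assumptions:

```python
# Sketch of the two merging rules; the corpus structure and the
# naming scheme for a merged base are illustrative assumptions.
def merge_bases(corpus, score, names, low=0.1, high=0.5):
    corpus = {k: list(v) for k, v in corpus.items()}  # work on a copy
    for i, ni in enumerate(names):
        for j, nj in enumerate(names):
            if i == j or ni not in corpus or nj not in corpus:
                continue
            if score[i][j] > high:
                # above the preset value: merge both under a new name
                corpus[f"{ni}+{nj}"] = corpus.pop(ni) + corpus.pop(nj)
            elif low <= score[i][j] <= high:
                # within the preset range: fold base i's samples into j
                corpus[nj].extend(corpus.pop(ni))
    return corpus

corpus = {"cat": ["meow"], "dog": ["woof"], "pig": ["oink"]}
score = [[0.80, 0.15, 0.05],
         [0.12, 0.84, 0.04],
         [0.03, 0.60, 0.37]]
print(sorted(merge_bases(corpus, score, ["cat", "dog", "pig"])))  # ['pig+dog']
```

Here score[0][1] = 0.15 falls inside the preset range, so "cat" is folded into "dog", while score[2][1] = 0.60 exceeds the preset value, so "pig" and the enlarged "dog" are merged under a new base name.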
In this embodiment, the method further includes:
dividing sample data in a corpus into training data and test data according to a preset proportion;
training the initial prediction model by using the training data and the test data to obtain a data classification model;
and taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
For example, the merged and reconstructed data obtained above is split into a training set and a test set at an 8:2 ratio, and a model capable of understanding the information in the newly merged data is retrained; this model is then used to perceive and identify new data, with which the knowledge base can be augmented.
For a batch of new data, the model trained in step 1, which perceives all the data, can label and classify the new data and merge it into the existing knowledge base. A classification-threshold scheme is adopted to purify the knowledge and ensure its accuracy: when the classification score is below 0.5, the data is placed under an "unknown" classification (meaning the corpus does not belong to any currently known knowledge base) and is merged into the knowledge system only after being manually sorted out.
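A sketch of this amplification step under stated assumptions: `classify` is a hypothetical stand-in for the retrained classification model, returning a (knowledge base name, score) pair, with the 0.5 threshold routing low-confidence data to the unknown class for manual review:

```python
# Sketch of knowledge-base amplification with the 0.5 classification
# threshold; classify is a hypothetical stand-in for the retrained
# model, returning a (knowledge base name, score) pair.
def classify(text):
    rules = {"meow": ("cat", 0.9), "woof": ("dog", 0.8)}
    return rules.get(text, ("cat", 0.3))  # weak guess for unseen text

def amplify(corpus, new_data, threshold=0.5):
    unknown = []  # data whose corpus fits no currently known base
    for text in new_data:
        name, score = classify(text)
        if score >= threshold:
            corpus.setdefault(name, []).append(text)  # confident: merge
        else:
            unknown.append(text)  # below 0.5: route to manual review
    return corpus, unknown

corpus = {"cat": ["purr"], "dog": ["bark"]}
corpus, unknown = amplify(corpus, ["meow", "woof", "quack"])
print(corpus["cat"], unknown)  # ['purr', 'meow'] ['quack']
```

Only confidently classified samples enter the existing bases; everything under the threshold stays in the unknown bucket until a human decides where it belongs.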
In this embodiment, the initial prediction model is a TextCNN model, a DPCNN model, or a HAN model.
Example 2
Based on the same inventive concept, the invention also provides a device for reconstructing a corpus, as shown in fig. 2, wherein the device for reconstructing the corpus comprises:
the prediction module is used for predicting sample data in the corpus by utilizing a pre-trained prediction model to obtain a prediction result;
the first determining module is used for determining a confusion matrix corresponding to the corpus based on the prediction result;
the second determining module is used for determining the confusion degree among knowledge base names in the corpus based on the confusion matrix;
and the merging module is used for merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
Preferably, the training process of the pre-trained prediction model includes:
splitting sample data in a corpus into a plurality of sample data sets;
the initial predictive model is trained using the plurality of sample data sets.
Further, training the initial prediction model using the plurality of sample data sets includes:
step 1, initializing k=1;
step 2, training the initial prediction model by taking the kth sample data set as test data and the remaining sample data sets among the plurality of sample data sets as training data;
step 3, judging whether k is equal to K; if so, outputting the prediction model; otherwise, setting k=k+1 and returning to step 2;
where K is the total number of sample data sets in the plurality of sample data sets.
Preferably, the element value in the ith row and jth column of the confusion matrix is B_ij, where B_ij is the number of sample data corresponding to the ith knowledge base name that the prediction model predicts as belonging to the jth knowledge base name, i, j ∈ [1, N], and N is the total number of knowledge base names in the corpus.
Further, the confusion degree between knowledge base names in the corpus is score_ij = B_ij / Σ_{j∈[1,N]} B_ij, where score_ij is the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus.
Preferably, the merging module is specifically configured to:
and when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus belongs to a preset range, merging sample data corresponding to the ith knowledge base name into sample data of the jth knowledge base name.
Further, the preset range is [0.1,0.5].
Preferably, the merging module is specifically configured to:
when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus exceeds a preset value, merge the sample data corresponding to the ith knowledge base name with the sample data of the jth knowledge base name, and assign a new knowledge base name to the resulting new sample data set.
Further, the preset value is 0.5.
Preferably, the method further comprises:
the segmentation module is used for segmenting sample data in the corpus into training data and test data according to a preset proportion;
the training module is used for training the initial prediction model by utilizing the training data and the test data to obtain a data classification model;
and the amplification module is used for taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
Further, the initial prediction model is a TextCNN model, a DPCNN model or a HAN model.
Example 3
Based on the same inventive concept, the invention also provides a computer device comprising a processor and a memory, the memory storing a computer program comprising program instructions and the processor executing the program instructions stored in the computer storage medium. The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. As the computational and control core of the terminal, the processor is adapted to load and execute one or more instructions in a computer storage medium to implement the corresponding method flow or functions, i.e., the steps of the corpus reconstruction method in the above embodiments.
Example 4
Based on the same inventive concept, the present invention also provides a storage medium, specifically a computer-readable storage medium, i.e., a memory device in a computer device for storing programs and data. It is understood that the computer-readable storage medium here may include both the built-in storage medium of a computer device and any extended storage medium the device supports. The computer-readable storage medium provides a storage space storing the operating system of the terminal. The storage space also holds one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer-readable storage medium here may be high-speed RAM or non-volatile memory, such as at least one magnetic disk memory. The one or more instructions stored in the computer-readable storage medium may be loaded and executed by a processor to implement the steps of the corpus reconstruction method in the above embodiments.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalent substitutions may be made to the specific embodiments of the invention without departing from its spirit and scope, and all such modifications and equivalents are intended to be covered by the claims.
Claims (15)
1. A method for reconstructing a corpus, the method comprising:
predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result;
determining a confusion matrix corresponding to the corpus based on the prediction result;
determining the confusion degree between knowledge base names in the corpus based on the confusion matrix;
and merging sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
2. The method of claim 1, wherein the training process of the pre-trained prediction model comprises:
splitting sample data in the corpus into a plurality of sample data sets;
training an initial prediction model by using the plurality of sample data sets.
3. The method of claim 2, wherein the training an initial prediction model by using the plurality of sample data sets comprises:
step 1, initializing k=1;
step 2, training an initial prediction model by taking a kth sample data set as test data and taking the rest sample data sets in the plurality of sample data sets as training data;
step 3, judging whether k is equal to K; if so, outputting the prediction model; otherwise, letting k=k+1 and returning to step 2;
where K is the total number of sample data sets in the plurality of sample data sets.
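The K-fold procedure of claims 2 and 3 can be sketched in plain Python; the `train_fn` and `eval_fn` callables are illustrative placeholders, not part of the claims:

```python
def train_k_fold(sample_sets, train_fn, eval_fn):
    """Train a prediction model K times, each round holding out the k-th
    sample data set as test data and training on the remaining sets."""
    K = len(sample_sets)  # total number of sample data sets
    results = []
    for k in range(K):  # step 1: initialize k; step 3: stop once k reaches K
        test_data = sample_sets[k]  # step 2: the k-th set is the test data
        train_data = [s for i, s in enumerate(sample_sets) if i != k]
        model = train_fn(train_data)  # train on the remaining sets
        results.append((model, eval_fn(model, test_data)))
    return results
```

Each round yields one trained model plus its score on the held-out set; the claims leave open which of the K models (or what combination of them) serves as the final prediction model.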
4. The method of claim 1, wherein the element in the i-th row and j-th column of the confusion matrix is B_ij, where B_ij is the number of sample data items, among the sample data corresponding to the i-th knowledge base name, that the prediction model predicts as the j-th knowledge base name; i, j ∈ [1, N], and N is the total number of knowledge base names in the corpus.
5. The method of claim 4, wherein the degree of confusion between knowledge base names in the corpus is score_ij = B_ij / Σ_{j∈[1,N]} B_ij, where score_ij is the degree of confusion between the i-th knowledge base name and the j-th knowledge base name in the corpus.
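Claims 4 and 5 together amount to building a confusion matrix and row-normalizing it. A minimal sketch, assuming 0-based label indices and no particular library:

```python
def confusion_scores(true_labels, pred_labels, n):
    """Count B[i][j], the samples of the i-th knowledge base name predicted
    as the j-th name (claim 4), then row-normalize to score_ij (claim 5)."""
    B = [[0] * n for _ in range(n)]
    for t, p in zip(true_labels, pred_labels):
        B[t][p] += 1
    scores = []
    for row in B:
        total = sum(row)  # the denominator in score_ij
        scores.append([x / total if total else 0.0 for x in row])
    return B, scores
```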
6. The method of claim 1, wherein the merging sample data corresponding to each knowledge base name in the corpus based on a degree of confusion between knowledge base names in the corpus comprises:
and when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus belongs to a preset range, merging sample data corresponding to the ith knowledge base name into sample data of the jth knowledge base name.
7. The method of claim 6, wherein the preset range is [0.1, 0.5].
8. The method of claim 1, wherein the merging sample data corresponding to each knowledge base name in the corpus based on a degree of confusion between knowledge base names in the corpus comprises:
when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus exceeds a preset value, merging sample data corresponding to the ith knowledge base name with sample data of the jth knowledge base name, and defining a new knowledge base name by the obtained new sample data set.
9. The method of claim 8, wherein the preset value is 0.5.
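Claims 6 through 9 describe two merge rules keyed to the confusion degree. A rough sketch, assuming the corpus is a dict from knowledge base name to its sample list; the dict representation and the combined name `i+j` are illustrative choices, not part of the claims:

```python
def merge_by_confusion(corpus, names, scores, low=0.1, high=0.5):
    """Merge sample data between knowledge bases by confusion degree:
    - score in [low, high]: fold the i-th name's samples into the j-th (claims 6-7)
    - score above high: union both under a newly defined name (claims 8-9)."""
    merged = dict(corpus)
    for i, ni in enumerate(names):
        for j, nj in enumerate(names):
            if i == j or ni not in merged or nj not in merged:
                continue
            s = scores[i][j]
            if s > high:  # confusion exceeds the preset value
                merged[f"{ni}+{nj}"] = merged.pop(ni) + merged.pop(nj)
            elif s >= low:  # confusion falls in the preset range
                merged[nj] = merged.pop(ni) + merged[nj]
    return merged
```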
10. The method as recited in claim 1, further comprising:
dividing sample data in a corpus into training data and test data according to a preset proportion;
training the initial prediction model by using the training data and the test data to obtain a data classification model;
and taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
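The corpus amplification flow of claim 10 might look like the sketch below; the 80/20 split ratio, the `train_model` callable, and the `.predict` interface are assumptions for illustration:

```python
import random

def amplify_corpus(corpus, unlabeled, train_model, ratio=0.8, seed=0):
    """Split labeled samples into training/test data by a preset ratio,
    train a data classification model, then merge each unlabeled item
    into the knowledge base whose name the model predicts."""
    samples = [(x, name) for name, xs in corpus.items() for x in xs]
    rng = random.Random(seed)
    rng.shuffle(samples)
    cut = int(len(samples) * ratio)
    model = train_model(samples[:cut], samples[cut:])  # returns obj with .predict
    for x in unlabeled:
        corpus[model.predict(x)].append(x)  # fold into the predicted knowledge base
    return corpus
```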
11. The method according to claim 3 or 10, wherein the initial prediction model is a TextCNN model, a DPCNN model or a HAN model.
12. A device for reconstructing a corpus, the device comprising:
the prediction module is used for predicting sample data in the corpus by utilizing a pre-trained prediction model to obtain a prediction result;
the first determining module is used for determining a confusion matrix corresponding to the corpus based on the prediction result;
the second determining module is used for determining the confusion degree among knowledge base names in the corpus based on the confusion matrix;
and the merging module is used for merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
13. The apparatus as recited in claim 12, further comprising:
the segmentation module is used for segmenting sample data in the corpus into training data and test data according to a preset proportion;
the training module is used for training the initial prediction model by utilizing the training data and the test data to obtain a data classification model;
and the amplification module is used for taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
14. A computer device, comprising: one or more processors; and
a storage device for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method of reconstructing a corpus according to any one of claims 1 to 11.
15. A computer readable storage medium, characterized in that a computer program is stored thereon, which computer program, when executed, implements a method of reconstructing a corpus according to any of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311360974.6A CN117114103A (en) | 2023-10-20 | 2023-10-20 | Corpus reconstruction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117114103A true CN117114103A (en) | 2023-11-24 |
Family
ID=88796923
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311360974.6A Pending CN117114103A (en) | 2023-10-20 | 2023-10-20 | Corpus reconstruction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117114103A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180075368A1 (en) * | 2016-09-12 | 2018-03-15 | International Business Machines Corporation | System and Method of Advising Human Verification of Often-Confused Class Predictions |
CN109101579A (en) * | 2018-07-19 | 2018-12-28 | 深圳追科技有限公司 | customer service robot knowledge base ambiguity detection method |
CN113407694A (en) * | 2018-07-19 | 2021-09-17 | 深圳追一科技有限公司 | Customer service robot knowledge base ambiguity detection method, device and related equipment |
CN112000808A (en) * | 2020-09-29 | 2020-11-27 | 迪爱斯信息技术股份有限公司 | Data processing method and device and readable storage medium |
CN113934851A (en) * | 2021-11-25 | 2022-01-14 | 和美(深圳)信息技术股份有限公司 | Data enhancement method and device for text classification and electronic equipment |
CN115879448A (en) * | 2022-01-27 | 2023-03-31 | 北京中关村科金技术有限公司 | Corpus classification method and device, computer readable storage medium and electronic equipment |
CN114528933A (en) * | 2022-02-16 | 2022-05-24 | 西安交通大学 | Data unbalance target identification method, system, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112115721B (en) | Named entity recognition method and device | |
CN111275175B (en) | Neural network training method, device, image classification method, device and medium | |
CN110163376B (en) | Sample detection method, media object identification method, device, terminal and medium | |
CN111160959B (en) | User click conversion prediction method and device | |
CN110969600A (en) | Product defect detection method and device, electronic equipment and storage medium | |
CN114155477B (en) | Semi-supervised video paragraph positioning method based on average teacher model | |
CN116756041A (en) | Code defect prediction and positioning method and device, storage medium and computer equipment | |
CN110188798B (en) | Object classification method and model training method and device | |
CN111652286A (en) | Object identification method, device and medium based on graph embedding | |
CN113010420B (en) | Method and terminal equipment for promoting co-evolution of test codes and product codes | |
CN114419420A (en) | Model detection method, device, equipment and storage medium | |
CN112163132B (en) | Data labeling method and device, storage medium and electronic equipment | |
CN111290953A (en) | Method and device for analyzing test logs | |
CN117114103A (en) | Corpus reconstruction method and device | |
CN111611796A (en) | Hypernym determination method and device for hyponym, electronic device and storage medium | |
CN111832435A (en) | Beauty prediction method and device based on migration and weak supervision and storage medium | |
CN116306606A (en) | Financial contract term extraction method and system based on incremental learning | |
CN116681967A (en) | Target detection method, device, equipment and medium | |
KR102413588B1 (en) | Object recognition model recommendation method, system and computer program according to training data | |
CN114912513A (en) | Model training method, information identification method and device | |
CN117523218A (en) | Label generation, training of image classification model and image classification method and device | |
CN117083621A (en) | Detector training method, device and storage medium | |
CN112906728B (en) | Feature comparison method, device and equipment | |
CN113139332A (en) | Automatic model construction method, device and equipment | |
CN113240565B (en) | Target identification method, device, equipment and storage medium based on quantization model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||