CN117114103A - Corpus reconstruction method and device - Google Patents

Corpus reconstruction method and device

Info

Publication number
CN117114103A
CN117114103A (application CN202311360974.6A)
Authority
CN
China
Prior art keywords
corpus
knowledge base
sample data
data
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311360974.6A
Other languages
Chinese (zh)
Inventor
郑蓉蓉
薛文婷
王晨辉
曾京文
于霄洋
杨林傲
武志栋
罗大勇
张韬
刘亚庆
殷红涛
张哲宁
魏家辉
曹津平
袁韶祖
祝天刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Siji Digital Technology Beijing Co ltd
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Original Assignee
State Grid Siji Digital Technology Beijing Co ltd
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Siji Digital Technology Beijing Co ltd, State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd filed Critical State Grid Siji Digital Technology Beijing Co ltd
Priority to CN202311360974.6A priority Critical patent/CN117114103A/en
Publication of CN117114103A publication Critical patent/CN117114103A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular provides a corpus reconstruction method and device, wherein the method includes the following steps: predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result; determining a confusion matrix corresponding to the corpus based on the prediction result; determining the confusion degree between knowledge base names in the corpus based on the confusion matrix; and merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degrees among the knowledge base names. The technical scheme provided by the invention can reconstruct and optimize the knowledge base through an automatic discrimination technique, ensuring the reliability of the corpus; it can also discriminate knowledge from an unknown corpus and expand the existing knowledge base.

Description

Corpus reconstruction method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for reconstructing a corpus.
Background
In the artificial intelligence era, mainstream algorithm models are trained on data. The quality of the corpus used to train a deep learning model is therefore critical to the model's performance and reliability. The value of high-quality data is mainly reflected in ensuring the accuracy and generalization of the model: 1) Deep learning models require a large amount of diverse and representative data to perform well. A high-quality corpus enables the model to learn complex patterns and make accurate predictions. Conversely, if the training data is of low quality (containing errors, bias, or too little variation), the model may produce inaccurate or biased results. 2) Generalization is embodied in a deep learning model learning knowledge from training data and applying it to unseen examples. By exposing the model to diverse, high-quality data, it can better understand and adapt to new, unknown examples. Effective generalization is critical to the model's performance in real-world scenarios.
At present, the construction and maintenance of a corpus mainly depend on manual labeling, which not only increases the cost of data maintenance but also degrades corpus quality, since labeling is affected by the annotators' expertise and condition, thereby reducing corpus reliability.
Some existing automatic corpus expansion techniques mainly expand a corpus by masked word replacement, but the corpus quality produced in this way is often insufficient. Alternatively, a pre-trained model is used to label newly introduced knowledge and combine it with the existing knowledge system; however, to guarantee the accuracy of the knowledge, a large model must be adopted, which in turn requires enough data and more labeling resources, and in practice such a model often lacks knowledge of the various specialized domains, so its adaptability is poor.
Disclosure of Invention
In order to overcome the defects, the invention provides a corpus reconstruction method and device.
In a first aspect, a method for reconstructing a corpus is provided, where the method for reconstructing the corpus includes:
predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result;
determining a confusion matrix corresponding to the corpus based on the prediction result;
determining the confusion degree between knowledge base names in the corpus based on the confusion matrix;
and merging sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
Preferably, the training process of the pre-trained prediction model includes:
splitting sample data in a corpus into a plurality of sample data sets;
the initial predictive model is trained using the plurality of sample data sets.
Further, the training the initial predictive model using the plurality of sample data sets includes:
step 1, initializing k=1;
step 2, training an initial prediction model by taking a kth sample data set as test data and taking the rest sample data sets in the plurality of sample data sets as training data;
step 3, judging whether k is equal to K, if yes, outputting a prediction model, otherwise, enabling k=k+1 and returning to the step 2;
where K is the total number of sample data sets in the plurality of sample data sets.
Preferably, the element value in the ith row and jth column of the confusion matrix is B_ij, where B_ij is the number of sample data items, among the sample data corresponding to the ith knowledge base name, that the prediction model predicts as the jth knowledge base name; i, j ∈ [1, N], and N is the total number of knowledge base names in the corpus.
Further, the confusion degree between knowledge base names in the corpus is score_ij = B_ij / Σ_{j∈[1,N]} B_ij, where score_ij is the confusion degree between the ith knowledge base name and the jth knowledge base name, i.e., B_ij normalized by the sum of the ith row of the confusion matrix.
Preferably, merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree between the knowledge base names in the corpus includes:
and when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus belongs to a preset range, merging sample data corresponding to the ith knowledge base name into sample data of the jth knowledge base name.
Further, the preset range is [0.1,0.5].
Preferably, merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree between the knowledge base names in the corpus includes:
when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus exceeds a preset value, merging the sample data corresponding to the ith knowledge base name with the sample data of the jth knowledge base name, and defining a new knowledge base name for the resulting sample data set.
Further, the preset value is 0.5.
Preferably, the method further comprises:
dividing sample data in a corpus into training data and test data according to a preset proportion;
training the initial prediction model by using the training data and the test data to obtain a data classification model;
and taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
Further, the initial prediction model is a TextCNN model, a DPCNN model or a HAN model.
In a second aspect, a device for reconstructing a corpus is provided, where the device for reconstructing a corpus includes:
the prediction module is used for predicting sample data in the corpus by utilizing a pre-trained prediction model to obtain a prediction result;
the first determining module is used for determining a confusion matrix corresponding to the corpus based on the prediction result;
the second determining module is used for determining the confusion degree among knowledge base names in the corpus based on the confusion matrix;
and the merging module is used for merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
Preferably, the method further comprises:
the segmentation module is used for segmenting sample data in the corpus into training data and test data according to a preset proportion;
the training module is used for training the initial prediction model by utilizing the training data and the test data to obtain a data classification model;
and the amplification module is used for taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
In a third aspect, there is provided a computer device comprising: one or more processors;
and a memory for storing one or more programs;
wherein the method of corpus reconstruction is implemented when the one or more programs are executed by the one or more processors.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed, implements the method for reconstructing a corpus.
The technical scheme provided by the invention has at least one or more of the following beneficial effects:
the invention provides a method and a device for reconstructing a corpus, comprising the following steps: predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result; determining a confusion matrix corresponding to the corpus based on the prediction result; determining the confusion degree between knowledge base names in the corpus based on the confusion matrix; and merging sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus. The technical scheme provided by the invention can reconstruct and optimize the knowledge base by an automatic discrimination technology, ensures the reliability of the corpus, can discriminate the knowledge of the unknown corpus and expands the existing knowledge base.
Drawings
FIG. 1 is a flow chart illustrating main steps of a corpus reconstruction method according to an embodiment of the present invention;
fig. 2 is a main block diagram of a corpus reconstruction device according to an embodiment of the present invention.
Detailed Description
The following describes the embodiments of the present invention in further detail with reference to the drawings.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As disclosed in the background, in the artificial intelligence era, mainstream algorithm models are trained on data. The quality of the corpus used to train a deep learning model is therefore critical to the model's performance and reliability. The value of high-quality data is mainly reflected in ensuring the accuracy and generalization of the model: 1) Deep learning models require a large amount of diverse and representative data to perform well. A high-quality corpus enables the model to learn complex patterns and make accurate predictions. Conversely, if the training data is of low quality (containing errors, bias, or too little variation), the model may produce inaccurate or biased results. 2) Generalization is embodied in a deep learning model learning knowledge from training data and applying it to unseen examples. By exposing the model to diverse, high-quality data, it can better understand and adapt to new, unknown examples. Effective generalization is critical to the model's performance in real-world scenarios.
At present, the construction and maintenance of a corpus mainly depend on manual labeling, which not only increases the cost of data maintenance but also degrades corpus quality, since labeling is affected by the annotators' expertise and condition, thereby reducing corpus reliability.
Some existing automatic corpus expansion techniques mainly expand a corpus by masked word replacement, but the corpus quality produced in this way is often insufficient. Alternatively, a pre-trained model is used to label newly introduced knowledge and combine it with the existing knowledge system; however, to guarantee the accuracy of the knowledge, a large model must be adopted, which in turn requires enough data and more labeling resources, and in practice such a model often lacks knowledge of the various specialized domains, so its adaptability is poor.
To address the above problems, the present invention provides a corpus reconstruction method and device, including: predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result; determining a confusion matrix corresponding to the corpus based on the prediction result; determining the confusion degree between knowledge base names in the corpus based on the confusion matrix; and merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degrees among the knowledge base names. The technical scheme provided by the invention can reconstruct and optimize the knowledge base through an automatic discrimination technique, ensuring the reliability of the corpus; it can also discriminate knowledge from an unknown corpus and expand the existing knowledge base.
The above-described scheme is explained in detail below.
Example 1
Referring to fig. 1, fig. 1 is a schematic flow chart of main steps of a corpus reconstruction method according to an embodiment of the present invention. As shown in fig. 1, the method for reconstructing a corpus in the embodiment of the present invention mainly includes the following steps:
step S101: predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result;
step S102: determining a confusion matrix corresponding to the corpus based on the prediction result;
step S103: determining the confusion degree between knowledge base names in the corpus based on the confusion matrix;
step S104: and merging sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
In this embodiment, the training process of the pre-trained prediction model includes:
splitting sample data in a corpus into a plurality of sample data sets;
the initial predictive model is trained using the plurality of sample data sets.
In one embodiment, the training of the initial predictive model using the plurality of sample data sets includes:
step 1, initializing k=1;
step 2, training an initial prediction model by taking a kth sample data set as test data and taking the rest sample data sets in the plurality of sample data sets as training data;
step 3, judging whether k is equal to K, if yes, outputting a prediction model, otherwise, enabling k=k+1 and returning to the step 2;
where K is the total number of sample data sets in the plurality of sample data sets.
For example: assuming K=3, the data is divided equally into three parts A, B and C. First, a lightweight model is trained using the A+B data to obtain model(A+B), and C is predicted with model(A+B); since C does not appear in the training corpus of model(A+B), the predictions of model(A+B) on C are reliable and objective. By analogy, A can be predicted by model(B+C), and so on. Repeating this procedure K times covers the whole knowledge base with predictions. In practice, K can be flexibly adjusted according to the corpus size and the available training resources, so the method adapts to any scenario.
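The K-fold cross-prediction loop described above can be sketched as follows. This is a minimal illustration, not the patented implementation: `train` and `predict` are placeholder callables standing in for training an initial prediction model (e.g. a TextCNN) and running inference with it, and the round-robin split is just one possible way to form the K sample data sets.

```python
def kfold_cross_predict(samples, labels, K, train, predict):
    """Cover the whole corpus with out-of-fold predictions.

    For k = 1..K, fold k is held out as test data, the remaining
    K-1 folds serve as training data, and the held-out fold is then
    predicted by the freshly trained model, so every sample receives
    a prediction from a model that never saw it during training.
    """
    n = len(samples)
    folds = [list(range(k, n, K)) for k in range(K)]  # round-robin split
    predictions = [None] * n
    for k in range(K):
        held_out = set(folds[k])
        train_idx = [i for i in range(n) if i not in held_out]
        model = train([samples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        for i in folds[k]:
            predictions[i] = predict(model, samples[i])
    return predictions
```

With K=3 this reproduces the A/B/C example: fold C is predicted by a model trained on A+B, and so on, until the whole corpus is covered.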
In this embodiment, the element value in the ith row and jth column of the confusion matrix is B_ij, where B_ij is the number of sample data items, among the sample data corresponding to the ith knowledge base name, that the prediction model predicts as the jth knowledge base name; i, j ∈ [1, N], and N is the total number of knowledge base names in the corpus.
For example: the confusion matrix for a corpus containing the three knowledge bases cat, dog and pig is as follows:
however, after obtaining the confusion matrix, how to judge whether the coincidence exists between different knowledge bases is also a problem because of the lack of relevant judgment marks. To solve this problem, the present solution proposes to measure the degree of confusion by using a confusion normalization technique. Specifically, the confusion degree between knowledge base names in the corpus is score ij =B ij /∑ j∈[1,N] B ij Wherein, score ij Is the confusion between the ith knowledge base name and the jth knowledge base name in the corpus.
In this embodiment, merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree between the knowledge base names includes:
and when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus belongs to a preset range, merging sample data corresponding to the ith knowledge base name into sample data of the jth knowledge base name.
In one embodiment, the predetermined range is [0.1,0.5].
In this embodiment, merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree between the knowledge base names includes:
when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus exceeds a preset value, merging the sample data corresponding to the ith knowledge base name with the sample data of the jth knowledge base name, and defining a new knowledge base name for the resulting sample data set.
In one embodiment, the preset value is 0.5.
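The two merge rules can be illustrated with the following sketch, using the thresholds [0.1, 0.5] and 0.5 from the embodiment. The dict-based corpus representation and the derived name for the fused base are assumptions for illustration only, not the patent's data model.

```python
def merge_bases(corpus, score, names, low=0.1, high=0.5):
    """Apply the two merge rules to a corpus.

    corpus: dict mapping knowledge base name -> list of samples.
    score:  row-normalized confusion degrees, score[i][j].
    If low <= score[i][j] <= high, base i's samples are absorbed into
    base j; if score[i][j] > high, both bases are fused under a new name.
    """
    merged = {name: list(samples) for name, samples in corpus.items()}
    for i, ni in enumerate(names):
        for j, nj in enumerate(names):
            if i == j or ni not in merged or nj not in merged:
                continue  # skip the diagonal and already-merged bases
            s = score[i][j]
            if s > high:
                # High overlap: fuse both bases and define a new name.
                merged[ni + "+" + nj] = merged.pop(ni) + merged.pop(nj)
            elif low <= s <= high:
                # Moderate overlap: fold base i's samples into base j.
                merged[nj] += merged.pop(ni)
    return merged
```

Note the high-overlap rule is checked first, mirroring the order in which the embodiment states the two cases.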
In this embodiment, the method further includes:
dividing sample data in a corpus into training data and test data according to a preset proportion;
training the initial prediction model by using the training data and the test data to obtain a data classification model;
and taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
For example, the merged and reconstructed data is split into a training set and a test set in an 8:2 ratio, and a model capable of understanding the newly merged data is retrained; this model is used to perceive and identify new data, with which the knowledge base can be augmented.
For a batch of new data, the data can be labeled and classified by the model trained in step 1, which perceives all the data, and then merged into the existing knowledge base. To ensure the accuracy of the knowledge, a classification-threshold scheme is adopted to purify it: when the classification score is lower than 0.5, the sample is placed under the "unknown" category (meaning the corpus does not belong to any currently known knowledge base) and is merged into the knowledge system only after manual combing.
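The threshold-based routing of new data described above can be sketched as follows. This is a hedged illustration: `classify` is a placeholder for the retrained data classification model, assumed here to return a (knowledge base name, confidence) pair; samples scoring below the 0.5 threshold are routed to an "unknown" bucket for manual review instead of being merged automatically.

```python
def route_new_data(new_samples, classify, threshold=0.5):
    """Split new samples into accepted-per-base and unknown buckets.

    classify(sample) -> (base_name, confidence); samples with
    confidence >= threshold are merged into the named base, the rest
    are held out as not belonging to any currently known base.
    """
    accepted, unknown = {}, []
    for sample in new_samples:
        name, confidence = classify(sample)
        if confidence >= threshold:
            accepted.setdefault(name, []).append(sample)
        else:
            unknown.append(sample)  # queued for manual combing
    return accepted, unknown
```

Anything landing in the unknown bucket would, per the embodiment, define new knowledge only after a human has combed through it.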
In this embodiment, the initial prediction model is a TextCNN model, a DPCNN model, or a HAN model.
Example 2
Based on the same inventive concept, the invention also provides a device for reconstructing a corpus, as shown in fig. 2, wherein the device for reconstructing the corpus comprises:
the prediction module is used for predicting sample data in the corpus by utilizing a pre-trained prediction model to obtain a prediction result;
the first determining module is used for determining a confusion matrix corresponding to the corpus based on the prediction result;
the second determining module is used for determining the confusion degree among knowledge base names in the corpus based on the confusion matrix;
and the merging module is used for merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
Preferably, the training process of the pre-trained prediction model includes:
splitting sample data in a corpus into a plurality of sample data sets;
the initial predictive model is trained using the plurality of sample data sets.
Further, the training the initial predictive model using the plurality of sample data sets includes:
step 1, initializing k=1;
step 2, training an initial prediction model by taking a kth sample data set as test data and taking the rest sample data sets in the plurality of sample data sets as training data;
step 3, judging whether k is equal to K, if yes, outputting a prediction model, otherwise, enabling k=k+1 and returning to the step 2;
where K is the total number of sample data sets in the plurality of sample data sets.
Preferably, the element value in the ith row and jth column of the confusion matrix is B_ij, where B_ij is the number of sample data items, among the sample data corresponding to the ith knowledge base name, that the prediction model predicts as the jth knowledge base name; i, j ∈ [1, N], and N is the total number of knowledge base names in the corpus.
Further, the confusion degree between knowledge base names in the corpus is score_ij = B_ij / Σ_{j∈[1,N]} B_ij, where score_ij is the confusion degree between the ith knowledge base name and the jth knowledge base name, i.e., B_ij normalized by the sum of the ith row of the confusion matrix.
Preferably, the merging module is specifically configured to:
and when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus belongs to a preset range, merging sample data corresponding to the ith knowledge base name into sample data of the jth knowledge base name.
Further, the preset range is [0.1,0.5].
Preferably, the merging module is specifically configured to:
when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus exceeds a preset value, merge the sample data corresponding to the ith knowledge base name with the sample data of the jth knowledge base name, and define a new knowledge base name for the resulting sample data set.
Further, the preset value is 0.5.
Preferably, the method further comprises:
the segmentation module is used for segmenting sample data in the corpus into training data and test data according to a preset proportion;
the training module is used for training the initial prediction model by utilizing the training data and the test data to obtain a data classification model;
and the amplification module is used for taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
Further, the initial prediction model is a TextCNN model, a DPCNN model or a HAN model.
Example 3
Based on the same inventive concept, the invention also provides a computer device comprising a processor and a memory, the memory being used for storing a computer program comprising program instructions and the processor being used for executing the program instructions stored in the computer storage medium. The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. As the computational and control core of the terminal, it is adapted to implement one or more instructions, and in particular to load and execute one or more instructions in a computer storage medium to realize the corresponding method flow or corresponding functions, i.e., the steps of the corpus reconstruction method in the above embodiments.
Example 4
Based on the same inventive concept, the present invention also provides a storage medium, in particular, a computer readable storage medium (Memory), which is a Memory device in a computer device, for storing programs and data. It is understood that the computer readable storage medium herein may include both built-in storage media in a computer device and extended storage media supported by the computer device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer readable storage medium herein may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory. One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to implement the steps of a method for corpus reconstruction in the above embodiments.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalent substitutions may still be made to the specific embodiments of the invention without departing from its spirit and scope, and all such modifications and equivalents are intended to be covered by the claims.

Claims (15)

1. A method for reconstructing a corpus, the method comprising:
predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result;
determining a confusion matrix corresponding to the corpus based on the prediction result;
determining the confusion degree between knowledge base names in the corpus based on the confusion matrix;
and merging sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
2. The method of claim 1, wherein the training process of the pre-trained predictive model comprises:
splitting sample data in a corpus into a plurality of sample data sets;
the initial predictive model is trained using the plurality of sample data sets.
3. The method of claim 2, wherein training the initial predictive model with the plurality of sample data sets comprises:
step 1, initializing k=1;
step 2, training an initial prediction model by taking a kth sample data set as test data and taking the rest sample data sets in the plurality of sample data sets as training data;
step 3, judging whether k is equal to K, if yes, outputting a prediction model, otherwise, enabling k=k+1 and returning to the step 2;
where K is the total number of sample data sets in the plurality of sample data sets.
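The loop of claims 2 and 3 is a K-fold procedure in which each sample data set serves once as test data. A minimal sketch in Python, assuming the corpus has already been split into K sets and a hypothetical `train_fn(train_data, test_data)` interface that returns a trained model:

```python
def k_fold_train(sample_sets, train_fn):
    """Train an initial prediction model over K folds (claims 2-3).

    sample_sets: list of K sample data sets split from the corpus.
    train_fn: hypothetical callable(train_data, test_data) -> model.
    The k-th set is used as test data and the remaining sets as
    training data; after the K-th fold, the model is output (step 3).
    """
    model = None
    K = len(sample_sets)
    for k in range(K):  # step 1: start at the first fold
        test_data = sample_sets[k]  # step 2: k-th set is the test data
        train_data = [x for i, fold in enumerate(sample_sets)
                      if i != k for x in fold]
        model = train_fn(train_data, test_data)
    return model  # step 3: k has reached K, output the prediction model
```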
4. The method of claim 1, wherein the element in the ith row and jth column of the confusion matrix is B_ij, where B_ij is the number of sample data predicted as the jth knowledge base name when the prediction model predicts the sample data corresponding to the ith knowledge base name, i, j ∈ [1, N], and N is the total number of knowledge base names in the corpus.
5. The method of claim 4, wherein the confusion degree between knowledge base names in the corpus is score_ij = B_ij / Σ_{j∈[1,N]} B_ij, where score_ij is the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus.
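Claim 5's confusion degree is a row normalization of claim 4's count matrix. A minimal sketch, assuming the matrix is given as plain Python lists of counts:

```python
def confusion_degree(B):
    """Compute score_ij = B_ij / (sum over j of B_ij) (claim 5).

    B[i][j] counts samples of the ith knowledge base name that the
    prediction model labeled as the jth name (claim 4); each row is
    normalized by its total, so the degrees in a row sum to 1.
    """
    return [[b / sum(row) for b in row] for row in B]
```

For example, a row of counts [8, 2] yields confusion degrees [0.8, 0.2].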
6. The method of claim 1, wherein the merging sample data corresponding to each knowledge base name in the corpus based on a degree of confusion between knowledge base names in the corpus comprises:
and when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus belongs to a preset range, merging sample data corresponding to the ith knowledge base name into sample data of the jth knowledge base name.
7. The method of claim 6, wherein the preset range is [0.1, 0.5].
8. The method of claim 1, wherein the merging sample data corresponding to each knowledge base name in the corpus based on a degree of confusion between knowledge base names in the corpus comprises:
when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus exceeds a preset value, merging sample data corresponding to the ith knowledge base name with sample data of the jth knowledge base name, and defining a new knowledge base name by the obtained new sample data set.
9. The method of claim 8, wherein the preset value is 0.5.
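Claims 6 through 9 describe two merge rules keyed to the confusion degree: merge the ith base's samples into the jth when the degree falls in the preset range, or merge both into a newly named base when it exceeds the preset value. A sketch of that decision, using the thresholds of claims 7 and 9 and illustrative return labels:

```python
def merge_decision(score_ij, low=0.1, high=0.5):
    """Choose a merge action for knowledge bases i and j (claims 6-9).

    Returns 'merge_i_into_j' when score_ij lies within [low, high]
    (claims 6-7), 'merge_into_new_base' when score_ij exceeds high
    (claims 8-9), and 'keep_separate' otherwise. The label strings
    are illustrative, not part of the claims.
    """
    if score_ij > high:
        return 'merge_into_new_base'
    if low <= score_ij <= high:
        return 'merge_i_into_j'
    return 'keep_separate'
```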
10. The method as recited in claim 1, further comprising:
dividing sample data in a corpus into training data and test data according to a preset proportion;
training the initial prediction model by using the training data and the test data to obtain a data classification model;
and taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
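Claim 10's first step, splitting the corpus into training and test data at a preset proportion, can be sketched as follows (the ratio and seed are illustrative assumptions):

```python
import random

def split_corpus(samples, train_ratio=0.8, seed=0):
    """Split sample data into training and test sets at a preset
    proportion (claim 10); shuffling keeps the split unbiased."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```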
11. The method according to claim 3 or 10, wherein the initial prediction model is a TextCNN model, a DPCNN model or a HAN model.
12. A device for reconstructing a corpus, the device comprising:
the prediction module is used for predicting sample data in the corpus by utilizing a pre-trained prediction model to obtain a prediction result;
the first determining module is used for determining a confusion matrix corresponding to the corpus based on the prediction result;
the second determining module is used for determining the confusion degree among knowledge base names in the corpus based on the confusion matrix;
and the merging module is used for merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
13. The apparatus as recited in claim 12, further comprising:
the segmentation module is used for segmenting sample data in the corpus into training data and test data according to a preset proportion;
the training module is used for training the initial prediction model by utilizing the training data and the test data to obtain a data classification model;
and the amplification module is used for taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
14. A computer device, comprising:
one or more processors; and
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for reconstructing a corpus according to any one of claims 1 to 11.
15. A computer readable storage medium, characterized in that a computer program is stored thereon, which computer program, when executed, implements a method of reconstructing a corpus according to any of claims 1 to 11.
CN202311360974.6A 2023-10-20 2023-10-20 Corpus reconstruction method and device Pending CN117114103A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311360974.6A CN117114103A (en) 2023-10-20 2023-10-20 Corpus reconstruction method and device

Publications (1)

Publication Number Publication Date
CN117114103A true CN117114103A (en) 2023-11-24

Family

ID=88796923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311360974.6A Pending CN117114103A (en) 2023-10-20 2023-10-20 Corpus reconstruction method and device

Country Status (1)

Country Link
CN (1) CN117114103A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075368A1 (en) * 2016-09-12 2018-03-15 International Business Machines Corporation System and Method of Advising Human Verification of Often-Confused Class Predictions
CN109101579A (en) * 2018-07-19 2018-12-28 深圳追科技有限公司 customer service robot knowledge base ambiguity detection method
CN113407694A (en) * 2018-07-19 2021-09-17 深圳追一科技有限公司 Customer service robot knowledge base ambiguity detection method, device and related equipment
CN112000808A (en) * 2020-09-29 2020-11-27 迪爱斯信息技术股份有限公司 Data processing method and device and readable storage medium
CN113934851A (en) * 2021-11-25 2022-01-14 和美(深圳)信息技术股份有限公司 Data enhancement method and device for text classification and electronic equipment
CN115879448A (en) * 2022-01-27 2023-03-31 北京中关村科金技术有限公司 Corpus classification method and device, computer readable storage medium and electronic equipment
CN114528933A (en) * 2022-02-16 2022-05-24 西安交通大学 Data unbalance target identification method, system, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination