CN117114103A - Corpus reconstruction method and device - Google Patents

Corpus reconstruction method and device

Info

Publication number
CN117114103A
CN117114103A (application CN202311360974.6A)
Authority
CN
China
Prior art keywords
corpus
knowledge base
sample data
data
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311360974.6A
Other languages
Chinese (zh)
Inventor
郑蓉蓉
薛文婷
王晨辉
曾京文
于霄洋
杨林傲
武志栋
罗大勇
张韬
刘亚庆
殷红涛
张哲宁
魏家辉
曹津平
袁韶祖
祝天刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Siji Digital Technology Beijing Co ltd
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Original Assignee
State Grid Siji Digital Technology Beijing Co ltd
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Siji Digital Technology Beijing Co ltd, State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd filed Critical State Grid Siji Digital Technology Beijing Co ltd
Priority to CN202311360974.6A priority Critical patent/CN117114103A/en
Publication of CN117114103A publication Critical patent/CN117114103A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular provides a corpus reconstruction method and device, wherein the method includes the following steps: predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result; determining a confusion matrix corresponding to the corpus based on the prediction result; determining the confusion degree between knowledge base names in the corpus based on the confusion matrix; and merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degrees among the knowledge base names. The technical scheme provided by the invention can reconstruct and optimize the knowledge base through an automatic discrimination technique, ensuring the reliability of the corpus; it can also discriminate knowledge from an unknown corpus and expand the existing knowledge base.

Description

Corpus reconstruction method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for reconstructing a corpus.
Background
In the artificial intelligence era, mainstream algorithm models are trained on data. The quality of the corpus used to train a deep learning model is therefore critical to the model's performance and reliability. The value of high-quality data is mainly reflected in ensuring the accuracy and generalization of the model: 1) Deep learning models require a large amount of diverse and representative data to perform well. A high-quality corpus enables the model to learn complex patterns and make accurate predictions. Conversely, if the training data is of low quality (containing errors, bias, or too little variation), the model may produce inaccurate or biased results. 2) Generalization is embodied in a deep learning model learning knowledge from training data and applying it to unseen examples. By exposing the model to diverse, high-quality data, it can better understand and adapt to new, unknown examples. Effective generalization is critical to the model's performance in real-world scenarios.
At present, the construction and maintenance of a corpus mainly depend on manual labeling, which not only increases the cost of data maintenance but also degrades corpus quality, since labeling is affected by the annotators' expertise and condition, thereby reducing corpus reliability.
Some existing automatic corpus expansion techniques mainly expand a corpus by masked word replacement, but the corpus quality produced in this way is often insufficient. Alternatively, a pre-trained model is used to label newly introduced knowledge and combine it with the existing knowledge system; however, to guarantee the accuracy of the knowledge, a large model must be adopted, which in turn requires enough data and more labeling resources, and in practice such a model often lacks knowledge of the various specialized domains, so its adaptability is poor.
Disclosure of Invention
In order to overcome the defects, the invention provides a corpus reconstruction method and device.
In a first aspect, a method for reconstructing a corpus is provided, where the method for reconstructing the corpus includes:
predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result;
determining a confusion matrix corresponding to the corpus based on the prediction result;
determining the confusion degree between knowledge base names in the corpus based on the confusion matrix;
and merging sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
Preferably, the training process of the pre-trained prediction model includes:
splitting sample data in a corpus into a plurality of sample data sets;
the initial predictive model is trained using the plurality of sample data sets.
Further, the training the initial predictive model using the plurality of sample data sets includes:
step 1, initializing k=1;
step 2, training an initial prediction model by taking a kth sample data set as test data and taking the rest sample data sets in the plurality of sample data sets as training data;
step 3, judging whether k is equal to K, if yes, outputting a prediction model, otherwise, enabling k=k+1 and returning to the step 2;
where K is the total number of sample data sets in the plurality of sample data sets.
Preferably, the element value in the ith row and jth column of the confusion matrix is B_ij, where B_ij is the number of sample data items, among the sample data corresponding to the ith knowledge base name, that the prediction model predicts as the jth knowledge base name; i, j ∈ [1, N], and N is the total number of knowledge base names in the corpus.
Further, the confusion degree between knowledge base names in the corpus is score_ij = B_ij / Σ_{j∈[1,N]} B_ij, where score_ij is the confusion degree between the ith knowledge base name and the jth knowledge base name, i.e., B_ij normalized by the sum of the ith row of the confusion matrix.
Preferably, merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree between the knowledge base names in the corpus includes:
and when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus belongs to a preset range, merging sample data corresponding to the ith knowledge base name into sample data of the jth knowledge base name.
Further, the preset range is [0.1,0.5].
Preferably, merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree between the knowledge base names in the corpus includes:
when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus exceeds a preset value, merging the sample data corresponding to the ith knowledge base name with the sample data of the jth knowledge base name, and defining a new knowledge base name for the resulting sample data set.
Further, the preset value is 0.5.
Preferably, the method further comprises:
dividing sample data in a corpus into training data and test data according to a preset proportion;
training the initial prediction model by using the training data and the test data to obtain a data classification model;
and taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
Further, the initial prediction model is a TextCNN model, a DPCNN model or a HAN model.
In a second aspect, a device for reconstructing a corpus is provided, where the device for reconstructing a corpus includes:
the prediction module is used for predicting sample data in the corpus by utilizing a pre-trained prediction model to obtain a prediction result;
the first determining module is used for determining a confusion matrix corresponding to the corpus based on the prediction result;
the second determining module is used for determining the confusion degree among knowledge base names in the corpus based on the confusion matrix;
and the merging module is used for merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
Preferably, the method further comprises:
the segmentation module is used for segmenting sample data in the corpus into training data and test data according to a preset proportion;
the training module is used for training the initial prediction model by utilizing the training data and the test data to obtain a data classification model;
and the amplification module is used for taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
In a third aspect, there is provided a computer device comprising: one or more processors;
and a memory for storing one or more programs;
wherein the method of corpus reconstruction is implemented when the one or more programs are executed by the one or more processors.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed, implements the method for reconstructing a corpus.
The technical scheme provided by the invention has at least one or more of the following beneficial effects:
the invention provides a method and a device for reconstructing a corpus, comprising the following steps: predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result; determining a confusion matrix corresponding to the corpus based on the prediction result; determining the confusion degree between knowledge base names in the corpus based on the confusion matrix; and merging sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus. The technical scheme provided by the invention can reconstruct and optimize the knowledge base by an automatic discrimination technology, ensures the reliability of the corpus, can discriminate the knowledge of the unknown corpus and expands the existing knowledge base.
Drawings
FIG. 1 is a flow chart illustrating main steps of a corpus reconstruction method according to an embodiment of the present invention;
fig. 2 is a main block diagram of a corpus reconstruction device according to an embodiment of the present invention.
Detailed Description
The following describes the embodiments of the present invention in further detail with reference to the drawings.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As disclosed in the background, in the artificial intelligence era, mainstream algorithm models are trained on data. The quality of the corpus used to train a deep learning model is therefore critical to the model's performance and reliability. The value of high-quality data is mainly reflected in ensuring the accuracy and generalization of the model: 1) Deep learning models require a large amount of diverse and representative data to perform well. A high-quality corpus enables the model to learn complex patterns and make accurate predictions. Conversely, if the training data is of low quality (containing errors, bias, or too little variation), the model may produce inaccurate or biased results. 2) Generalization is embodied in a deep learning model learning knowledge from training data and applying it to unseen examples. By exposing the model to diverse, high-quality data, it can better understand and adapt to new, unknown examples. Effective generalization is critical to the model's performance in real-world scenarios.
At present, the construction and maintenance of a corpus mainly depend on manual labeling, which not only increases the cost of data maintenance but also degrades corpus quality, since labeling is affected by the annotators' expertise and condition, thereby reducing corpus reliability.
Some existing automatic corpus expansion techniques mainly expand a corpus by masked word replacement, but the corpus quality produced in this way is often insufficient. Alternatively, a pre-trained model is used to label newly introduced knowledge and combine it with the existing knowledge system; however, to guarantee the accuracy of the knowledge, a large model must be adopted, which in turn requires enough data and more labeling resources, and in practice such a model often lacks knowledge of the various specialized domains, so its adaptability is poor.
To address the above problems, the present invention provides a corpus reconstruction method and device, including: predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result; determining a confusion matrix corresponding to the corpus based on the prediction result; determining the confusion degree between knowledge base names in the corpus based on the confusion matrix; and merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degrees among the knowledge base names. The technical scheme provided by the invention can reconstruct and optimize the knowledge base through an automatic discrimination technique, ensuring the reliability of the corpus; it can also discriminate knowledge from an unknown corpus and expand the existing knowledge base.
The above-described scheme is explained in detail below.
Example 1
Referring to fig. 1, fig. 1 is a schematic flow chart of main steps of a corpus reconstruction method according to an embodiment of the present invention. As shown in fig. 1, the method for reconstructing a corpus in the embodiment of the present invention mainly includes the following steps:
step S101: predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result;
step S102: determining a confusion matrix corresponding to the corpus based on the prediction result;
step S103: determining the confusion degree between knowledge base names in the corpus based on the confusion matrix;
step S104: and merging sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
In this embodiment, the training process of the pre-trained prediction model includes:
splitting sample data in a corpus into a plurality of sample data sets;
the initial predictive model is trained using the plurality of sample data sets.
In one embodiment, the training of the initial predictive model using the plurality of sample data sets includes:
step 1, initializing k=1;
step 2, training an initial prediction model by taking a kth sample data set as test data and taking the rest sample data sets in the plurality of sample data sets as training data;
step 3, judging whether k is equal to K, if yes, outputting a prediction model, otherwise, enabling k=k+1 and returning to the step 2;
where K is the total number of sample data sets in the plurality of sample data sets.
For example: assuming K=3, the data is divided equally into three parts A, B and C. First, a lightweight model is trained using the A+B data to obtain model(A+B), and C is predicted with model(A+B); since C does not appear in the training corpus of model(A+B), the predictions of model(A+B) on C are reliable and objective. By analogy, A can be predicted by model(B+C), and so on. Repeating this procedure K times covers the whole knowledge base with predictions. In practice, K can be flexibly adjusted according to the corpus size and the available training resources, so the method adapts to any scenario.
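The K-fold cross-prediction loop described above can be sketched as follows. This is a minimal illustration, not the patented implementation: `train` and `predict` are placeholder callables standing in for training an initial prediction model (e.g. a TextCNN) and running inference with it, and the round-robin split is just one possible way to form the K sample data sets.

```python
def kfold_cross_predict(samples, labels, K, train, predict):
    """Cover the whole corpus with out-of-fold predictions.

    For k = 1..K, fold k is held out as test data, the remaining
    K-1 folds serve as training data, and the held-out fold is then
    predicted by the freshly trained model, so every sample receives
    a prediction from a model that never saw it during training.
    """
    n = len(samples)
    folds = [list(range(k, n, K)) for k in range(K)]  # round-robin split
    predictions = [None] * n
    for k in range(K):
        held_out = set(folds[k])
        train_idx = [i for i in range(n) if i not in held_out]
        model = train([samples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        for i in folds[k]:
            predictions[i] = predict(model, samples[i])
    return predictions
```

With K=3 this reproduces the A/B/C example: fold C is predicted by a model trained on A+B, and so on, until the whole corpus is covered.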
In this embodiment, the element value in the ith row and jth column of the confusion matrix is B_ij, where B_ij is the number of sample data items, among the sample data corresponding to the ith knowledge base name, that the prediction model predicts as the jth knowledge base name; i, j ∈ [1, N], and N is the total number of knowledge base names in the corpus.
For example: the confusion matrix for a corpus containing the three knowledge bases cat, dog and pig is as follows:
however, after obtaining the confusion matrix, how to judge whether the coincidence exists between different knowledge bases is also a problem because of the lack of relevant judgment marks. To solve this problem, the present solution proposes to measure the degree of confusion by using a confusion normalization technique. Specifically, the confusion degree between knowledge base names in the corpus is score ij =B ij /∑ j∈[1,N] B ij Wherein, score ij Is the confusion between the ith knowledge base name and the jth knowledge base name in the corpus.
In this embodiment, merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree between the knowledge base names includes:
and when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus belongs to a preset range, merging sample data corresponding to the ith knowledge base name into sample data of the jth knowledge base name.
In one embodiment, the predetermined range is [0.1,0.5].
In this embodiment, merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree between the knowledge base names includes:
when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus exceeds a preset value, merging the sample data corresponding to the ith knowledge base name with the sample data of the jth knowledge base name, and defining a new knowledge base name for the resulting sample data set.
In one embodiment, the preset value is 0.5.
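The two merge rules can be illustrated with the following sketch, using the thresholds [0.1, 0.5] and 0.5 from the embodiment. The dict-based corpus representation and the derived name for the fused base are assumptions for illustration only, not the patent's data model.

```python
def merge_bases(corpus, score, names, low=0.1, high=0.5):
    """Apply the two merge rules to a corpus.

    corpus: dict mapping knowledge base name -> list of samples.
    score:  row-normalized confusion degrees, score[i][j].
    If low <= score[i][j] <= high, base i's samples are absorbed into
    base j; if score[i][j] > high, both bases are fused under a new name.
    """
    merged = {name: list(samples) for name, samples in corpus.items()}
    for i, ni in enumerate(names):
        for j, nj in enumerate(names):
            if i == j or ni not in merged or nj not in merged:
                continue  # skip the diagonal and already-merged bases
            s = score[i][j]
            if s > high:
                # High overlap: fuse both bases and define a new name.
                merged[ni + "+" + nj] = merged.pop(ni) + merged.pop(nj)
            elif low <= s <= high:
                # Moderate overlap: fold base i's samples into base j.
                merged[nj] += merged.pop(ni)
    return merged
```

Note the high-overlap rule is checked first, mirroring the order in which the embodiment states the two cases.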
In this embodiment, the method further includes:
dividing sample data in a corpus into training data and test data according to a preset proportion;
training the initial prediction model by using the training data and the test data to obtain a data classification model;
and taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
For example, the merged and reconstructed data is split into a training set and a test set in an 8:2 ratio, and a model capable of understanding the newly merged data is retrained; this model is used to perceive and identify new data, with which the knowledge base can be augmented.
For a batch of new data, the data can be labeled and classified by the model trained in step 1, which perceives all the data, and then merged into the existing knowledge base. To ensure the accuracy of the knowledge, a classification-threshold scheme is adopted to purify it: when the classification score is lower than 0.5, the sample is placed under the "unknown" category (meaning the corpus does not belong to any currently known knowledge base) and is merged into the knowledge system only after manual combing.
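The threshold-based routing of new data described above can be sketched as follows. This is a hedged illustration: `classify` is a placeholder for the retrained data classification model, assumed here to return a (knowledge base name, confidence) pair; samples scoring below the 0.5 threshold are routed to an "unknown" bucket for manual review instead of being merged automatically.

```python
def route_new_data(new_samples, classify, threshold=0.5):
    """Split new samples into accepted-per-base and unknown buckets.

    classify(sample) -> (base_name, confidence); samples with
    confidence >= threshold are merged into the named base, the rest
    are held out as not belonging to any currently known base.
    """
    accepted, unknown = {}, []
    for sample in new_samples:
        name, confidence = classify(sample)
        if confidence >= threshold:
            accepted.setdefault(name, []).append(sample)
        else:
            unknown.append(sample)  # queued for manual combing
    return accepted, unknown
```

Anything landing in the unknown bucket would, per the embodiment, define new knowledge only after a human has combed through it.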
In this embodiment, the initial prediction model is a TextCNN model, a DPCNN model, or a HAN model.
Example 2
Based on the same inventive concept, the invention also provides a device for reconstructing a corpus, as shown in fig. 2, wherein the device for reconstructing the corpus comprises:
the prediction module is used for predicting sample data in the corpus by utilizing a pre-trained prediction model to obtain a prediction result;
the first determining module is used for determining a confusion matrix corresponding to the corpus based on the prediction result;
the second determining module is used for determining the confusion degree among knowledge base names in the corpus based on the confusion matrix;
and the merging module is used for merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
Preferably, the training process of the pre-trained prediction model includes:
splitting sample data in a corpus into a plurality of sample data sets;
the initial predictive model is trained using the plurality of sample data sets.
Further, the training the initial predictive model using the plurality of sample data sets includes:
step 1, initializing k=1;
step 2, training an initial prediction model by taking a kth sample data set as test data and taking the rest sample data sets in the plurality of sample data sets as training data;
step 3, judging whether k is equal to K, if yes, outputting a prediction model, otherwise, enabling k=k+1 and returning to the step 2;
where K is the total number of sample data sets in the plurality of sample data sets.
Preferably, the element value in the ith row and jth column of the confusion matrix is B_ij, where B_ij is the number of sample data items, among the sample data corresponding to the ith knowledge base name, that the prediction model predicts as the jth knowledge base name; i, j ∈ [1, N], and N is the total number of knowledge base names in the corpus.
Further, the confusion degree between knowledge base names in the corpus is score_ij = B_ij / Σ_{j∈[1,N]} B_ij, where score_ij is the confusion degree between the ith knowledge base name and the jth knowledge base name, i.e., B_ij normalized by the sum of the ith row of the confusion matrix.
Preferably, the merging module is specifically configured to:
and when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus belongs to a preset range, merging sample data corresponding to the ith knowledge base name into sample data of the jth knowledge base name.
Further, the preset range is [0.1,0.5].
Preferably, the merging module is specifically configured to:
when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus exceeds a preset value, merge the sample data corresponding to the ith knowledge base name with the sample data of the jth knowledge base name, and define a new knowledge base name for the resulting sample data set.
Further, the preset value is 0.5.
Preferably, the method further comprises:
the segmentation module is used for segmenting sample data in the corpus into training data and test data according to a preset proportion;
the training module is used for training the initial prediction model by utilizing the training data and the test data to obtain a data classification model;
and the amplification module is used for taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
Further, the initial prediction model is a TextCNN model, a DPCNN model or a HAN model.
Example 3
Based on the same inventive concept, the invention also provides a computer device comprising a processor and a memory, the memory being used for storing a computer program comprising program instructions and the processor being used for executing the program instructions stored in the computer storage medium. The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. As the computational and control core of the terminal, it is adapted to implement one or more instructions, and in particular to load and execute one or more instructions in a computer storage medium to realize the corresponding method flow or corresponding functions, i.e., the steps of the corpus reconstruction method in the above embodiments.
Example 4
Based on the same inventive concept, the present invention also provides a storage medium, in particular, a computer readable storage medium (Memory), which is a Memory device in a computer device, for storing programs and data. It is understood that the computer readable storage medium herein may include both built-in storage media in a computer device and extended storage media supported by the computer device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer readable storage medium herein may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory. One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to implement the steps of a method for corpus reconstruction in the above embodiments.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalent substitutions may still be made to the specific embodiments of the invention without departing from its spirit and scope, and all such modifications and equivalents are intended to be covered by the claims.

Claims (15)

1. A method for reconstructing a corpus, the method comprising:
predicting sample data in a corpus by using a pre-trained prediction model to obtain a prediction result;
determining a confusion matrix corresponding to the corpus based on the prediction result;
determining the confusion degree between knowledge base names in the corpus based on the confusion matrix;
and merging sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
2. The method of claim 1, wherein the training process of the pre-trained predictive model comprises:
splitting sample data in a corpus into a plurality of sample data sets;
the initial predictive model is trained using the plurality of sample data sets.
3. The method of claim 2, wherein training the initial predictive model with the plurality of sample data sets comprises:
step 1, initializing k=1;
step 2, training an initial prediction model by taking a kth sample data set as test data and taking the rest sample data sets in the plurality of sample data sets as training data;
step 3, judging whether k is equal to K, if yes, outputting a prediction model, otherwise, enabling k=k+1 and returning to the step 2;
where K is the total number of sample data sets in the plurality of sample data sets.
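The loop of claims 2 and 3 is a K-fold procedure in which each sample data set serves once as test data. A minimal sketch in Python, assuming the corpus has already been split into K sets and a hypothetical `train_fn(train_data, test_data)` interface that returns a trained model:

```python
def k_fold_train(sample_sets, train_fn):
    """Train an initial prediction model over K folds (claims 2-3).

    sample_sets: list of K sample data sets split from the corpus.
    train_fn: hypothetical callable(train_data, test_data) -> model.
    The k-th set is used as test data and the remaining sets as
    training data; after the K-th fold, the model is output (step 3).
    """
    model = None
    K = len(sample_sets)
    for k in range(K):  # step 1: start at the first fold
        test_data = sample_sets[k]  # step 2: k-th set is the test data
        train_data = [x for i, fold in enumerate(sample_sets)
                      if i != k for x in fold]
        model = train_fn(train_data, test_data)
    return model  # step 3: k has reached K, output the prediction model
```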
4. The method of claim 1, wherein the element in the ith row and jth column of the confusion matrix is B_ij, where B_ij is the number of sample data predicted as the jth knowledge base name when the prediction model predicts the sample data corresponding to the ith knowledge base name, i, j ∈ [1, N], and N is the total number of knowledge base names in the corpus.
5. The method of claim 4, wherein the confusion degree between knowledge base names in the corpus is score_ij = B_ij / Σ_{j∈[1,N]} B_ij, where score_ij is the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus.
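Claim 5's confusion degree is a row normalization of claim 4's count matrix. A minimal sketch, assuming the matrix is given as plain Python lists of counts:

```python
def confusion_degree(B):
    """Compute score_ij = B_ij / (sum over j of B_ij) (claim 5).

    B[i][j] counts samples of the ith knowledge base name that the
    prediction model labeled as the jth name (claim 4); each row is
    normalized by its total, so the degrees in a row sum to 1.
    """
    return [[b / sum(row) for b in row] for row in B]
```

For example, a row of counts [8, 2] yields confusion degrees [0.8, 0.2].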
6. The method of claim 1, wherein the merging sample data corresponding to each knowledge base name in the corpus based on a degree of confusion between knowledge base names in the corpus comprises:
and when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus belongs to a preset range, merging sample data corresponding to the ith knowledge base name into sample data of the jth knowledge base name.
7. The method of claim 6, wherein the preset range is [0.1, 0.5].
8. The method of claim 1, wherein the merging sample data corresponding to each knowledge base name in the corpus based on a degree of confusion between knowledge base names in the corpus comprises:
when the confusion degree between the ith knowledge base name and the jth knowledge base name in the corpus exceeds a preset value, merging sample data corresponding to the ith knowledge base name with sample data of the jth knowledge base name, and defining a new knowledge base name by the obtained new sample data set.
9. The method of claim 8, wherein the preset value is 0.5.
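Claims 6 through 9 describe two merge rules keyed to the confusion degree: merge the ith base's samples into the jth when the degree falls in the preset range, or merge both into a newly named base when it exceeds the preset value. A sketch of that decision, using the thresholds of claims 7 and 9 and illustrative return labels:

```python
def merge_decision(score_ij, low=0.1, high=0.5):
    """Choose a merge action for knowledge bases i and j (claims 6-9).

    Returns 'merge_i_into_j' when score_ij lies within [low, high]
    (claims 6-7), 'merge_into_new_base' when score_ij exceeds high
    (claims 8-9), and 'keep_separate' otherwise. The label strings
    are illustrative, not part of the claims.
    """
    if score_ij > high:
        return 'merge_into_new_base'
    if low <= score_ij <= high:
        return 'merge_i_into_j'
    return 'keep_separate'
```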
10. The method as recited in claim 1, further comprising:
dividing sample data in a corpus into training data and test data according to a preset proportion;
training the initial prediction model by using the training data and the test data to obtain a data classification model;
and taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
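Claim 10's first step, splitting the corpus into training and test data at a preset proportion, can be sketched as follows (the ratio and seed are illustrative assumptions):

```python
import random

def split_corpus(samples, train_ratio=0.8, seed=0):
    """Split sample data into training and test sets at a preset
    proportion (claim 10); shuffling keeps the split unbiased."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```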
11. The method according to claim 3 or 10, wherein the initial prediction model is a TextCNN model, a DPCNN model or a HAN model.
12. A device for reconstructing a corpus, the device comprising:
the prediction module is used for predicting sample data in the corpus by utilizing a pre-trained prediction model to obtain a prediction result;
the first determining module is used for determining a confusion matrix corresponding to the corpus based on the prediction result;
the second determining module is used for determining the confusion degree among knowledge base names in the corpus based on the confusion matrix;
and the merging module is used for merging the sample data corresponding to the knowledge base names in the corpus based on the confusion degree among the knowledge base names in the corpus.
13. The apparatus as recited in claim 12, further comprising:
the segmentation module is used for segmenting sample data in the corpus into training data and test data according to a preset proportion;
the training module is used for training the initial prediction model by utilizing the training data and the test data to obtain a data classification model;
and the amplification module is used for taking the data to be classified as the input of the data classification model, obtaining the knowledge base name corresponding to the data to be classified output by the data classification model, and merging the data to be classified into sample data of the corresponding knowledge base name.
14. A computer device, comprising:
one or more processors; and
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for reconstructing a corpus according to any one of claims 1 to 11.
15. A computer readable storage medium, characterized in that a computer program is stored thereon, which computer program, when executed, implements a method of reconstructing a corpus according to any of claims 1 to 11.
CN202311360974.6A 2023-10-20 2023-10-20 Corpus reconstruction method and device Pending CN117114103A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311360974.6A CN117114103A (en) 2023-10-20 2023-10-20 Corpus reconstruction method and device

Publications (1)

Publication Number Publication Date
CN117114103A true CN117114103A (en) 2023-11-24

Family

ID=88796923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311360974.6A Pending CN117114103A (en) 2023-10-20 2023-10-20 Corpus reconstruction method and device

Country Status (1)

Country Link
CN (1) CN117114103A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075368A1 (en) * 2016-09-12 2018-03-15 International Business Machines Corporation System and Method of Advising Human Verification of Often-Confused Class Predictions
CN109101579A (en) * 2018-07-19 2018-12-28 深圳追科技有限公司 customer service robot knowledge base ambiguity detection method
CN113407694A (en) * 2018-07-19 2021-09-17 深圳追一科技有限公司 Customer service robot knowledge base ambiguity detection method, device and related equipment
CN112000808A (en) * 2020-09-29 2020-11-27 迪爱斯信息技术股份有限公司 Data processing method and device and readable storage medium
CN113934851A (en) * 2021-11-25 2022-01-14 和美(深圳)信息技术股份有限公司 Data enhancement method and device for text classification and electronic equipment
CN115879448A (en) * 2022-01-27 2023-03-31 北京中关村科金技术有限公司 Corpus classification method and device, computer readable storage medium and electronic equipment
CN114528933A (en) * 2022-02-16 2022-05-24 西安交通大学 Data unbalance target identification method, system, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination