CN113761880B

CN113761880B - Data processing method for text verification, electronic equipment and storage medium

Info

Publication number: CN113761880B
Application number: CN202111310983.5A
Authority: CN
Inventors: 刘远; 陈旻晖
Original assignee: Clp Suzhou Shared Services Co ltd; Beijing Zhongdian Huizhi Technology Co ltd
Current assignee: Clp Suzhou Shared Services Co ltd; Beijing Zhongdian Huizhi Technology Co ltd
Priority date: 2021-11-08
Filing date: 2021-11-08
Publication date: 2022-03-04
Anticipated expiration: 2041-11-08
Also published as: CN113761880A

Abstract

The invention relates to a data processing method, electronic equipment and a storage medium for text verification, wherein the method comprises the following steps: obtaining a sample text list from a text database, marking a keyword position of the sample text as a designated initial position and a finishing position of the sample text as a designated finishing position when a keyword consistent with any preset keyword in the preset keyword list exists in any sample text, and taking a speech segment between the designated initial position and the designated finishing position as a target speech segment, and taking the sample text based on the existing target speech segment as training set data to construct a training set; inputting the training set into a preset language model for training to obtain a trained language model; and acquiring the knowledge graph of the target text through the trained language model so as to compare the knowledge graph with preset verification data. The method and the device can improve the accuracy and efficiency of comparing the structured text data with the semi-structured text data.

Description

Data processing method for text verification, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of data processing, in particular to a data processing method for text verification, electronic equipment and a storage medium.

Background

In the prior art, text data is divided into three types: structured text data, random text data and semi-structured text data; in the structured text data, the text data at a specific position has a specific meaning and is easy to be converted into a table structure in a relational database, such as text data in a cvs format, invoice text data after OCR processing or settlement statement data in a specific field of a power system; in the random text data, the text data at each text position has a random meaning, for example, the text data of literary works such as news, novels, prose and the like spread on the internet; the semi-structured text data is intermediate between the structured text data and the random text data, and text data at a specific position may have a specific meaning, but is difficult to be converted into a table structure in a relational database, for example, settlement terms in a contract in a specific field such as an electric power system, and the like.

In some application scenarios, especially in a settlement auditing scenario of an electric power system, structured text data and semi-structured text data need to be compared, that is, whether the structured data in a settlement document meets the requirements of the semi-structured settlement terms in a contract or not is judged, but because the semi-structured text data is difficult to be converted into a table structure of a relational database, the efficiency and accuracy of data comparison are low due to the fact that the semi-structured text data is compared in a manual mode in the prior art, and the data verification process is affected.

Disclosure of Invention

In order to solve the above technical problems, the present application adopts a technical solution of a data processing method, an electronic device, and a storage medium for text verification, where the method includes the steps of:

s100, acquiring m first texts from a first text set of a text database as sample texts, and constructing a sample text list A = (A)₁，A₂，A₃，……，A_m），A_iI =1 … … m, and when A is the ith sample text_iWhen the keyword exists in the preset keyword list, the A is matched with any preset keyword in the preset keyword list_iIs marked as specifying a starting position and A_iIs marked as a specified end position, and the speech segment between the specified start position and the specified end position is taken as A_iBased on the presence of A of the target language fragment_iConstructing a training set as training set data;

s200, inputting the training set into a preset language model for training to obtain a trained language model;

s300, obtaining a target text, inputting the target text into a trained language model, and obtaining a target data list B = (B) corresponding to the target text₁，B₂，B₃，……，B_n），B_jJ =2 … … n, n is the target data number, and each B in B is_jWith a number of preset ternary groupsThe frame is used for acquiring a target knowledge graph corresponding to the target text;

s400, acquiring a text ID of a target text, acquiring all verification data corresponding to the text ID of the target text from a verification data list according to the text ID of the target text, and constructing a first intermediate data list by taking each verification data as first intermediate data;

s500, traversing the target knowledge graph, and replacing any target data in the target knowledge graph with corresponding target data when the target data is inconsistent with the corresponding first intermediate data in the first intermediate data list.

The present invention also provides a non-transitory computer-readable storage medium that can be configured in an electronic device to store at least one instruction or at least one program for implementing a method of the method embodiments, where the at least one instruction or the at least one program is loaded by a processor and executed to implement the method provided by the above embodiments.

The invention also provides an electronic device comprising a processor and the aforementioned non-transitory computer-readable storage medium.

Compared with the prior art, the invention has obvious advantages and beneficial effects. By the technical scheme, the data processing system for acquiring the target position can achieve considerable technical progress and practicability, has wide industrial utilization value and at least has the following advantages:

the method comprises the steps of obtaining a sample text list, marking a keyword position of the sample text as a designated initial position and marking an end position of the sample text as a designated end position when a keyword consistent with any preset keyword in the preset keyword list exists in the sample text, taking a speech segment between the designated initial position and the designated end position as a target speech segment of the sample text, and taking the sample text based on the target speech segment as training set data to construct a training set; inputting the training set into a preset language model for training to obtain a trained language model;

the language model is optimized, the target language segment capable of extracting the data with the specific meaning can be accurately and efficiently determined, the extraction of the full text data and the interference of other data are reduced, and the comparison of the data in the text is facilitated;

meanwhile, inputting a target text into a trained language model, acquiring a characteristic value list corresponding to the target text, and acquiring a target knowledge graph corresponding to the target text by constructing each characteristic value by a plurality of preset triples; the data in the semi-structured text can be stored in a knowledge map form, the storage mode is optimized, the comparison of the data in the text is facilitated, and the efficiency and the accuracy of the verification of the structured text data and the semi-structured text data are improved.

The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical means of the present invention more clearly understood, the present invention may be implemented in accordance with the content of the description, and in order to make the above and other objects, features, and advantages of the present invention more clearly understood, the following preferred embodiments are described in detail with reference to the accompanying drawings.

Drawings

Fig. 1 is a flowchart of a data processing method for text verification according to an embodiment of the present invention.

Detailed Description

To further illustrate the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description will be given with reference to the accompanying drawings and preferred embodiments of a data processing method, an electronic device and a storage medium for text verification according to the present invention.

The embodiment of the invention provides a data processing method for text verification, which further comprises the following steps, as shown in fig. 1:

s100, acquiring m first texts from a first text set of a text database as sample texts, and constructing a sample text list A = (A)₁，A₂，A₃，……，A_m），A_iI =1 … … m, and when the text is the ith sample textA_iWhen the keyword exists in the preset keyword list, the A is matched with any preset keyword in the preset keyword list_iIs marked as specifying a starting position and A_iIs marked as a specified end position, and the speech segment between the specified start position and the specified end position is taken as A_iBased on the presence of A of the target language fragment_iAs training set data, a training set is constructed.

Specifically, the method further includes the following steps before the step S100:

the text types of all the first texts are obtained, and the first texts of the same type are classified according to preset text division rules to construct a plurality of first text sets.

Preferably, the text division rule refers to a preset rule for dividing the text by the text type of the first text, where the text type of the first text is, for example, a purchase text, a statistical text, or an order text.

Specifically, the first text is a text storing semi-structured text data, wherein all sample texts in a constructed based on the first text set are texts of the same type, so that a preset language model can be conveniently trained, the accuracy of model training is improved, and the accuracy and efficiency of comparison between the structured text data and the semi-structured text data are improved.

Specifically, in step S100, A_iThe keywords in the text are determined by a natural language processing method, and the keywords can be extracted from the sample text to determine the language segments capable of obtaining the key data, so that the accuracy and efficiency of comparing the structured text data with the semi-structured text data are improved.

Preferably, the preset keyword list is a preset keyword list, and the keyword list field includes a keyword corresponding to a text type of any of the first texts, which can be understood as follows: at step S100In step, traverse A_iAnd according to A_iText type, obtaining A from preset keyword list_iAll preset keywords corresponding to the text type are used as target keywords to obtain A_iThe comparison of the keywords with all the target keywords can be facilitated, the comparison of the keywords in the sample text can be facilitated, the language segment capable of obtaining the key data can be determined, and the accuracy and the efficiency of the comparison of the structured text data and the semi-structured text data are improved.

Specifically, the key data refers to data with a local special meaning in the sample text, and the special meaning needs to be determined according to the text type, which is not described herein again.

S200, inputting the training set into a preset language model for training to obtain a trained language model.

Specifically, the step S200 further includes the steps of:

s201, concentrating the training set A_iInputting the data into a preset language model to obtain A_iCorresponding key data are constructed into a key data list S_iIn this embodiment, a method for obtaining a feature value by any language model in the art may be adopted, which is not described herein again;

s203, obtaining A_iCorresponding text ID, and according to A_iCorresponding text ID, obtaining A from the verification data list_iConstructing a second intermediate data list by using all the verification data of the corresponding text ID as second intermediate data;

s205, traverse A_iCorresponding key data list and according to said A_iCorresponding Key data List and A_iAnd determining the probability value F of the A by the corresponding second intermediate data list, wherein the F meets the following conditions:

wherein S is_iIs the A_iThe amount of critical data in the corresponding list of critical data,

is the A_iThe number of data in the corresponding key data list which is inconsistent with the corresponding second intermediate data in the second intermediate data list;

s207, traversing A, and obtaining a trained language model when F is larger than or equal to a preset probability threshold;

s209, when F is less than the preset probability threshold, the sample text list is obtained again

According to

Iterating until F is larger than or equal to a preset probability threshold value to obtain a trained language model, wherein the iteration process is based on

After the step S100 processing is executed, the method reacquires

The corresponding probability process is not described herein again.

Further, the text ID refers to a unique identification for identifying the text.

Preferably, the language model is a Bert model.

Preferably, in the step S209,

the same sample text as a can be further understood as: requiring re-acquisition when retraining the language model

Is the same text type as A, and

including A_iCorresponding probability F_iSample text < preset probability threshold and not including A_iCorresponding probability F_iSample text of a probability threshold value, wherein F_iThe following conditions are met:

。

further, the probability threshold range is 90-98%, preferably, the probability threshold is 90%.

In another specific embodiment, the method comprises the following steps:

obtaining the same sample text list A, and collecting A in the training set_iInputting the data into a preset language model to obtain A_iCorresponding key data are constructed into a key data list;

obtaining A_iCorresponding text ID, and according to A_iCorresponding text ID, obtaining A from the verification data list_iConstructing a second intermediate data list by using all the verification data of the corresponding text ID as second intermediate data;

traverse A_iCorresponding key data list and according to said A_iCorresponding Key data List and A_iCorresponding to the second intermediate data list, determining the probability value of A

。

As can be seen from the large amount of experimental data obtained by the method of the above embodiment, in the case of using the same sample text list,

compared with F, the number of the target language segments is reduced by at least 10%, namely F corresponding to target language segment marking of the sample text is not reduced by 10% and F corresponding to target language segment marking of the sample text, so that the method can further explain that the extraction of the full text data and the interference of other data are reduced by checking the determination of the target language segments in the implementation, and the comparison of the data in the text is facilitated.

S300, obtaining the eyesMarking a text and inputting the target text into a trained language model, and acquiring a target data list B = (B) corresponding to the target text₁，B₂，B₃，……，B_n），B_jJ =2 … … n, n is the target data number, and each B in B is_jAnd acquiring a target knowledge graph corresponding to the target text by using a plurality of preset triple frameworks.

Specifically, the step S300 further includes the steps of:

all B are_jInserting the target texts into each preset triple framework as entities to construct a plurality of knowledge graphs of the target texts, and inserting the maximum quantity B into the knowledge graphs of the target texts_jThe target knowledge graph is understood as follows: each text type of the first text corresponds to a plurality of preset triad frameworks, and B is used for determining the type of the first text_jThe constructed knowledge graph is used as a target knowledge graph, so that a suitable knowledge graph can be quickly constructed to store data, and meanwhile, comparison between the knowledge graph and verification data is facilitated, namely comparison between semi-structured text data and structured text data; the target data refers to data with special meaning in the target text, and the special meaning needs to be determined according to the text type, which is not described herein again.

Specifically, the target text refers to any first text in the text database except the sample text, and the target text is consistent with the text type of the sample text in the training set used for training the language model, which can be understood as: the target text is consistent with the text types of all sample texts in A, and meanwhile, the target text does not need to mark the starting position of a speech fragment.

S400, acquiring the text ID of the target text, acquiring all verification data corresponding to the text ID of the target text from a verification data list according to the text ID of the target text, and constructing a first intermediate data list by taking each verification data as first intermediate data.

Specifically, the step S400 further includes the steps of:

according to the text ID of the first text, a plurality of second texts corresponding to the text ID of the first text are obtained from a text database, all the second texts are preprocessed, designated data are obtained from the second texts and serve as verification data of the first text, a verification data list is constructed according to the verification data of all the first texts and the text ID of the first text, the second text is a text which records data corresponding to the data used for verifying the first text, and the second text is a structured text.

Specifically, the step S500 further includes the steps of:

s501, traversing the target knowledge graph and acquiring target data corresponding to each entity in a target triple framework from the target knowledge graph, wherein the target triple framework in the step S501 refers to the triple framework corresponding to the target knowledge graph;

s502, according to the entity of the target triple structure, obtaining first intermediate data corresponding to the entity from the first intermediate data list, which can be understood as: the entity in the target triple structure is a field name in a check data list;

s503, comparing the target data with the corresponding first intermediate data;

and S505, when the target data is inconsistent with the corresponding first intermediate data, replacing the first intermediate data with the corresponding target data.

In the embodiment, the comparison of the structured data to the semi-structured data can be realized, and the efficiency and the accuracy of the verification of the structured data to the semi-structured data are improved.

The method comprises the steps that a sample text list is obtained, when a keyword consistent with any preset keyword in the preset keyword list exists in a sample text, the position of the keyword of the sample text is marked as a specified starting position, the end position of the sample text is marked as a specified end position, a speech segment between the specified starting position and the specified end position is used as a target speech segment of the sample text, and the sample text based on the target speech segment exists as training set data to construct a training set; the training set is input into a preset language model for training to obtain a trained language model, so that the language model is optimized, a target language segment capable of extracting specific meaning data can be accurately and efficiently determined, extraction of full text data and interference of other data are reduced, and comparison of data in a text is facilitated.

Meanwhile, the target text is input into the trained language model, the characteristic value list corresponding to the target text is obtained, each characteristic value is constructed by a plurality of preset triples, the target knowledge graph corresponding to the target text is obtained, data in the semi-structured text can be stored in the form of the knowledge graph, the storage mode is optimized, comparison of the data in the text is facilitated, and the efficiency and accuracy of data verification are improved.

Embodiments of the present application also provide a non-transitory computer-readable storage medium that can be disposed in an electronic device to store at least one instruction or at least one program for implementing a method of the method embodiments, where the at least one instruction or the at least one program is loaded into and executed by a processor to implement the method provided by the above embodiments.

Embodiments of the present application also provide an electronic device comprising a processor and the aforementioned non-transitory computer-readable storage medium.

Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A data processing method for text verification, the method further comprising the steps of:

s100, acquiring m first texts from a first text set of a text database as sample texts, and constructing a sample text list A = (A)₁，A₂，A₃，……，A_m），A_iI =1 … … m, and when A is the ith sample text_iWhen the keyword exists in the preset keyword list, the A is matched with any preset keyword in the preset keyword list_iIs marked as specifying a starting position and A_iIs marked as a specified end position, and the speech segment between the specified start position and the specified end position is taken as A_iBased on the presence of A of the target language fragment_iConstructing a training set as training set data, wherein the first text refers to a text storing semi-structured data;

s200, inputting the training set into a preset language model for training to obtain a trained language model, wherein the step S200 further comprises the following steps:

s201, concentrating the training set A_iInputting the data into a preset language model to obtain A_iCorresponding key data are constructed into a key data list S_i；

s205, traverse A_iCorresponding key data list andaccording to said A_iCorresponding Key data List and A_iAnd determining the probability value F of the A by the corresponding second intermediate data list, wherein the F meets the following conditions:

According to

Performing iteration until F is larger than or equal to a preset probability threshold value to obtain a trained language model;

the step S209 includes:

may have the same sample text as A, and need to be retrieved when the language model is retrained

Is the same text type as A, and

including A_iCorresponding probability F_iSamples of < Preset probability thresholdText and not including A_iCorresponding probability F_iSample text that is greater than or equal to a preset probability threshold, wherein,

F_ithe following conditions are met:

；

s300, obtaining a target text, inputting the target text into a trained language model, and obtaining a target data list B = (B) corresponding to the target text₁，B₂，B₃，……，B_n），B_jJ =2 … … n, n is the target data number, and each B in B is_jAcquiring a target knowledge graph corresponding to the target text by using a plurality of preset triple frameworks;

s400, acquiring a text ID of a target text, acquiring all verification data corresponding to the text ID of the target text from a verification data list according to the text ID of the target text, and constructing a first intermediate data list by taking each verification data as first intermediate data, wherein the target text refers to any first text except a sample text in a text database;

wherein, the step of S400 further comprises the following steps: according to the text ID of the first text, acquiring a plurality of second texts corresponding to the text ID of the first text from a text database, preprocessing all the second texts, acquiring designated data from the second texts to serve as verification data of the first text, and constructing a verification data list according to the verification data of all the first texts and the text ID of the first text, wherein the second text is a text recorded with data corresponding to the data for verifying the first text, and the second text is a structured text;

2. The data processing method for text verification according to claim 1, wherein in step S100, a_iThe keywords in (1) are determined by a natural language processing method.

3. The data processing method for text verification according to claim 1, further comprising the following steps in the step S300:

all B are_jInserting the target texts into each preset triple framework as entities to construct a plurality of knowledge graphs of the target texts, and inserting the maximum quantity B into the knowledge graphs of the target texts_jThe target knowledge graph is used as the knowledge graph of (1).

4. The data processing method for text verification according to claim 1, wherein the target text refers to any first text in the text database except the sample text.

5. The data processing method for text verification according to claim 1, further comprising the following steps in the step S400:

according to the text ID of the first text, a plurality of second texts corresponding to the text ID of the first text are obtained from a text database, all the second texts are preprocessed, key data are extracted to serve as verification data of the first text, and a verification data list is constructed according to the verification data of all the first texts and the text ID of the first text.

6. The data processing method for text verification according to claim 5, wherein the second text is a text corresponding to the data recorded for verifying the first text.

7. A non-transitory computer readable storage medium having stored therein at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement the method of any of claims 1-6.

8. An electronic device comprising a processor and the non-transitory computer readable storage medium of claim 7.