CN115048925B

CN115048925B - Data processing system for determining abnormal text

Info

Publication number: CN115048925B
Application number: CN202210976335.1A
Authority: CN
Inventors: 张正义; 林方; 傅晓航; 常鸿宇
Original assignee: Zhongke Yuchen Technology Co Ltd
Current assignee: Zhongke Yuchen Technology Co Ltd
Priority date: 2022-08-15
Filing date: 2022-08-15
Publication date: 2022-11-04
Anticipated expiration: 2042-08-15
Also published as: CN115048925A

Abstract

The invention provides a data processing system for determining abnormal texts, which comprises: a database, a processor and a memory storing a computer program which, when executed by the processor, performs the steps of: acquiring an initial statement list and initial characters according to the initial text; acquiring an entity relationship probability list and a target entity relationship probability list; acquiring the maximum probability value of the target entity relationship; acquiring the priority corresponding to the initial text; when the priority is greater than or equal to a preset priority threshold, determining the initial text as a target text; otherwise, determining the initial text as an abnormal text. On one hand, in the process of processing the text, the entity is extracted by only using one preset model, so that the used text data is less, the workload of marking personnel is reduced, and on the other hand, in the process of extracting the entity relationship, the entity relationship is extracted by using a plurality of methods, so that the accuracy of the model for predicting the entity relationship is improved.

Description

Data processing system for determining abnormal text

Technical Field

The invention relates to the technical field of text processing, in particular to a data processing system for determining abnormal texts.

Background

Most existing methods for determining abnormal text check whether the entities in a text meet a preset condition: the text to be recognized is input into an entity recognition model, various entities are extracted from the text and processed, and the text is determined to be an abnormal text when an entity does not meet the preset condition.

An existing entity relationship extraction method comprises the following steps: determining a distributed sentence vector for each sentence in a target sentence bag and a distributed relation vector for the query relation, where each sentence in the target sentence bag contains a first entity and a second entity; determining a policy function according to the distributed sentence vectors and the distributed relation vector, and using the policy function to divide all sentences in the target sentence bag into positive examples and unlabeled examples; and training a relation extraction model with the positive examples and the unlabeled examples to obtain the entity relationship between the first entity and the second entity.

However, the above method also has the following technical problems:

First, in the process of processing the text, multiple preset models must be used to extract the entities, which requires a large amount of text data and storage space and places an excessive workload on marking personnel.

Second, in the process of extracting the entity relationship, the processing is limited to a single approach: the entity relationship can be extracted in only one way, so the prediction accuracy of the model on the entity relationship is low.

Disclosure of Invention

To address the above technical problems, the invention adopts the following technical solution:

A data processing system for determining abnormal text, the system comprising: a database, a processor, and a memory storing a computer program, wherein the database comprises an initial text list $H = \{H_1, \ldots, H_i, \ldots, H_m\}$, where $H_i$ is the $i$-th initial text, $i = 1, \ldots, m$, and $m$ is the number of initial texts, and wherein the computer program, when executed by the processor, performs the following steps:

S100, according to $H_i$, obtaining the initial sentence list $D_i = \{D_{i1}, \ldots, D_{ij}, \ldots, D_{in_i}\}$ corresponding to $H_i$, where $D_{ij} = (D^{1}_{ij}, \ldots, D^{r}_{ij}, \ldots, D^{s_j}_{ij})$, $D^{r}_{ij}$ is the $r$-th initial character of the $j$-th initial sentence of $H_i$, $j = 1, \ldots, n_i$, $n_i$ is the number of initial sentences in $H_i$, $r = 1, \ldots, s_j$, and $s_j$ is the number of initial characters in the $j$-th initial sentence.

S200, according to $D^{r}_{ij}$, obtaining the initial entity relationship list $G^{r}_{ij} = \{G^{r1}_{ij}, \ldots, G^{rx}_{ij}, \ldots, G^{rq}_{ij}\}$ corresponding to $D^{r}_{ij}$, where $G^{rx}_{ij}$ is the probability value of the $x$-th type of initial entity relationship corresponding to $D^{r}_{ij}$, $x = 1, \ldots, q$, and $q$ is the number of types of initial entity relationships.

S300, traversing $G^{r}_{ij}$ and, when the type of the initial entity relationship corresponding to $G^{rx}_{ij}$ is a non-target relationship type, deleting $G^{rx}_{ij}$ from $G^{r}_{ij}$, so as to construct the target entity relationship list $U^{r}_{ij} = \{U^{r1}_{ij}, \ldots, U^{ry}_{ij}, \ldots, U^{rp}_{ij}\}$ corresponding to $D^{r}_{ij}$, where $U^{ry}_{ij}$ is the probability value of the $y$-th type of target entity relationship corresponding to $D^{r}_{ij}$, $y = 1, \ldots, p$, and $p$ is the number of types of target entity relationships.

S400, traversing $U^{r}_{ij}$ and, when $U^{ry}_{ij} \geq U_0$, obtaining the maximum probability value of the target entity relationship from $U^{r}_{ij}$, where $U_0$ is a preset confidence threshold.

S500, determining, according to the maximum probability value of the target entity relationship, whether $H_i$ is an abnormal text.

The invention has at least the following beneficial effects:

the invention provides a data processing system for determining abnormal texts, which comprises: a database, a processor, and a memory storing a computer program, wherein the database comprises: an initial text set, which when executed by a processor, performs the steps of: acquiring an initial sentence list and initial characters corresponding to the initial text according to the initial text; acquiring a corresponding entity relationship probability list according to the initial character, and performing traversal processing on the entity relationship probability list to acquire a target entity relationship probability list corresponding to the initial character; traversing the target entity relationship probability list to obtain the maximum probability value of the target entity relationship; acquiring the priority corresponding to the initial text according to the maximum probability value of the target entity relationship; when the priority is greater than or equal to a preset priority threshold, determining the initial text as a target text; when the priority is smaller than a preset priority threshold, determining the initial text as an abnormal text; therefore, on one hand, in the process of processing the text, the entity can be extracted by using only one preset model, so that the used text data is less, and the workload of marking personnel is reduced; on the other hand, in the extraction process of the entity relationship, the entity relationship can be extracted by using various methods, so that the prediction accuracy of the model on the entity relationship is improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

Fig. 1 is a flowchart of a computer program executed by a data processing system for determining an abnormal text according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The embodiment provides a data processing system for determining an abnormal text, which includes: a database, a processor, and a memory storing a computer program, wherein the database comprises an initial text list $H = \{H_1, \ldots, H_i, \ldots, H_m\}$, where $H_i$ is the $i$-th initial text, $i = 1, \ldots, m$, and $m$ is the number of initial texts, and wherein the computer program, when executed by the processor, implements the following steps, as shown in Fig. 1:

S100, according to $H_i$, obtaining the initial sentence list $D_i = \{D_{i1}, \ldots, D_{ij}, \ldots, D_{in_i}\}$ corresponding to $H_i$, where $D_{ij} = (D^{1}_{ij}, \ldots, D^{r}_{ij}, \ldots, D^{s_j}_{ij})$, $D^{r}_{ij}$ is the $r$-th initial character of the $j$-th initial sentence of $H_i$, $j = 1, \ldots, n_i$, $n_i$ is the number of initial sentences in $H_i$, $r = 1, \ldots, s_j$, and $s_j$ is the number of initial characters in the $j$-th initial sentence.

Specifically, the initial sentence is a sentence obtained by performing sentence splitting processing on an initial text, where a person skilled in the art knows that any sentence splitting processing method in the prior art belongs to the protection scope of this embodiment, and is not described herein again.

Further, the initial character refers to any character in the initial sentence.
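As an illustration of step S100, the following minimal sketch builds the nested list $D_i$ from one initial text. The punctuation-based splitting rule and the function names `split_into_sentences` and `build_initial_sentence_list` are assumptions made for this example; the embodiment allows any sentence splitting method.

```python
import re


def split_into_sentences(text: str) -> list[str]:
    """Split after sentence-ending punctuation (an assumed, simplistic rule)."""
    parts = re.split(r"(?<=[。！？!?.])\s*", text)
    # Drop the empty string that appears after the final punctuation mark.
    return [p for p in parts if p]


def build_initial_sentence_list(text: str) -> list[list[str]]:
    """Return D_i: one list of initial characters per initial sentence."""
    return [list(sentence) for sentence in split_into_sentences(text)]


# Hypothetical initial text H_i with two sentences.
d_i = build_initial_sentence_list("张三在北京工作。李四是他的同事！")
```

Here `d_i[j][r]` plays the role of $D^{r}_{ij}$, with zero-based indices instead of the one-based indices used in the text.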

S200, according to $D^{r}_{ij}$, obtaining the initial entity relationship list $G^{r}_{ij} = \{G^{r1}_{ij}, \ldots, G^{rx}_{ij}, \ldots, G^{rq}_{ij}\}$ corresponding to $D^{r}_{ij}$, where $G^{rx}_{ij}$ is the probability value of the $x$-th type of initial entity relationship corresponding to $D^{r}_{ij}$, $x = 1, \ldots, q$, and $q$ is the number of types of initial entity relationships.

Specifically, the initial entity relationship is an association relationship between entities in a triple corresponding to the initial sentence, wherein a probability value of the entity relationship and the triple are obtained through a preset model.

Further, the preset model is an active learning model, and those skilled in the art know that any active learning model in the prior art belongs to the protection scope of the present embodiment, and will not be described herein again.

In particular, $\sum_{x=1}^{q} G^{rx}_{ij} = 1$.
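The constraint that the probability values in $G^{r}_{ij}$ sum to 1 can be illustrated with a short sketch. It assumes, purely for illustration, that the preset model emits one raw score per relationship type and that a softmax normalizes them; the text does not state how the model produces its probabilities, and `to_probability_list` is a hypothetical name.

```python
import math


def to_probability_list(scores: list[float]) -> list[float]:
    """Normalize raw scores so the result sums to 1 (softmax, an assumption)."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for q = 3 relationship types of one character.
g = to_probability_list([2.0, 0.5, -1.0])
```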

S300, traversing $G^{r}_{ij}$ and, when the type of the initial entity relationship corresponding to $G^{rx}_{ij}$ is a non-target relationship type, deleting $G^{rx}_{ij}$ from $G^{r}_{ij}$, so as to construct the target entity relationship list $U^{r}_{ij} = \{U^{r1}_{ij}, \ldots, U^{ry}_{ij}, \ldots, U^{rp}_{ij}\}$ corresponding to $D^{r}_{ij}$, where $U^{ry}_{ij}$ is the probability value of the $y$-th type of target entity relationship corresponding to $D^{r}_{ij}$, $y = 1, \ldots, p$, and $p$ is the number of types of target entity relationships.

Specifically, the non-target relationship is a relationship type indicating that no association exists between the entities.
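Step S300 amounts to filtering one probability list by relationship type. The sketch below assumes the list is held as a mapping from a relationship label to its probability value; the labels `"NA"`, `"works_at"` and `"colleague_of"` and the function name `build_target_list` are hypothetical and chosen only for this example.

```python
def build_target_list(g: dict[str, float], non_target: set[str]) -> dict[str, float]:
    """Return U: the entries of G whose relationship type is a target type."""
    return {rel: prob for rel, prob in g.items() if rel not in non_target}


# Hypothetical G for one initial character; "NA" stands for "no association".
g_example = {"NA": 0.2, "works_at": 0.7, "colleague_of": 0.1}
u_example = build_target_list(g_example, non_target={"NA"})
```

After the filter, `u_example` holds $p = 2$ entries, and its values no longer need to sum to 1.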

S400, traversing $U^{r}_{ij}$ and, when $U^{ry}_{ij} \geq U_0$, obtaining the maximum probability value of the target entity relationship from $U^{r}_{ij}$, where $U_0$ is a preset probability value threshold.

Specifically, the value of $U_0$ ranges from 0.5 to 0.6.

Preferably, $U_0$ is 0.5, which avoids the situation in which an overly high threshold causes some relationship probabilities to be deleted and data to be omitted, making the determination of the abnormal text inaccurate.
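One reading of step S400 is sketched below: keep only the entries of the target list that reach $U_0$, then take the largest of them. The mapping layout and the name `max_target_probability` are assumptions for this example, and returning `None` when nothing reaches the threshold is an illustrative choice, not something the text specifies.

```python
from typing import Optional


def max_target_probability(
    u: dict[str, float], u0: float = 0.5
) -> Optional[tuple[str, float]]:
    """Return (relationship, probability) for the largest entry with value >= u0."""
    candidates = {rel: prob for rel, prob in u.items() if prob >= u0}
    if not candidates:
        return None  # assumption: this character is not a target character
    best = max(candidates, key=candidates.get)
    return best, candidates[best]


best_pair = max_target_probability({"works_at": 0.7, "colleague_of": 0.1})
no_pair = max_target_probability({"works_at": 0.3, "colleague_of": 0.2})
```

A character for which a pair is returned corresponds to what the text later calls a target character.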

S500, determining, according to the maximum probability value of the target entity relationship, whether $H_i$ is an abnormal text.

Specifically, step S500 includes the steps of:

S501, based on the maximum probability value of the target entity relationship in $U^{r}_{ij}$, constructing the first intermediate data list $V_i = \{V_{i1}, \ldots, V_{ij}, \ldots, V_{in_i}\}$ corresponding to $H_i$, where $V_{ij} = (V^{1}_{ij}, \ldots, V^{t}_{ij}, \ldots, V^{k_j}_{ij})$, $V^{t}_{ij}$ is the probability value of the $t$-th target character in the $j$-th initial sentence of $H_i$, $t = 1, \ldots, k_j$, and $k_j$ is the number of target characters in the $j$-th initial sentence.

S503, according to $V^{t}_{ij}$, obtaining $F^{0}_{i}$, where $F^{0}_{i}$ satisfies the following condition:

[Formula image not reproduced in the source text.]

S505, when $F^{0}_{i} \geq F_0$, determining that $H_i$ is a target text, where $F_0$ is a preset priority threshold.

S507, when $F^{0}_{i} < F_0$, determining that $H_i$ is an abnormal text.
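The threshold decision in steps S505 and S507 can be sketched as follows. Because the formula for $F^{0}_{i}$ is not available in this text, the aggregation is left as a caller-supplied function; the arithmetic mean passed in the example call is only a placeholder assumption, and `classify_text` is a hypothetical name.

```python
from statistics import mean
from typing import Callable, Sequence


def classify_text(
    v_i: Sequence[Sequence[float]],
    f0: float,
    aggregate: Callable[[Sequence[float]], float],
) -> str:
    """Return "target" or "abnormal" for one initial text H_i."""
    # Flatten V_i: one probability per target character, across all sentences.
    flat = [p for sentence in v_i for p in sentence]
    if not flat:
        # Assumption: a text without any target character is treated as abnormal.
        return "abnormal"
    priority = aggregate(flat)  # stands in for F^0_i
    return "target" if priority >= f0 else "abnormal"


# Illustrative call only: `mean` is a placeholder, not the patent's formula.
label = classify_text([[0.9, 0.95], [0.85]], f0=0.8, aggregate=mean)
```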

In the process of processing the text, the entity and the data in the text are extracted by using only one preset model instead of extracting the entity and the data in the text by using multiple preset models, so that the used text data are less, and the workload of marking personnel is reduced.

In a specific embodiment, step S500 further includes the following steps:

S501, according to the maximum probability value of the target entity relationship, obtaining the priority list $F_i = \{F_{i1}, \ldots, F_{iy}, \ldots, F_{ip}\}$ corresponding to $H_i$, where $F_{iy}$ is the priority of the $y$-th type of target entity relationship corresponding to $H_i$.

Wherein, in step S501, $F_{iy}$ is obtained through the following steps:

S5011, based on the maximum probability value of the target entity relationship in $U^{r}_{ij}$, constructing the second intermediate data list $C_i = \{C_{i1}, \ldots, C_{iy}, \ldots, C_{ip}\}$ corresponding to $H_i$, where $C_{iy} = \{C^{1}_{iy}, \ldots, C^{g}_{iy}, \ldots, C^{z_y}_{iy}\}$, $C^{g}_{iy} = (C^{g1}_{iy}, \ldots, C^{ge}_{iy}, \ldots, C^{gw_g}_{iy})$, $C^{ge}_{iy}$ is the probability value of the $e$-th target character in the $g$-th initial sentence of the $y$-th type of target entity relationship corresponding to $H_i$, $g = 1, \ldots, z_y$, $z_y$ is the number of initial sentences in the $y$-th type of target entity relationship, $e = 1, \ldots, w_g$, and $w_g$ is the number of target characters in the $g$-th initial sentence.

S5013, according to $C^{ge}_{iy}$, obtaining $F_{iy}$, where $F_{iy}$ satisfies the following condition:

[Formula image not reproduced in the source text.]

S503, when $F_{ip} \geq F_0$, determining that $H_i$ is a target text, where $F_0$ is a preset priority threshold.

S505, when $F_{ip} < F_0$, determining that $H_i$ is an abnormal text.

Compared with the preceding embodiment, this embodiment determines the abnormal text category by category: the initial text is determined to be an abnormal text only when no relationship of any category exists between the entities, which improves the prediction accuracy of the model on the entity relationship.
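The per-category variant can be sketched in the same spirit. The aggregation that yields $F_{iy}$ is again left to the caller because the formula is not available in this text, and the decision rule (abnormal only if no category reaches $F_0$) follows one reading of the comparison paragraph above rather than the literal $F_{ip}$ comparison in the steps. The names `category_priorities` and `is_abnormal`, the relationship labels, and the use of `mean` are all assumptions.

```python
from statistics import mean
from typing import Callable, Mapping, Sequence

Aggregate = Callable[[Sequence[float]], float]


def category_priorities(
    c_i: Mapping[str, Sequence[Sequence[float]]], aggregate: Aggregate
) -> dict[str, float]:
    """Return F_i: one priority per relationship category that has target characters."""
    priorities = {}
    for category, sentences in c_i.items():
        flat = [p for sentence in sentences for p in sentence]
        if flat:
            priorities[category] = aggregate(flat)
    return priorities


def is_abnormal(priorities: Mapping[str, float], f0: float) -> bool:
    """Abnormal only if no category priority reaches the threshold (an assumed reading)."""
    return not any(value >= f0 for value in priorities.values())


# Hypothetical C_i with two relationship categories; `mean` is a placeholder.
f_i = category_priorities({"works_at": [[0.9, 0.9]], "colleague_of": [[0.6]]}, mean)
```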

In another embodiment, step S500 further includes the steps of:

[Formula image not reproduced in the source text.]

S507, when $F^{0}_{i} < F_0$, determining that $H_i$ is a text to be processed, and performing step S509.

S509, according to the maximum probability value of the target entity relationship of $H_i$, obtaining the priority list $F_i = \{F_{i1}, \ldots, F_{iy}, \ldots, F_{ip}\}$ corresponding to $H_i$, where $F_{iy}$ is the priority of the $y$-th type of target entity relationship corresponding to $H_i$.

Further, in step S509, $F_{iy}$ is obtained through the following steps:

S5091, based on the maximum probability value of the target entity relationship in $U^{r}_{ij}$, constructing the second intermediate data list $C_i = \{C_{i1}, \ldots, C_{iy}, \ldots, C_{ip}\}$ corresponding to $H_i$, where $C_{iy} = \{C^{1}_{iy}, \ldots, C^{g}_{iy}, \ldots, C^{z_y}_{iy}\}$, $C^{g}_{iy} = (C^{g1}_{iy}, \ldots, C^{ge}_{iy}, \ldots, C^{gw_g}_{iy})$, $C^{ge}_{iy}$ is the probability value of the $e$-th target character in the $g$-th initial sentence of the $y$-th type of target entity relationship corresponding to $H_i$, $g = 1, \ldots, z_y$, $z_y$ is the number of initial sentences in the $y$-th type of target entity relationship, $e = 1, \ldots, w_g$, and $w_g$ is the number of target characters in the $g$-th initial sentence.

S5093, according to $C^{ge}_{iy}$, obtaining $F_{iy}$, where $F_{iy}$ satisfies the following condition:

[Formula image not reproduced in the source text.]

S511, when $F_{ip} \geq F_0$, determining that $H_i$ is a target text, where $F_0$ is a preset priority threshold.

S513, when $F_{ip} < F_0$, determining that $H_i$ is an abnormal text.

Compared with the first embodiment, this embodiment reduces the probability of misjudging a target text as an abnormal text, which further improves the accuracy of the model in predicting the entity relationship; compared with the second embodiment, it improves efficiency and reduces the workload of annotating personnel.
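The two-stage flow of this embodiment (an overall check first, then a per-category check only for texts that fail it) can be sketched as below. As before, the aggregation function is supplied by the caller because the formulas are not available in this text; `classify_with_fallback`, the relationship labels, the use of `mean`, and the any-category rule in the second stage are assumptions made for illustration.

```python
from statistics import mean
from typing import Callable, Mapping, Sequence

Aggregate = Callable[[Sequence[float]], float]


def _flatten(sentences: Sequence[Sequence[float]]) -> list[float]:
    return [p for sentence in sentences for p in sentence]


def classify_with_fallback(
    v_i: Sequence[Sequence[float]],
    c_i: Mapping[str, Sequence[Sequence[float]]],
    f0: float,
    aggregate: Aggregate,
) -> str:
    """Stage 1: overall priority. Stage 2: per-category priorities for failed texts."""
    overall = _flatten(v_i)
    if overall and aggregate(overall) >= f0:
        return "target"
    # The text is now a "text to be processed": check each category separately.
    for sentences in c_i.values():
        per_category = _flatten(sentences)
        if per_category and aggregate(per_category) >= f0:
            return "target"
    return "abnormal"


# Hypothetical data: the overall value is low, but one category is strong.
result = classify_with_fallback(
    [[0.9, 0.6]], {"works_at": [[0.9]], "colleague_of": [[0.6]]}, 0.8, mean
)
```

The cheaper overall check settles most texts, and the per-category pass runs only on the remainder, which matches the efficiency argument made in the comparison above.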

Specifically, a target character is the initial character corresponding to $U^{ry}_{ij}$ when $U^{ry}_{ij} \geq U_0$.

Specifically, the value of $F_0$ ranges from 0.8 to 1.

Preferably, $F_0$ is 0.8, which avoids inaccurate determination of the abnormal text caused by setting the threshold too low.

More preferably, $F_0$ is 0.9, so that the abnormal text can be determined more accurately.

Most preferably, $F_0$ is 1, so that an initial text can be conclusively determined to be an abnormal text.

Specifically, the target text is text in which the priority of the entity relationship in the annotation text is not less than the priority threshold.

Specifically, the abnormal text is a text in which the priority of the entity relationship in the annotation text is smaller than the priority threshold, where the abnormal text may be understood as an initial text of an entity relationship type corresponding to the initial text in a preset entity relationship type, or as an initial text in which the entity relationship is annotated incorrectly.

The embodiment provides a data processing system for determining an abnormal text, which comprises: a database, a processor, and a memory storing a computer program, wherein the database comprises: an initial text set, which when executed by a processor, performs the steps of: acquiring an initial sentence list and initial characters corresponding to the initial text according to the initial text; acquiring a corresponding entity relationship probability list according to the initial character, and performing traversal processing on the entity relationship probability list to acquire a target entity relationship probability list corresponding to the initial character; traversing the target entity relationship probability list to obtain the maximum probability value of the target entity relationship; acquiring the priority corresponding to the initial text according to the maximum probability value of the target entity relationship; when the priority is greater than or equal to a preset priority threshold, determining the initial text as a target text; when the priority is smaller than a preset priority threshold, determining the initial text as an abnormal text; therefore, on one hand, in the process of processing the text, the entity can be extracted by using only one preset model, so that the used text data is less, and the workload of marking personnel is reduced; on the other hand, in the extraction process of the entity relationship, the entity relationship can be extracted by using a plurality of methods, so that the prediction accuracy of the model on the entity relationship is improved.

Although some specific embodiments of the present invention have been described in detail by way of illustration, it should be understood by those skilled in the art that the above illustration is only for the purpose of illustration and is not intended to limit the scope of the invention. It will also be appreciated by those skilled in the art that various modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims

1. A data processing system for determining abnormal text, the system comprising: a database, a processor, and a memory storing a computer program, wherein the database comprises an initial text list $H = \{H_1, \ldots, H_i, \ldots, H_m\}$, where $H_i$ is the $i$-th initial text, $i = 1, \ldots, m$, and $m$ is the number of initial texts, and wherein the computer program, when executed by the processor, implements the following steps:

S100, according to $H_i$, obtaining the initial sentence list $D_i = \{D_{i1}, \ldots, D_{ij}, \ldots, D_{in_i}\}$ corresponding to $H_i$, where $D_{ij} = (D^{1}_{ij}, \ldots, D^{r}_{ij}, \ldots, D^{s_j}_{ij})$, $D^{r}_{ij}$ is the $r$-th initial character of the $j$-th initial sentence of $H_i$, $j = 1, \ldots, n_i$, $n_i$ is the number of initial sentences in $H_i$, $r = 1, \ldots, s_j$, and $s_j$ is the number of initial characters in the $j$-th initial sentence;

S200, according to $D^{r}_{ij}$, obtaining the initial entity relationship list $G^{r}_{ij} = \{G^{r1}_{ij}, \ldots, G^{rx}_{ij}, \ldots, G^{rq}_{ij}\}$ corresponding to $D^{r}_{ij}$, where $G^{rx}_{ij}$ is the probability value of the $x$-th type of initial entity relationship corresponding to $D^{r}_{ij}$, $x = 1, \ldots, q$, and $q$ is the number of types of initial entity relationships;

S300, traversing $G^{r}_{ij}$ and, when the type of the initial entity relationship corresponding to $G^{rx}_{ij}$ is a non-target relationship type, deleting $G^{rx}_{ij}$ from $G^{r}_{ij}$, so as to construct the target entity relationship list $U^{r}_{ij} = \{U^{r1}_{ij}, \ldots, U^{ry}_{ij}, \ldots, U^{rp}_{ij}\}$ corresponding to $D^{r}_{ij}$, where $U^{ry}_{ij}$ is the probability value of the $y$-th type of target entity relationship corresponding to $D^{r}_{ij}$, $y = 1, \ldots, p$, and $p$ is the number of types of target entity relationships;

S400, traversing $U^{r}_{ij}$ and, when $U^{ry}_{ij} \geq U_0$, obtaining the maximum probability value of the target entity relationship from $U^{r}_{ij}$, where $U_0$ is a preset probability value threshold;

2. The data processing system for determining abnormal texts according to claim 1, wherein in step S200, the initial entity relationship is an association relationship between entities in a triple corresponding to an initial sentence.

3. The data processing system for determining an abnormal text according to claim 1, wherein in step S200, $\sum_{x=1}^{q} G^{rx}_{ij} = 1$.

4. The data processing system for determining abnormal text according to claim 1, wherein in the step S300, the non-target relationship is an unassociated relationship between entities.

5. The data processing system for determining an abnormal text according to claim 1, wherein in step S400, the value of $U_0$ ranges from 0.5 to 0.6.

6. The data processing system for determining an abnormal text according to claim 1, wherein the step S500 further comprises the steps of:

S501, based on the maximum probability value of the target entity relationship in $U^{r}_{ij}$, constructing the first intermediate data list $V_i = \{V_{i1}, \ldots, V_{ij}, \ldots, V_{in_i}\}$ corresponding to $H_i$, where $V_{ij} = (V^{1}_{ij}, \ldots, V^{t}_{ij}, \ldots, V^{k_j}_{ij})$, $V^{t}_{ij}$ is the probability value of the $t$-th target character in the $j$-th initial sentence of $H_i$, $t = 1, \ldots, k_j$, and $k_j$ is the number of target characters in the $j$-th initial sentence;

[Formula image not reproduced in the source text.];

S505, when $F^{0}_{i} \geq F_0$, determining that $H_i$ is a target text, where $F_0$ is a preset priority threshold;

S507, when $F^{0}_{i} < F_0$, determining that $H_i$ is an abnormal text.

7. The data processing system for determining an abnormal text according to claim 1, wherein the step S500 is performed by:

S501, according to the maximum probability value of the target entity relationship, obtaining the priority list $F_i = \{F_{i1}, \ldots, F_{iy}, \ldots, F_{ip}\}$ corresponding to $H_i$, where $F_{iy}$ is the priority of the $y$-th type of target entity relationship corresponding to $H_i$;

wherein, in step S501, $F_{iy}$ is obtained through the following steps:

S5011, based on the maximum probability value of the target entity relationship in $U^{r}_{ij}$, constructing the second intermediate data list $C_i = \{C_{i1}, \ldots, C_{iy}, \ldots, C_{ip}\}$ corresponding to $H_i$, where $C_{iy} = \{C^{1}_{iy}, \ldots, C^{g}_{iy}, \ldots, C^{z_y}_{iy}\}$, $C^{g}_{iy} = (C^{g1}_{iy}, \ldots, C^{ge}_{iy}, \ldots, C^{gw_g}_{iy})$, $C^{ge}_{iy}$ is the probability value of the $e$-th target character in the $g$-th initial sentence of the $y$-th type of target entity relationship corresponding to $H_i$, $g = 1, \ldots, z_y$, $z_y$ is the number of initial sentences in the $y$-th type of target entity relationship, $e = 1, \ldots, w_g$, and $w_g$ is the number of target characters in the $g$-th initial sentence;

S5013, according to $C^{ge}_{iy}$, obtaining $F_{iy}$, where $F_{iy}$ satisfies the following condition:

[Formula image not reproduced in the source text.];

S503, when $F_{ip} \geq F_0$, determining that $H_i$ is a target text, where $F_0$ is a preset priority threshold;

S505, when $F_{ip} < F_0$, determining that $H_i$ is an abnormal text.

8. The data processing system for determining abnormal text according to claim 6 or 7, wherein the value of $F_0$ ranges from 0.8 to 1.

9. The data processing system for determining abnormal texts according to the claim 6 or 7, wherein the target texts are texts with entity relationships in the annotation texts with priorities not less than a preset priority threshold.

10. The data processing system for determining abnormal text according to claim 6 or 7, wherein the abnormal text is a text in which the priority of the entity relationship in the annotation text is less than a preset priority threshold.