CN116384388B - Method, device, equipment and medium for reverse identification AI intelligent writing - Google Patents


Info

Publication number
CN116384388B
CN116384388B (application CN202310108714.3A)
Authority
CN
China
Prior art keywords: text, clause, sentence, input, dimension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310108714.3A
Other languages
Chinese (zh)
Other versions
CN116384388A (en)
Inventor
陈卫海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xijin Information Technology Co ltd
Original Assignee
Shanghai Xijin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xijin Information Technology Co ltd filed Critical Shanghai Xijin Information Technology Co ltd
Priority to CN202310108714.3A priority Critical patent/CN116384388B/en
Publication of CN116384388A publication Critical patent/CN116384388A/en
Application granted granted Critical
Publication of CN116384388B publication Critical patent/CN116384388B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/216 - Parsing using statistical methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to a method, a device, equipment and a medium for reverse identification of AI intelligent writing. The method comprises the following steps: segmenting the text, and dividing each segment of text into clauses according to punctuation marks; when the number of clauses of the target segment is greater than or equal to a first threshold value, removing common high-frequency sentences to form an input clause data set composed of a plurality of clauses; extracting feature values from the input clause data set along each of six dimensions: sentence confusion, sentence presentation dispersion, knowledge span, topic cohesiveness, emotion richness and non-civilized expression; performing data normalization on the feature values extracted from each dimension, and optimizing a support vector machine classifier with the normalized feature group to obtain a classification model; and inputting the text to be identified into the classification model, which classifies each text segment as human-authored or machine-authored, thereby helping humans decide whether to invest time and energy in works authored by artificial intelligence.

Description

Method, device, equipment and medium for reverse identification AI intelligent writing
Technical Field
The application relates to the technical field of reverse identification AI intelligent writing, in particular to a method, a device, computer equipment and a storage medium for reverse identification AI intelligent writing.
Background
With the progress of technology, AI intelligent writing has become more and more capable, and at the same time it generates a large amount of garbage data. Such garbage data not only has no substantial creative value, but also greatly interferes with people's normal work and wastes their time and energy.
A reverse identification AI intelligent writing algorithm provides a means for intelligently identifying whether an article or a section of text was authored by a human or by a machine, thereby helping humans decide whether to invest time and energy in works authored by artificial intelligence.
Therefore, how to effectively distinguish whether an article or a text was created by a human or by a machine is an urgent technical problem to be solved.
Disclosure of Invention
Based on the above, it is necessary to provide a method, apparatus, computer device and storage medium for reverse identification AI intelligent writing, which can solve the technical problem that it is not possible to effectively distinguish whether an article or a text is created by human or by machine at present.
In one aspect, a method for reverse identification AI smart writing is provided, the method comprising:
segmenting the text to form an input paragraph data set D = {d_1, d_2, d_3, ..., d_i, ..., d_m} composed of multiple segments of text;
Initializing the input paragraph data set D, and dividing each paragraph of text into clauses according to punctuation marks; when the number t of clauses of the target paragraph d_i is greater than or equal to a first threshold value, removing common high-frequency sentences to form an input clause data set S_1 = {s_1, s_2, s_3, ..., s_i, ..., s_T} composed of a plurality of clauses, wherein T ≤ t;
extracting feature values from the input clause data set S_1 along each of the dimensions of sentence confusion, sentence presentation dispersion, knowledge span, topic cohesiveness, emotion richness and non-civilized expression;
carrying out data normalization processing on the feature values extracted by each dimension, mapping the feature values extracted by each dimension into the same interval (0, 1) to form a normalized feature group X= (X) 1 ,x 2 ,x 3 ,x 4 ,x 5 ,x 6 );
Optimizing a support vector machine classifier by utilizing the normalized feature set X, and finding out the most suitable kernel function through multiple rounds of training to obtain a classification model;
and inputting the text to be identified into the classification model, and classifying each text segment of the text to be identified as human authoring or machine authoring.
In one embodiment, after initializing the input paragraph data set and dividing each paragraph of text into clauses according to punctuation marks, the method further includes: when the number of clauses of the target paragraph is smaller than the first threshold value, no judgment is made as to whether the target paragraph is machine-authored; otherwise, the next step is executed.
In one embodiment, the step of extracting feature values from sentence confusion dimensions for the input clause dataset comprises:
using the jieba Chinese word segmenter to segment each sentence of each text d_i in the input paragraph data set D into a word sequence, whose length is recorded as N and whose word elements are recorded as WD, so that the current sentence is S = {WD_1, WD_2, ..., WD_N};
calculating the confusion of each sentence by a statistics-based language model confusion (perplexity) algorithm PP(S), whose calculation formula is PP(S) = (∏_{i=1}^{N} 1/p(w_i | w_1 w_2 ... w_{i-1}))^{1/N}, wherein S represents the current sentence; N is the word sequence length; p(w_i) represents the probability of the i-th word; and p(w_i | w_1 w_2 w_3 ... w_{i-1}) represents the probability of the i-th word computed from the previous i-1 words;
calculating the confusion p_i of each sentence s_i of the input clause data set S_1 using PP(S), generating a sentence confusion set P = {p_1, p_2, p_3, ..., p_T};
obtaining the maximum confusion p_max in the sentence confusion set P = {p_1, p_2, p_3, ..., p_T}; if p_max is smaller than or equal to a second threshold value, judging the text to be machine-authored; otherwise, entering the next step;
arranging the elements of the sentence confusion set P from small to large by confusion value, and taking the elements from position 3T/4 to the end to form a set P_a = {p_{3T/4}, ..., p_T}; calculating the arithmetic mean of the set P_a as the confusion P_d of the text.
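As an illustrative sketch (not part of the claimed method itself), the sentence-confusion feature above could be computed as follows. The toy unigram probability table stands in for the statistical language model of PP(S), which in the patent conditions each word on its predecessors, and the boundary index 3T/4 follows a zero-based convention; both are assumptions of this sketch:

```python
import math

def perplexity(tokens, prob):
    """PP(S) = (prod of 1/p(w_i))**(1/N), here under a toy unigram model;
    unseen tokens get a small floor probability (an assumption)."""
    n = len(tokens)
    log_pp = -sum(math.log(prob.get(t, 1e-6)) for t in tokens) / n
    return math.exp(log_pp)

def paragraph_confusion(sentence_pps, second_threshold):
    """Aggregate sentence perplexities into the paragraph feature P_d:
    if the maximum confusion is at or below the second threshold, the
    paragraph is judged machine-authored; otherwise P_d is the mean of
    the top quartile of sentence confusions."""
    if max(sentence_pps) <= second_threshold:
        return None  # judged machine-authored outright
    ordered = sorted(sentence_pps)
    tail = ordered[3 * len(ordered) // 4:]  # elements from 3T/4 to the end
    return sum(tail) / len(tail)
```

Averaging only the high-confusion tail makes the feature robust to a few fluent (low-perplexity) sentences inside an otherwise human-written paragraph.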
In one embodiment, the step of extracting feature values from a statement presentation dispersion dimension for the input clause dataset comprises:
for each clause s_i of the input clause data set S_1, calculating the character length l_i, generating a clause length set L_s = {l_1, l_2, ..., l_T};
calculating the dispersion of the clause length set L_s as q_d = sqrt( (1/T) Σ_{i=1}^{T} (l_i − l̄)^2 ), wherein l̄ is the average of all clause character lengths in L_s;
marking the dispersion of the text paragraph as Q_d, with Q_d = q_d.
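A minimal sketch of the dispersion feature, reading the dispersion formula as the population standard deviation of the clause character lengths:

```python
import math

def clause_dispersion(clauses):
    """Q_d: standard deviation of clause character lengths over L_s."""
    lengths = [len(s) for s in clauses]
    mean = sum(lengths) / len(lengths)
    return math.sqrt(sum((l - mean) ** 2 for l in lengths) / len(lengths))
```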
In one embodiment, the step of extracting feature values from the knowledge span dimension for the input clause dataset comprises:
using a pre-trained knowledge classification model to classify each clause s_i of the input clause data set S_1, the corresponding classification set being C_s = {c_1, c_2, ..., c_r}, wherein r ≤ T;
performing de-duplication on the elements of the classification set C_s to obtain C_d = {c_1, c_2, ..., c_R}, wherein R ≤ r;
calculating the knowledge span of the text paragraph as K_d = card(C_d), wherein card(C_d) represents the number of elements in the set C_d.
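The knowledge-span feature reduces to counting distinct categories; in this sketch `classify` is a stand-in for the pre-trained knowledge classification model:

```python
def knowledge_span(clauses, classify):
    """K_d = card(C_d): the number of distinct knowledge categories
    assigned to the clauses (a set comprehension performs the
    de-duplication step in one pass)."""
    return len({classify(s) for s in clauses})
```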
In one embodiment, the step of extracting feature values from the topic cohesiveness dimension for the input clause dataset comprises:
using a word2vec model to calculate, for each clause s_i of the input clause data set S_1, its semantic similarity with the other sentences;
judging sentences whose semantic similarity is greater than a third threshold value to belong to the same topic;
calculating the number k of topics in the input clause data set S_1;
calculating the topic cohesiveness of the text paragraph as T_d = e^(1−k), wherein k > 0.
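A sketch of the topic-cohesiveness feature; `similarity` stands in for word2vec cosine similarity, and the greedy single-pass grouping (compare each clause against one representative per topic) is an assumption, since the patent does not specify how pairwise similarities are merged into topics:

```python
import math

def topic_cohesiveness(clauses, similarity, third_threshold=0.7):
    """T_d = e^(1-k), where k is the number of topics found by grouping
    clauses whose similarity exceeds the third threshold."""
    topics = []  # one representative clause per topic
    for s in clauses:
        for rep in topics:
            if similarity(s, rep) > third_threshold:
                break  # joins an existing topic
        else:
            topics.append(s)  # opens a new topic
    k = len(topics)
    return math.exp(1 - k)
```

With a single topic T_d = 1; the feature decays exponentially as the paragraph scatters across more topics.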
In one embodiment, the step of extracting feature values from the emotion richness dimension for the input clause dataset comprises:
calculating character richness: representing all characters other than Chinese characters and English letters as a set F_1, then counting the number of characters of the target paragraph d_i contained in the set F_1, recorded as N_f1;
calculating modal-particle richness: setting a modal-particle set F_2 comprising common Chinese sentence-final particles, then counting the number of particles of the target paragraph d_i contained in the set F_2, recorded as N_f2;
calculating the emotion richness of the text paragraph as E_d = αN_f1 + βN_f2, wherein α and β are weight factors.
In one embodiment, the step of extracting feature values from the non-civilized term dimension for the input clause dataset comprises:
performing a non-civilized-expression check on each clause s_i of the input clause data set S_1 based on a rule engine and a knowledge base; when any clause matches a non-civilized-expression rule, the corresponding value f_1 of the target paragraph d_i is marked as 1, otherwise as 0;
performing a check for reverse blackening or misquoting of positive figures or events on each clause s_i of the input clause data set S_1 based on the rule engine and knowledge base; when any clause matches the rule engine definition, the corresponding value f_2 of the target paragraph d_i is marked as 1, otherwise as 0;
the non-civilized expression of the text paragraph is denoted D_d, which is the maximum of f_1 and f_2: if either f_1 or f_2 is 1, D_d = 1, otherwise D_d = 0.
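The two rule checks combine into a single binary flag; in this sketch the two predicates stand in for the rule-engine and knowledge-base lookups, which the patent does not specify in detail:

```python
def non_civilized(clauses, violates_rule, blackens_positive):
    """D_d = max(f_1, f_2): 1 if any clause matches the non-civilized-
    expression rules (f_1) or the reverse-blackening/misquoting rules
    (f_2), else 0."""
    f1 = int(any(violates_rule(s) for s in clauses))
    f2 = int(any(blackens_positive(s) for s in clauses))
    return max(f1, f2)
```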
In one embodiment, performing data normalization processing on the feature values extracted from each dimension and mapping them into the same interval (0, 1) to form the normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6) comprises:
obtaining the feature values extracted from the input clause data set S_1 along each of the dimensions of sentence confusion, sentence presentation dispersion, knowledge span, topic cohesiveness, emotion richness and non-civilized expression, forming the feature vector X' = (P_d, Q_d, K_d, T_d, E_d, D_d);
normalizing each feature value using the formula x_norm = (x − x_min) / (x_max − x_min), wherein x_min is the minimum value in the corresponding feature value set, x_max is the maximum value in the corresponding feature value set, x is the corresponding current feature value, and x_norm is its normalized value;
after processing, each dimension's feature value is mapped into the (0, 1) interval, forming the normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6).
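The min-max normalization above can be sketched directly from its formula (it assumes the feature set contains distinct minimum and maximum values):

```python
def min_max_normalize(values):
    """x_norm = (x - x_min) / (x_max - x_min): map each feature value of
    one dimension into the [0, 1] range relative to the observed set."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```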
On the other hand, provided is a reverse identification AI smart writing device, the device includes:
the segmentation processing module is used for segmenting the text to form an input paragraph data set D = {d_1, d_2, d_3, ..., d_i, ..., d_m};
the clause processing module is used for initializing the input paragraph data set D and dividing each paragraph of text into clauses according to punctuation marks; when the number t of clauses of the target paragraph d_i is greater than or equal to the first threshold value, removing common high-frequency sentences to form an input clause data set S_1 = {s_1, s_2, s_3, ..., s_i, ..., s_T} composed of a plurality of clauses, wherein T ≤ t;
the feature value extraction module is used for extracting feature values from the input clause data set S_1 along each of the dimensions of sentence confusion, sentence presentation dispersion, knowledge span, topic cohesiveness, emotion richness and non-civilized expression;
the data normalization processing module is used for performing data normalization processing on the feature values extracted from each dimension, mapping them into the same interval (0, 1) to form a normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6);
The classification model acquisition module is used for optimizing the support vector machine classifier by utilizing the normalized feature set X, and finding out the most suitable kernel function through multiple rounds of training to obtain a classification model;
and the text recognition module is used for inputting the text to be recognized into the classification model, and classifying each text segment of the text to be recognized into human authoring or machine authoring.
In yet another aspect, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of:
segmenting the text to form an input paragraph data set D = {d_1, d_2, d_3, ..., d_i, ..., d_m} composed of multiple segments of text;
initializing the input paragraph data set D, and dividing each paragraph of text into clauses according to punctuation marks; when the number t of clauses of the target paragraph d_i is greater than or equal to a first threshold value, removing common high-frequency sentences to form an input clause data set S_1 = {s_1, s_2, s_3, ..., s_i, ..., s_T} composed of a plurality of clauses, wherein T ≤ t;
extracting feature values from the input clause data set S_1 along each of the dimensions of sentence confusion, sentence presentation dispersion, knowledge span, topic cohesiveness, emotion richness and non-civilized expression;
performing data normalization processing on the feature values extracted from each dimension, mapping them into the same interval (0, 1) to form a normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6);
Optimizing a support vector machine classifier by utilizing the normalized feature set X, and finding out the most suitable kernel function through multiple rounds of training to obtain a classification model;
and inputting the text to be identified into the classification model, and classifying each text segment of the text to be identified as human authoring or machine authoring.
In yet another aspect, a computer readable storage medium is provided, having stored thereon a computer program which when executed by a processor performs the steps of:
segmenting the text to form an input paragraph data set D = {d_1, d_2, d_3, ..., d_i, ..., d_m} composed of multiple segments of text;
initializing the input paragraph data set D, and dividing each paragraph of text into clauses according to punctuation marks; when the number t of clauses of the target paragraph d_i is greater than or equal to a first threshold value, removing common high-frequency sentences to form an input clause data set S_1 = {s_1, s_2, s_3, ..., s_i, ..., s_T} composed of a plurality of clauses, wherein T ≤ t;
extracting feature values from the input clause data set S_1 along each of the dimensions of sentence confusion, sentence presentation dispersion, knowledge span, topic cohesiveness, emotion richness and non-civilized expression;
performing data normalization processing on the feature values extracted from each dimension, mapping them into the same interval (0, 1) to form a normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6);
Optimizing a support vector machine classifier by utilizing the normalized feature set X, and finding out the most suitable kernel function through multiple rounds of training to obtain a classification model;
and inputting the text to be identified into the classification model, and classifying each text segment of the text to be identified as human authoring or machine authoring.
According to the above method, device, computer equipment and storage medium for reverse identification of AI intelligent writing, based on the characteristics of AI intelligent writing, reverse judgment is made on the feature values extracted from each dimension, so that whether an article or a section of text was created by a human or by a machine can be intelligently identified, helping humans decide whether to invest time and energy in works authored by artificial intelligence.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an algorithm flow diagram of a method of reverse identification AI smart writing in one embodiment;
FIG. 2 is a flow chart of a method for reverse identification AI smart writing in another embodiment;
FIG. 3 is a flow diagram of the step of extracting feature values from sentence confusion dimensions for the input clause dataset, in one embodiment;
FIG. 4 is a flow diagram of the step of extracting feature values from a statement presentation dispersion dimension for the input clause dataset in one embodiment;
FIG. 5 is a flow diagram of the step of extracting feature values from knowledge span dimensions for the input clause dataset in one embodiment;
FIG. 6 is a flow diagram of the step of extracting feature values from a topic cohesiveness dimension for the input clause dataset, in one embodiment;
FIG. 7 is a flow diagram of a step of extracting feature values from emotion richness dimensions for the input clause dataset, in one embodiment;
FIG. 8 is a flow diagram of a step of extracting feature values from non-civilized phrase dimensions for the input phrase dataset in one embodiment;
fig. 9 is a flow diagram of the steps of performing data normalization processing on the feature values extracted from each dimension and mapping them into the same interval (0, 1) to form the normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6), in one embodiment;
FIG. 10 is a block diagram of a reverse identification AI smart writing apparatus in one embodiment;
FIG. 11 is an internal block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Example 1
In order to solve the problems pointed out in the background, embodiment 1 of the invention provides a method for reverse identification of AI intelligent writing. The reverse identification AI intelligent writing algorithm is a means for intelligently identifying whether an article or a section of text was created by a human or by a machine, thereby helping humans decide whether to invest time and energy in works authored by artificial intelligence.
The reverse identification AI intelligent writing algorithm PQKTED (Ppl-QTd-KnowledgeSpan-Theme-Emotion-Deviation from social core values) judges whether a text was written by a machine by reverse calculation from the principles and features of artificial intelligent writing. The PQKTED algorithm takes six dimensions as features: sentence confusion (Ppl), sentence presentation dispersion (QTd), knowledge span (KnowledgeSpan), topic cohesiveness (Theme), emotion richness (Emotion) and non-civilized expression (Deviation from social core values), and uses a supervised classifier for model training and classification, namely:
y = f(x_1, x_2, x_3, x_4, x_5, x_6)
Wherein:
y is a classification result;
f(x_1, x_2, x_3, x_4, x_5, x_6) is the classifier (SVM, LSTM, or other traditional machine learning or deep learning methods; different classifier choices yield different final precision and recall; in the experiment, an SVM is adopted as the classifier);
x_1 is the confusion (Ppl), x_2 is the dispersion (QTd), x_3 is the knowledge span (KnowledgeSpan), x_4 is the topic cohesiveness (Theme), x_5 is the emotion richness (Emotion), and x_6 is the non-civilized expression (Deviation from social core values).
Specifically, as shown in fig. 1, fig. 1 is an algorithm flow chart of the method for reverse identification AI smart writing.
The algorithm has six links: text segmentation processing, data preprocessing, data feature extraction, feature normalization, model training with support vector machine classifier tuning, and model evaluation.
1. Text segmentation processing. First, a segmentation rule base is compiled from the different presentation modes of the test data or of specific data; the text is then segmented by paragraph, the paragraphs being recorded as D = {d_1, d_2, d_3, ..., d_m}.
2. And (5) preprocessing data.
a) Each d_i in D is divided into clauses according to punctuation marks, recorded as S = {s_1, s_2, s_3, ..., s_t}. A threshold a is set; when t < a, no calculation is made as to whether d_i is machine-authored. Otherwise, the next step is carried out.
b) The clauses s_i are filtered against a database of common high-frequency sentences to remove sentences familiar to both humans and machines, generating a new set S_1 = {s_1, s_2, s_3, ..., s_T}, where T ≤ t.
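The two preprocessing steps above can be sketched together; the punctuation set used for clause splitting and the default threshold are illustrative assumptions, and `common_sentences` stands in for the high-frequency sentence database:

```python
import re

def preprocess(paragraph, common_sentences, first_threshold=3):
    """Split a paragraph into clauses by punctuation, skip paragraphs with
    too few clauses, and filter out common high-frequency sentences."""
    clauses = [s for s in re.split(r'[。！？；.!?;]', paragraph) if s.strip()]
    if len(clauses) < first_threshold:
        return None  # too short: no machine-authorship judgment is made
    return [s for s in clauses if s not in common_sentences]
```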
3. And (5) extracting data characteristics. Data features are extracted from the following six dimensions, with no precedence order.
a) Confusion (Ppl). The index by which AI intelligent writing judges whether a sentence is fluent is the sentence's confusion (perplexity): the lower the confusion, the more fluent the sentence. In this model, higher confusion is considered more likely to indicate human authoring, while confusion that is low, or that always approaches a threshold b without ever exceeding it, is considered to indicate possible AI intelligent authoring. The confusion calculation process is as follows: first, the sentence is segmented into Chinese words using the jieba word segmenter, generating a word sequence whose length is recorded as N and whose word elements are recorded as WD, i.e.
S = {WD_1, WD_2, ..., WD_N}.
Secondly, the sentence confusion is calculated by the statistics-based language model confusion algorithm PP(S), whose calculation formula is:
PP(S) = (∏_{i=1}^{N} 1/p(w_i | w_1 w_2 ... w_{i-1}))^{1/N}
wherein S represents the current sentence; N is the word sequence length; p(w_i) represents the probability of the i-th word; and p(w_i | w_1 w_2 w_3 ... w_{i-1}) is the probability of the i-th word calculated from the first i−1 words.
Third step: the confusion p_i of each s_i in the set S_1 is calculated using PP(S), generating a sentence confusion set P = {p_1, p_2, p_3, ..., p_T}. The confusion P_d of the section of text is then calculated:
A) If p_max ≤ b, then d_i is judged to be AI intelligent authoring. Otherwise, the next step is carried out.
B) The set P is processed as follows:
(1) the elements in P are arranged from small to large by value;
(2) the elements from position 3T/4 to the end are taken, generating the new set P_a = {p_{3T/4}, ..., p_T};
(3) the arithmetic mean of P_a is calculated: p̄_a = (1/|P_a|) Σ_{p ∈ P_a} p;
(4) P_d is recorded as p̄_a.
b) Dispersion (QTd). The dispersion of the article paragraph is recorded as Q_d. The algorithm for calculating Q_d is as follows:
(1) for each s_i in the set S_1, the character length l_i is calculated, generating a new set L_s = {l_1, l_2, ..., l_T};
(2) the dispersion q_d of L_s is calculated as q_d = sqrt( (1/T) Σ_{i=1}^{T} (l_i − l̄)^2 ), where l̄ is the average of the clause lengths;
(3) Q_d = q_d is recorded.
c) Knowledge span (KnowledgeSpan). The paragraph knowledge span is recorded as K_d.
(1) Each sentence in the set S_1 is classified using a self-pre-trained knowledge classification model (M-Knowledge, covering 22 classifications such as sociology, anthropology, economics, political science, history and law), the corresponding classification set being
C_s = {c_1, c_2, ..., c_r}, where r ≤ T.
(2) The elements in C_s are de-duplicated, obtaining
C_d = {c_1, c_2, ..., c_R}, where R ≤ r.
(3) The knowledge span is K_d = card(C_d), where card(C_d) represents the number of elements in the set C_d. The number of classifications involved in the text is thus used as the knowledge span.
d) Topic cohesiveness (Theme). The topic cohesiveness is recorded as T_d. The process of calculating T_d is as follows:
(1) for each s_i in the set S_1, its semantic similarity with the other sentences is calculated using a word2vec model;
(2) sentences with similarity greater than a threshold B are treated as one topic;
(3) the number k of topics in S_1 is calculated from (1) and (2);
(4) the topic cohesiveness is recorded as
T_d = e^(1−k), where k > 0.
e) Emotion richness (Emotion). The emotion richness is recorded as E_d. Text emotion richness is measured here along two dimensions: character richness and modal-particle richness.
(1) Character richness calculation. All characters other than Chinese characters and English letters are represented as a set F_1;
the number of characters of paragraph d_i contained in F_1 is then counted and recorded as N_f1.
(2) Modal-particle richness calculation. A modal-particle set F_2 is defined, comprising common Chinese sentence-final particles; the number of particles of paragraph d_i contained in F_2 is then counted and recorded as N_f2.
(3) The emotion richness is recorded as
E_d = αN_f1 + βN_f2, where α and β are weight factors.
f) Non-civilized expression (Deviation from social core values), recorded as D_d. In the present model, this feature is calculated from two dimensions: the use of non-civilized expressions, and the reverse blackening or misquoting of positive figures or events.
(1) Non-civilized expression calculation. Based on the rule engine and knowledge base, when any s_i in S_1 matches a non-civilized-expression rule, the value f_1 of the text segment d_i is marked as 1, otherwise as 0.
(2) Reverse blackening or misquoting of positive figures or events. Based on the rule engine and knowledge base, when any s_i in S_1 matches the rule engine definition, the value f_2 of the text segment d_i is marked as 1, otherwise as 0.
(3) D_d is the maximum of f_1 and f_2, i.e. as long as either f_1 or f_2 is 1, then D_d = 1; otherwise D_d = 0.
4. Data normalization
The feature vector X' = (P_d, Q_d, K_d, T_d, E_d, D_d) extracted in step 3 is normalized using the formula:
x_norm = (x − x_min) / (x_max − x_min)
After normalization, the six-dimensional feature values are mapped into the (0, 1) interval, and the normalized feature group is recorded as X = (x_1, x_2, x_3, x_4, x_5, x_6), where x_min is the minimum value in the corresponding feature value set, x_max is the maximum value, and x is the corresponding current feature value.
5. Model training and classifier tuning
The manually annotated data is divided into a training set, a validation set and a test set, the latter two in a 1:1 ratio. The training set is used for training the model and determining its parameters; the validation set is used for tuning parameters and monitoring whether the model overfits, and is used repeatedly during training to continuously adjust the parameters; the test set is used to evaluate the generalization ability of the final model.
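A sketch of the data split; the source leaves the training-set share of the ratio garbled, so the 8:1:1 default here is an illustrative assumption, as is the fixed shuffle seed:

```python
import random

def split_dataset(samples, ratios=(8, 1, 1), seed=42):
    """Shuffle the annotated samples and split them into train,
    validation and test sets according to `ratios`."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```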
A support vector machine is a binary classification model whose basic form is the maximum-margin linear classifier defined in feature space.
The data for this task is linearly inseparable in two dimensions, but a linearly inseparable problem may become separable after a nonlinear mapping, i.e. there exists a feature space in which a hyperplane separates the positive and negative classes. A nonlinear function can map the linearly inseparable problem from the original feature space into a higher-dimensional space, converting it into a linearly separable one, where the hyperplane serving as the decision boundary is represented as follows:
ω·φ(x) + b = 0
y(x) = sign(ω·φ(x) + b)
Since the mapping function φ(x) has a complex form, its inner product is difficult to compute directly, so the kernel method is used: a kernel function K(x, x') replaces the inner product φ(x)·φ(x') in the formulas above.
The kernel adopted for this task is the polynomial kernel, whose analytic form is K(x, x') = (x·x' + 1)^n, where n is the order of the kernel.
When the order of the polynomial kernel is 1, it is called a linear kernel, and the corresponding nonlinear classifier degenerates into a linear classifier.
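The polynomial kernel and its order-1 degeneration can be illustrated with a small sketch (the constant term c = 1 matches the form assumed above; both parameters are exposed so other variants can be tried):

```python
def polynomial_kernel(x, z, order=2, c=1.0):
    """K(x, z) = (x·z + c)^order. With order == 1 this reduces to an
    (affine) linear kernel, matching the degeneration noted above."""
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + c) ** order
```

In practice an SVM library (e.g. scikit-learn's `SVC(kernel='poly')`) would be used rather than a hand-rolled kernel; this sketch only shows the quantity the classifier substitutes for φ(x)·φ(x').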
6. Model evaluation
A model is typically evaluated along several dimensions, such as accuracy, precision, recall and F1 value. Accuracy is the proportion of correctly classified samples among all samples; precision (P) is the proportion of true positives among the samples predicted positive; recall (R) is the proportion of true positives among the actual positive samples. The F1 value is their harmonic mean: F1 = 2 / (1/P + 1/R).
The dimensions of concern in this task are precision, recall and F1 value; depending on the usage scenario, different dimensions may be emphasized.
Evaluating the generalization of the model on the test set gives the following results:

Model      Precision   Recall   F1-score
PQKTED     0.76        0.93     0.803

From these statistics, the precision is 76%, meaning that of all samples the model judged to be AI-written, 76% actually were; the recall is 93%, meaning that of the AI-written samples in the test set, 93% were recalled. The model therefore generalizes well, and the PQKTED algorithm can effectively identify whether a text was authored by AI.
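The precision/recall/F1 computation underlying this evaluation can be sketched directly from predictions, with AI-written as the positive class:

```python
def evaluate(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classifier
    (positive class = AI-written)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 = 2 / (1/P + 1/R), the harmonic mean of precision and recall
    f1 = 2 / (1 / precision + 1 / recall) if precision and recall else 0.0
    return accuracy, precision, recall, f1
```

The labels here are illustrative; in the patent's setting `y_true` would come from the manually annotated test set and `y_pred` from the trained classifier.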
Example 2
Example 2 includes all the technical features of Example 1. Specifically, as shown in fig. 2, Example 2 provides a method for reverse identification of AI intelligent writing, comprising the following steps:
Step S1: segment the text to form an input paragraph data set D = {d_1, d_2, d_3, ..., d_i, ..., d_m} composed of multiple text segments;
Step S2: initialize the input paragraph data set D and split each text segment into clauses according to punctuation marks; when the number t of clauses of the target segment d_i is greater than or equal to a first threshold, remove common high-frequency sentences to form an input clause data set S_1 = {s_1, s_2, s_3, ..., s_i, ..., s_T} composed of multiple clauses, where T ≤ t;
Step S3: for the input clause data set S_1, extract feature values in each of the following dimensions: sentence perplexity (Ppl), sentence-presentation dispersion (QTd), knowledge span (KnowledgeSpan), topic cohesiveness (Theme), emotion richness (Emotion) and non-civilized terms (Deviation from social core values);
Step S4: perform data normalization on the feature values extracted in each dimension, mapping them into the same interval (0, 1) to form the normalized feature vector X = (x_1, x_2, x_3, x_4, x_5, x_6);
Step S5: tune the support vector machine classifier using the normalized feature vector X, finding the most suitable kernel function through multiple rounds of training to obtain a classification model;
Step S6: input the text to be identified into the classification model; each segment of the text to be identified is classified as human-authored or machine-authored.
In this embodiment, after initializing the input paragraph data set and splitting each text segment into clauses according to punctuation marks, the method further includes: when the number of clauses of the target segment is smaller than the first threshold, no judgment is made on whether the target segment is machine-authored; otherwise, the next step is executed.
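Steps S1-S2 and the clause-count filter can be sketched as follows; the threshold value, the high-frequency sentence examples and the punctuation set are illustrative assumptions, not values fixed by the method:

```python
import re

FIRST_THRESHOLD = 5                       # assumed value of the first threshold
HIGH_FREQ = {"众所周知", "综上所述"}        # assumed common high-frequency sentences

def split_paragraphs(text):
    """Step S1: split the text into the paragraph data set D."""
    return [p.strip() for p in text.split("\n") if p.strip()]

def split_clauses(paragraph):
    """Step S2: split a paragraph into clauses at sentence-ending punctuation.
    Paragraphs with fewer than FIRST_THRESHOLD clauses are not judged
    (returned as None); common high-frequency sentences are then removed."""
    clauses = [c for c in re.split(r"[。！？!?；;]", paragraph) if c]
    if len(clauses) < FIRST_THRESHOLD:
        return None
    return [c for c in clauses if c not in HIGH_FREQ]
```

Returning `None` implements the note above: short paragraphs carry too little signal, so no machine-authorship judgment is attempted for them.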
As shown in fig. 3, in this embodiment, the step of extracting feature values in the sentence-perplexity dimension for the input clause data set includes:
Step S11: use the jieba tokenizer to perform Chinese word segmentation on each sentence of each segment d_i in the input paragraph data set D, generating a word sequence; denote the sequence length as N and each token as WD, so the current sentence is S = {WD_1, WD_2, ..., WD_N};
Step S12: compute the perplexity of each sentence with a statistics-based language-model perplexity algorithm PP(S), calculated as PP(S) = P(w_1 w_2 ... w_N)^(−1/N) = (∏_{i=1..N} 1/P(w_i | w_1 w_2 ... w_{i−1}))^(1/N), where S denotes the current sentence, N is the word-sequence length, P(w_i) is the probability of the i-th word, and P(w_i | w_1 w_2 w_3 ... w_{i−1}) is the probability of the i-th word given the preceding i−1 words;
Step S13: use PP(S) to compute the perplexity p_i of each clause s_i in the input clause data set S_1, generating the sentence-perplexity set P = {p_1, p_2, p_3, ..., p_T};
Step S14: obtain the maximum perplexity p_max in the sentence-perplexity set P = {p_1, p_2, p_3, ..., p_T}; if p_max is less than or equal to a second threshold, judge the text to be machine-authored; otherwise, proceed to the next step;
Step S15: sort the elements of the sentence-perplexity set P in ascending order of perplexity value and take the elements from position 3T/4 to the end to form the set Pa = {p_(3T/4), ..., p_T}; take the arithmetic mean of the set Pa as the perplexity of the text, i.e. P_d = mean(Pa).
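Steps S12-S15 can be sketched as follows, assuming the conditional word probabilities are supplied by an external statistical language model (jieba segmentation and the language model itself are outside this sketch), and with an illustrative second threshold:

```python
import math

def perplexity(word_probs):
    """PP(S) = (prod of 1/P(w_i | w_1..w_{i-1}))^(1/N), computed in log
    space for numerical stability; word_probs are the conditional
    probabilities of the N words of the sentence."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

def paragraph_perplexity(clause_perplexities, second_threshold=20.0):
    """Steps S13-S15: if even the maximum clause perplexity is at or below
    the second threshold, judge the paragraph machine-authored; otherwise
    average the top quarter (positions 3T/4 .. T in ascending order)."""
    p = sorted(clause_perplexities)
    if p[-1] <= second_threshold:
        return "machine", None
    tail = p[3 * len(p) // 4:]
    return "undecided", sum(tail) / len(tail)
```

Averaging only the high-perplexity tail makes P_d reflect the paragraph's most "surprising" clauses, which human writing tends to contain and fluent model output tends to lack.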
As shown in fig. 4, in this embodiment, the step of extracting feature values in the sentence-presentation dispersion dimension for the input clause data set includes:
Step S21: for each clause s_i of the input clause data set S_1, compute the character length l_i, generating the clause-length set L_s = {l_1, l_2, ..., l_T};
Step S22: compute the dispersion of the clause-length set L_s as q_d = sqrt((1/T) Σ_{i=1..T} (l_i − l̄)²), where l̄ is the average character length of all clauses in L_s;
Step S23: record the dispersion of the text segment as Q_d, with Q_d = q_d.
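Steps S21-S23 amount to the standard deviation of clause lengths; a minimal sketch (standard deviation is an assumption standing in for the lost dispersion formula):

```python
import math

def clause_length_dispersion(clauses):
    """Q_d: standard deviation of clause character lengths."""
    lengths = [len(c) for c in clauses]
    mean = sum(lengths) / len(lengths)
    return math.sqrt(sum((l - mean) ** 2 for l in lengths) / len(lengths))
```

A low Q_d indicates clauses of uniform length, a pattern the method treats as more typical of machine-generated text.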
As shown in fig. 5, in this embodiment, the step of extracting feature values in the knowledge-span dimension for the input clause data set includes:
Step S31: use a pre-trained knowledge classification model to classify each clause s_i of the input clause data set S_1; the corresponding classification set is C_s = {c_1, c_2, ..., c_r}, where r ≤ T;
Step S32: deduplicate the elements of the classification set C_s to obtain C_d = {c_1, c_2, ..., c_R}, where R ≤ r;
Step S33: compute the knowledge span of the text segment as K_d = card(C_d), where card(C_d) denotes the number of elements in the set C_d.
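Steps S31-S33 reduce to counting distinct categories; a sketch in which `classify` stands in for the pre-trained knowledge classification model:

```python
def knowledge_span(clauses, classify):
    """K_d = card(C_d): the number of distinct knowledge categories among
    the clauses. Building the set deduplicates C_s into C_d directly."""
    return len({classify(c) for c in clauses})
```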
As shown in fig. 6, in this embodiment, the step of extracting feature values in the topic-cohesiveness dimension for the input clause data set includes:
Step S41: use a word2vec model to compute, for each clause s_i of the input clause data set S_1, its semantic similarity with the other clauses;
Step S42: judge sentences whose semantic similarity is greater than a third threshold to belong to the same topic;
Step S43: count the number k of topics in the input clause data set S_1;
Step S44: compute the topic cohesiveness of the text segment as T_d = e^(1−k), where k > 0.
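Steps S41-S44 can be sketched with a greedy grouping (the exact clustering rule is not specified in the text, so the grouping strategy and the `similarity` function, standing in for word2vec cosine similarity, are assumptions):

```python
import math

def topic_cohesiveness(clauses, similarity, third_threshold=0.7):
    """T_d = e^(1-k): greedily assign each clause to the first existing
    topic whose representative clause it resembles beyond the third
    threshold; otherwise it opens a new topic. k is the topic count."""
    topics = []                       # each topic keeps a representative clause
    for c in clauses:
        for rep in topics:
            if similarity(c, rep) > third_threshold:
                break                 # joins an existing topic
        else:
            topics.append(c)          # no similar topic found: new topic
    k = len(topics)
    return math.exp(1 - k)
```

T_d equals 1 for a single-topic paragraph and decays exponentially as topics multiply, so a wandering, many-topic paragraph scores low cohesiveness.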
As shown in fig. 7, in this embodiment, the step of extracting feature values in the emotion-richness dimension for the input clause data set includes:
Step S51: compute character richness: denote all characters other than Chinese characters and English letters as a set F_1, then count how many characters of the target segment d_i are contained in the set F_1, recorded as N_f1;
Step S52: compute the richness of modal particles: set up a modal-particle set F_2 comprising common Chinese modal particles, then count how many modal particles of the target segment d_i are contained in the set F_2, recorded as N_f2;
Step S53: compute the emotion richness of the text segment as E_d = α·N_f1 + β·N_f2, where α and β are weighting factors.
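Steps S51-S53 can be sketched as follows; the particular modal-particle set, the whitespace exclusion and the unit weights are illustrative assumptions:

```python
import re

HAN = re.compile(r"[\u4e00-\u9fff]")      # Chinese characters
LATIN = re.compile(r"[A-Za-z]")           # English letters
MODAL_PARTICLES = {"啊", "哦", "啦", "吧", "呢", "哇", "嘛", "哈"}  # assumed subset

def emotion_richness(paragraph, alpha=1.0, beta=1.0):
    """E_d = alpha*N_f1 + beta*N_f2: N_f1 counts characters that are neither
    Chinese characters nor English letters (set F_1, whitespace excluded
    here as an assumption); N_f2 counts modal particles (set F_2)."""
    n_f1 = sum(1 for ch in paragraph
               if not HAN.match(ch) and not LATIN.match(ch) and not ch.isspace())
    n_f2 = sum(1 for ch in paragraph if ch in MODAL_PARTICLES)
    return alpha * n_f1 + beta * n_f2
```

Emoticons, exclamation marks and modal particles are markers of expressive human writing, so a higher E_d pushes the classifier toward a human-authored verdict.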
As shown in fig. 8, in this embodiment, the step of extracting feature values in the non-civilized-terms dimension for the input clause data set includes:
Step S61: based on the rule engine and knowledge base, perform a non-civilized-expression computation for each clause s_i of the input clause data set S_1; when any clause matches a non-civilized rule, the non-civilized value of the target segment d_i is marked as 1, otherwise 0; record the result as f_1;
Step S62: based on the rule engine and knowledge base, perform, for each clause s_i of the input clause data set S_1, a computation for smearing or mocking positive figures or events; when any clause matches the rule-engine definition, the value for the target segment d_i is marked as 1, otherwise 0; record the result as f_2;
Step S63: denote the non-civilized expression of the text segment as D_d; D_d is the logical OR of f_1 and f_2: if either f_1 or f_2 is 1, then D_d = 1, otherwise D_d = 0.
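Steps S61-S63 reduce to an OR of two any-clause rule checks; a sketch in which the two predicates stand in for the rule engine and knowledge base:

```python
def non_civilized_value(clauses, uncivil_rule, smear_rule):
    """D_d = f_1 OR f_2: f_1 is 1 if any clause matches a non-civilized-term
    rule; f_2 is 1 if any clause smears or mocks positive figures or events."""
    f1 = int(any(uncivil_rule(c) for c in clauses))
    f2 = int(any(smear_rule(c) for c in clauses))
    return 1 if (f1 or f2) else 0
```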
As shown in fig. 9, in this embodiment, the step of performing data normalization on the feature values extracted in each dimension and mapping them into the same interval (0, 1) to form the normalized feature vector X = (x_1, x_2, x_3, x_4, x_5, x_6) includes:
Step S71: obtain the feature values extracted from the input clause data set S_1 in each dimension — sentence perplexity (Ppl), sentence-presentation dispersion (QTd), knowledge span (KnowledgeSpan), topic cohesiveness (Theme), emotion richness (Emotion) and non-civilized terms (Deviation from social core values) — to form the feature vector X' = (P_d, Q_d, K_d, T_d, E_d, D_d);
Step S72: normalize the feature values extracted in each dimension using the formula x_norm = (x − x_min) / (x_max − x_min), where x_min is the minimum of the corresponding feature-value set, x_max is the maximum of the corresponding feature-value set, x is the corresponding current feature value, and x_norm is its normalized value;
Step S73: after processing, each dimension's feature value is mapped into the interval (0, 1), forming the normalized feature vector X = (x_1, x_2, x_3, x_4, x_5, x_6).
In this embodiment, the step of tuning the support vector machine classifier with the normalized feature vector X and finding the most suitable kernel function through multiple rounds of training to obtain the classification model includes:
The manually annotated data is divided in proportion into a training set, a validation set and a test set. The training set is used to train the model and determine its parameters; the validation set is used to tune hyperparameters and to monitor overfitting, and is consulted repeatedly during training; the test set is used to evaluate the generalization ability of the final model.
The support vector machine (SVM) is a binary classification algorithm whose basic model is the maximum-margin linear classifier defined in feature space.
The data for this task is not linearly separable in the original low-dimensional space, but it may be nonlinearly separable, i.e. there exists a hypersurface in feature space that separates the positive and negative classes. A nonlinear mapping can lift the problem from the original feature space into a higher-dimensional space, turning it into a linearly separable one; there, the hyperplane serving as the decision boundary is expressed as follows:
ω·φ(x) + b = 0
y(x) = sign(ω·φ(x) + b)
Since the mapping function φ(x) has a complex form, its inner product is difficult to compute directly, so the kernel method is used: a kernel function K(x, x') replaces the inner product φ(x)·φ(x') in the formulas above.
The kernel adopted for this task is the polynomial kernel, whose analytic form is K(x, x') = (x·x' + 1)^n, where n is the order of the kernel.
When the order of the polynomial kernel is 1, it is called a linear kernel, and the corresponding nonlinear classifier degenerates into a linear classifier.
In this embodiment, after the text to be identified is input into the classification model and each segment of the text is classified as human-authored or machine-authored, the classification model is further evaluated.
A model is typically evaluated along several dimensions, such as accuracy, precision, recall and F1 value. Accuracy is the proportion of correctly classified samples among all samples; precision (P) is the proportion of true positives among the samples predicted positive; recall (R) is the proportion of true positives among the actual positive samples. The F1 value is their harmonic mean: F1 = 2 / (1/P + 1/R).
The dimensions of concern in this task are precision, recall and F1 value; depending on the usage scenario, different dimensions may be emphasized.
Evaluating the generalization of the model on the test set gives the following results:

Model      Precision   Recall   F1-score
PQKTED     0.76        0.93     0.803

From these statistics, the precision is 76%, meaning that of all samples the model judged to be AI-written, 76% actually were; the recall is 93%, meaning that of the AI-written samples in the test set, 93% were recalled. The model therefore generalizes well, and the PQKTED algorithm can effectively identify whether a text was authored by AI.
According to the above method for reverse identification of AI intelligent writing, based on the characteristics of AI intelligent writing, reverse judgment of the feature values extracted in each dimension makes it possible to intelligently identify whether an article or a piece of text was created by a human or by a machine, helping people invest their time and energy selectively in works created by artificial intelligence.
It should be understood that although the steps in the flowcharts of figs. 2-9 are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in figs. 2-9 may comprise multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and whose order of execution is not necessarily sequential; they may be performed in turn or in alternation with at least part of the other steps, or of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 10, a reverse-identification AI intelligent writing apparatus 10 is provided, comprising: a segmentation processing module 1, a clause processing module 2, a feature-value extraction module 3, a data normalization processing module 4, a classification-model obtaining module 5 and a text recognition module 6.
The segmentation processing module 1 is configured to segment the text to form an input paragraph data set D = {d_1, d_2, d_3, ..., d_i, ..., d_m} composed of multiple text segments.
The clause processing module 2 is configured to initialize the input paragraph data set D and split each text segment into clauses according to punctuation marks; when the number t of clauses of the target segment d_i is greater than or equal to a first threshold, common high-frequency sentences are removed to form an input clause data set S_1 = {s_1, s_2, s_3, ..., s_i, ..., s_T} composed of multiple clauses, where T ≤ t.
The feature-value extraction module 3 is configured to extract, for the input clause data set S_1, feature values in each of the following dimensions: sentence perplexity (Ppl), sentence-presentation dispersion (QTd), knowledge span (KnowledgeSpan), topic cohesiveness (Theme), emotion richness (Emotion) and non-civilized terms (Deviation from social core values).
The data normalization processing module 4 is configured to perform data normalization on the feature values extracted in each dimension, mapping them into the same interval (0, 1) to form the normalized feature vector X = (x_1, x_2, x_3, x_4, x_5, x_6).
The classification-model obtaining module 5 is configured to tune the support vector machine classifier using the normalized feature vector X, finding the most suitable kernel function through multiple rounds of training to obtain the classification model.
The text recognition module 6 is configured to input the text to be identified into the classification model, whereby each segment of the text to be identified is classified as human-authored or machine-authored.
In this embodiment, after initializing the input paragraph data set and splitting each text segment into clauses according to punctuation marks, the method further includes: when the number of clauses of the target segment is smaller than the first threshold, no judgment is made on whether the target segment is machine-authored; otherwise, the next step is executed.
In this embodiment, the step of extracting feature values in the sentence-perplexity dimension for the input clause data set includes:
using the jieba tokenizer to perform Chinese word segmentation on each sentence of each segment d_i in the input paragraph data set D, generating a word sequence; the sequence length is denoted N, each token is denoted WD, and the current sentence is S = {WD_1, WD_2, ..., WD_N};
computing the perplexity of each sentence with a statistics-based language-model perplexity algorithm PP(S), calculated as PP(S) = P(w_1 w_2 ... w_N)^(−1/N) = (∏_{i=1..N} 1/P(w_i | w_1 w_2 ... w_{i−1}))^(1/N), where S denotes the current sentence, N is the word-sequence length, P(w_i) is the probability of the i-th word, and P(w_i | w_1 w_2 w_3 ... w_{i−1}) is the probability of the i-th word given the preceding i−1 words;
using PP(S) to compute the perplexity p_i of each clause s_i in the input clause data set S_1, generating the sentence-perplexity set P = {p_1, p_2, p_3, ..., p_T};
obtaining the maximum perplexity p_max in the sentence-perplexity set P = {p_1, p_2, p_3, ..., p_T}; if p_max is less than or equal to the second threshold, the text is judged to be machine-authored; otherwise, the next step is entered;
sorting the elements of the sentence-perplexity set P in ascending order of perplexity value and taking the elements from position 3T/4 to the end to form the set Pa = {p_(3T/4), ..., p_T}; the arithmetic mean of the set Pa is taken as the perplexity of the text, P_d = mean(Pa).
In this embodiment, the step of extracting feature values in the sentence-presentation dispersion dimension for the input clause data set includes:
for each clause s_i of the input clause data set S_1, computing the character length l_i, generating the clause-length set L_s = {l_1, l_2, ..., l_T};
computing the dispersion of the clause-length set L_s as q_d = sqrt((1/T) Σ_{i=1..T} (l_i − l̄)²), where l̄ is the average character length of all clauses in L_s;
recording the dispersion of the text segment as Q_d, with Q_d = q_d.
In this embodiment, the step of extracting feature values in the knowledge-span dimension for the input clause data set includes:
using a pre-trained knowledge classification model to classify each clause s_i of the input clause data set S_1, the corresponding classification set being C_s = {c_1, c_2, ..., c_r}, where r ≤ T;
deduplicating the elements of the classification set C_s to obtain C_d = {c_1, c_2, ..., c_R}, where R ≤ r;
computing the knowledge span of the text segment as K_d = card(C_d), where card(C_d) denotes the number of elements in the set C_d.
In this embodiment, the step of extracting feature values in the topic-cohesiveness dimension for the input clause data set includes:
using a word2vec model to compute, for each clause s_i of the input clause data set S_1, its semantic similarity with the other clauses;
judging sentences whose semantic similarity is greater than a third threshold to belong to the same topic;
counting the number k of topics in the input clause data set S_1;
computing the topic cohesiveness of the text segment as T_d = e^(1−k), where k > 0.
In this embodiment, the step of extracting feature values in the emotion-richness dimension for the input clause data set includes:
computing character richness: all characters other than Chinese characters and English letters are denoted as a set F_1, and the number of characters of the target segment d_i contained in the set F_1 is counted and recorded as N_f1;
computing the richness of modal particles: a modal-particle set F_2 comprising common Chinese modal particles is set up, and the number of modal particles of the target segment d_i contained in the set F_2 is counted and recorded as N_f2;
computing the emotion richness of the text segment as E_d = α·N_f1 + β·N_f2, where α and β are weighting factors.
In this embodiment, the step of extracting feature values in the non-civilized-terms dimension for the input clause data set includes:
performing, based on the rule engine and knowledge base, a non-civilized-expression computation for each clause s_i of the input clause data set S_1; when any clause matches a non-civilized rule, the non-civilized value of the target segment d_i is marked as 1, otherwise 0, and the result is recorded as f_1;
performing, based on the rule engine and knowledge base, for each clause s_i of the input clause data set S_1, a computation for smearing or mocking positive figures or events; when any clause matches the rule-engine definition, the value for the target segment d_i is marked as 1, otherwise 0, and the result is recorded as f_2;
denoting the non-civilized expression of the text segment as D_d, where D_d is the logical OR of f_1 and f_2: if either f_1 or f_2 is 1, then D_d = 1, otherwise D_d = 0.
In this embodiment, the data normalization processing performed on the feature values extracted in each dimension, mapping them into the same interval (0, 1) to form the normalized feature vector X = (x_1, x_2, x_3, x_4, x_5, x_6), includes:
obtaining the feature values extracted from the input clause data set S_1 in each dimension — sentence perplexity (Ppl), sentence-presentation dispersion (QTd), knowledge span (KnowledgeSpan), topic cohesiveness (Theme), emotion richness (Emotion) and non-civilized terms (Deviation from social core values) — to form the feature vector X' = (P_d, Q_d, K_d, T_d, E_d, D_d);
normalizing the feature values extracted in each dimension using the formula x_norm = (x − x_min) / (x_max − x_min), where x_min is the minimum of the corresponding feature-value set, x_max is the maximum of the corresponding feature-value set, x is the corresponding current feature value, and x_norm is its normalized value;
after processing, mapping each dimension's feature value into the interval (0, 1) to form the normalized feature vector X = (x_1, x_2, x_3, x_4, x_5, x_6).
According to the above reverse-identification AI intelligent writing apparatus, based on the characteristics of AI intelligent writing, reverse judgment of the feature values extracted in each dimension makes it possible to intelligently identify whether an article or a piece of text was created by a human or by a machine, helping people invest their time and energy selectively in works created by artificial intelligence.
For specific limitations of the reverse-identification AI intelligent writing apparatus, reference may be made to the limitations of the method for reverse identification of AI intelligent writing above, which are not repeated here. Each module of the reverse-identification AI intelligent writing apparatus may be implemented wholly or partly in software, hardware or a combination thereof. The modules may be embedded, in hardware form, in or independent of a processor of the computer device, or stored, in software form, in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 11. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing reverse identification AI intelligent writing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of reverse identification AI smart writing.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
segmenting the text to form an input paragraph data set D = {d_1, d_2, d_3, ..., d_i, ..., d_m} composed of multiple text segments;
initializing the input paragraph data set D and splitting each text segment into clauses according to punctuation marks; when the number t of clauses of the target segment d_i is greater than or equal to a first threshold, removing common high-frequency sentences to form an input clause data set S_1 = {s_1, s_2, s_3, ..., s_i, ..., s_T} composed of multiple clauses, where T ≤ t;
for the input clause data set S_1, extracting feature values in each of the following dimensions: sentence perplexity (Ppl), sentence-presentation dispersion (QTd), knowledge span (KnowledgeSpan), topic cohesiveness (Theme), emotion richness (Emotion) and non-civilized terms (Deviation from social core values);
performing data normalization on the feature values extracted in each dimension, mapping them into the same interval (0, 1) to form the normalized feature vector X = (x_1, x_2, x_3, x_4, x_5, x_6);
tuning the support vector machine classifier using the normalized feature vector X, finding the most suitable kernel function through multiple rounds of training to obtain a classification model;
inputting the text to be identified into the classification model, whereby each segment of the text to be identified is classified as human-authored or machine-authored.
In this embodiment, after initializing the input paragraph data set and splitting each text segment into clauses according to punctuation marks, the method further includes: when the number of clauses of the target segment is smaller than the first threshold, no judgment is made on whether the target segment is machine-authored; otherwise, the next step is executed.
In this embodiment, the step of extracting feature values in the sentence-perplexity dimension for the input clause data set includes:
using the jieba tokenizer to perform Chinese word segmentation on each sentence of each segment d_i in the input paragraph data set D, generating a word sequence; the sequence length is denoted N, each token is denoted WD, and the current sentence is S = {WD_1, WD_2, ..., WD_N};
computing the perplexity of each sentence with a statistics-based language-model perplexity algorithm PP(S), calculated as PP(S) = P(w_1 w_2 ... w_N)^(−1/N) = (∏_{i=1..N} 1/P(w_i | w_1 w_2 ... w_{i−1}))^(1/N), where S denotes the current sentence, N is the word-sequence length, P(w_i) is the probability of the i-th word, and P(w_i | w_1 w_2 w_3 ... w_{i−1}) is the probability of the i-th word given the preceding i−1 words;
using PP(S) to compute the perplexity p_i of each clause s_i in the input clause data set S_1, generating the sentence-perplexity set P = {p_1, p_2, p_3, ..., p_T};
obtaining the maximum perplexity p_max in the sentence-perplexity set P = {p_1, p_2, p_3, ..., p_T}; if p_max is less than or equal to the second threshold, the text is judged to be machine-authored; otherwise, the next step is entered;
sorting the elements of the sentence-perplexity set P in ascending order of perplexity value and taking the elements from position 3T/4 to the end to form the set Pa = {p_(3T/4), ..., p_T}; the arithmetic mean of the set Pa is taken as the perplexity of the text, P_d = mean(Pa).
In this embodiment, the step of extracting feature values in the sentence-presentation dispersion dimension for the input clause data set includes:
for each clause s_i of the input clause data set S_1, computing the character length l_i, generating the clause-length set L_s = {l_1, l_2, ..., l_T};
computing the dispersion of the clause-length set L_s as q_d = sqrt((1/T) Σ_{i=1..T} (l_i − l̄)²), where l̄ is the average character length of all clauses in L_s;
recording the dispersion of the text segment as Q_d, with Q_d = q_d.
In this embodiment, the step of extracting feature values in the knowledge-span dimension for the input clause data set includes:
using a pre-trained knowledge classification model to classify each clause s_i of the input clause data set S_1, the corresponding classification set being C_s = {c_1, c_2, ..., c_r}, where r ≤ T;
deduplicating the elements of the classification set C_s to obtain C_d = {c_1, c_2, ..., c_R}, where R ≤ r;
computing the knowledge span of the text segment as K_d = card(C_d), where card(C_d) denotes the number of elements in the set C_d.
In this embodiment, the step of extracting a feature value from the topic cohesiveness dimension for the input clause dataset includes:
using a word2vec model to compute, for each clause s_i of the input clause dataset S_1, its semantic similarity to the other clauses;
grouping clauses whose semantic similarity exceeds a third threshold into the same topic;
counting the number k of topics in the input clause dataset S_1;
calculating the topic cohesiveness of the text paragraph as T_d = e^(1-k), where k > 0.
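One way to read the grouping rule is as connected components over a clause-similarity graph; the sketch below assumes a precomputed pairwise similarity matrix (for example, cosine similarity of averaged word2vec vectors) and shows the T_d = e^(1-k) mapping:

```python
import math

def count_topics(sim, threshold):
    # Group clauses whose pairwise similarity exceeds the threshold into
    # one topic: connected components over the similarity graph (an
    # assumed reading of the patent's grouping rule), via union-find.
    n = len(sim)
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] > threshold:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

def topic_cohesiveness(k):
    # T_d = e^(1 - k): 1.0 for a single topic, decaying toward 0 as the
    # number of topics k grows.
    return math.exp(1 - k)
```

A paragraph that wanders across many topics thus scores a small T_d, while a single-topic paragraph scores exactly 1.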
In this embodiment, the step of extracting a feature value from the emotion richness dimension for the input clause dataset includes:
calculating character richness: denoting the set of all characters other than Chinese characters and English letters as F_1, counting the characters of the target paragraph d_i that belong to F_1, and recording the count as N_f1;
calculating modal-particle richness: defining a modal-particle set F_2 of common Chinese sentence-final particles such as 啊, 哦, 啦, 哇, 吧, 嘛, 哈 and 呢, counting the modal particles of the target paragraph d_i that belong to F_2, and recording the count as N_f2;
calculating the emotion richness of the text paragraph as E_d = αN_f1 + βN_f2, where α and β are weighting factors.
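A sketch of the two counts, assuming F_1 means "neither a CJK ideograph nor an ASCII letter" and using an illustrative particle list and weights; the patent's exact F_2 membership and α, β values are not fixed here:

```python
def is_hanzi(ch):
    # Basic CJK Unified Ideographs block (an approximation of "Chinese
    # character" for this sketch).
    return '\u4e00' <= ch <= '\u9fff'

def emotion_richness(text, particles=("啊", "哦", "啦", "哇", "吧", "嘛", "哈", "呢"),
                     alpha=0.5, beta=0.5):
    # N_f1: characters belonging to F_1, i.e. neither Chinese characters
    # nor English letters (emoticons, exclamation marks, and so on).
    n_f1 = sum(1 for ch in text
               if not is_hanzi(ch) and not (ch.isascii() and ch.isalpha()))
    # N_f2: modal particles belonging to F_2.
    n_f2 = sum(1 for ch in text if ch in particles)
    # E_d = alpha * N_f1 + beta * N_f2.
    return alpha * n_f1 + beta * n_f2
```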
In this embodiment, the step of extracting a feature value from the non-civilized expression dimension for the input clause dataset includes:
using a rule engine and a knowledge base to test each clause s_i of the input clause dataset S_1 against non-civilized-language rules; if any clause matches a rule, the non-civilized value of the target paragraph d_i is recorded as f_1 = 1, otherwise f_1 = 0;
using the rule engine and knowledge base to test each clause s_i of the input clause dataset S_1 for smearing or misquoting of positive figures or events; if any clause matches a rule-engine definition, the value is recorded as f_2 = 1, otherwise f_2 = 0;
recording the non-civilized expression of the text paragraph as D_d, the logical OR of f_1 and f_2: if either f_1 or f_2 is 1, then D_d = 1, otherwise D_d = 0.
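With the rule engine reduced to placeholder substring rules (the actual engine and knowledge base are external systems this sketch does not model), the D_d flag is simply a logical OR of the two per-paragraph indicators f_1 and f_2:

```python
def uncivil_flag(clauses, uncivil_rules, smear_rules):
    # f_1: 1 if any clause matches an uncivil-language rule.
    f1 = int(any(r in c for c in clauses for r in uncivil_rules))
    # f_2: 1 if any clause matches a smear/misquote rule.
    f2 = int(any(r in c for c in clauses for r in smear_rules))
    # D_d = f_1 OR f_2.
    return 1 if (f1 or f2) else 0
```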
In this embodiment, performing data normalization on the feature values extracted in each dimension and mapping them into the common interval (0, 1) to form the normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6) includes:
collecting the feature values extracted from the input clause dataset S_1 in the dimensions of sentence perplexity (Ppl), sentence-length dispersion (QTd), knowledge span (KnowledgeSpan), topic cohesiveness (Theme), emotion richness (Emotion) and non-civilized expression (deviation from social core values) to form the feature vector X' = (P_d, Q_d, K_d, T_d, E_d, D_d);
normalizing each feature value with the formula x_norm = (x - x_min) / (x_max - x_min), where x_min is the minimum value in the corresponding feature set, x_max is the maximum value, x is the current feature value, and x_norm is its normalized value;
after processing, each dimension's feature value is mapped into the interval (0, 1), forming the normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6).
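Min-max scaling applied per feature dimension; note that the extreme values themselves map to exactly 0 and 1, so "interval (0, 1)" holds strictly only for non-extreme samples, and a constant dimension (x_max = x_min) needs separate handling:

```python
def min_max_normalize(values):
    # x_norm = (x - x_min) / (x_max - x_min) over one feature dimension.
    # Assumes the dimension is not constant (x_max > x_min).
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```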
For specific limitations on the steps implemented when the processor executes the computer program, reference may be made to the above description of the method for reverse identification of AI intelligent writing, which is not repeated here.
In one embodiment, a computer-readable storage medium is provided on which a computer program is stored; when executed by a processor, the program performs the following steps:
segmenting the text to form an input paragraph dataset D = {d_1, d_2, d_3, ..., d_i, ..., d_m} composed of multiple paragraphs;
initializing the input paragraph dataset D and splitting each paragraph into clauses according to punctuation marks; when the number t of clauses of a target paragraph d_i is greater than or equal to a first threshold, removing common high-frequency sentences to form an input clause dataset S_1 = {s_1, s_2, s_3, ..., s_i, ..., s_T}, where T ≤ t;
extracting feature values from the input clause dataset S_1 in each of the dimensions of sentence perplexity (Ppl), sentence-length dispersion (QTd), knowledge span (KnowledgeSpan), topic cohesiveness (Theme), emotion richness (Emotion) and non-civilized expression (deviation from social core values);
performing data normalization on the feature values extracted in each dimension and mapping them into the common interval (0, 1) to form a normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6);
optimizing a support vector machine classifier with the normalized feature group X and finding the most suitable kernel function through multiple rounds of training to obtain a classification model;
inputting the text to be identified into the classification model and classifying each paragraph of the text as human-authored or machine-authored.
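The kernel-selection step can be sketched with scikit-learn's GridSearchCV over SVC kernels; the feature rows, labels, and grid values below are synthetic stand-ins for the normalized six-dimensional feature groups, not data from the patent:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_detector(X, y):
    # Cross-validated search over candidate kernels and C values,
    # mirroring the "multiple rounds of training" kernel selection.
    grid = GridSearchCV(SVC(), {"kernel": ["linear", "rbf", "poly"],
                                "C": [0.1, 1.0, 10.0]}, cv=3)
    grid.fit(X, y)
    return grid.best_estimator_

# Toy normalized 6-dimensional feature rows: label 1 = machine, 0 = human.
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(0.0, 0.4, (30, 6)),   # "machine-like" cluster
               rng.uniform(0.6, 1.0, (30, 6))])  # "human-like" cluster
y = np.array([1] * 30 + [0] * 30)
model = train_detector(X, y)
```

After fitting, `grid.best_params_` reports which kernel and C value survived the cross-validation.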
In this embodiment, after initializing the input paragraph dataset and splitting each paragraph into clauses according to punctuation marks, the method further includes: when the number of clauses of the target paragraph is smaller than the first threshold, making no judgment on whether the target paragraph is machine-created; otherwise, executing the next step.
In this embodiment, the step of extracting a feature value from the sentence perplexity dimension for the input clause dataset includes:
using the jieba Chinese word segmenter to tokenize each sentence of each text d_i in the input paragraph dataset D, generating a word sequence; the length of the word sequence is recorded as N, a token as WD, and the current sentence as S = {WD_1, WD_2, ..., WD_N};
calculating the perplexity of each sentence with a statistical language-model perplexity measure PP(S), computed as PP(S) = P(w_1 w_2 ... w_N)^(-1/N) = (∏_{i=1..N} 1 / P(w_i | w_1 w_2 ... w_{i-1}))^(1/N), where S is the current sentence, N is the word-sequence length, P(w_i) is the probability of the i-th word, and P(w_i | w_1 w_2 ... w_{i-1}) is the probability of the i-th word given the previous i-1 words;
calculating with PP(S) the perplexity p_i of each sentence s_i of the input clause dataset S_1, generating the sentence perplexity set P = {p_1, p_2, p_3, ..., p_T};
obtaining the maximum perplexity p_max in the sentence perplexity set P; if p_max is less than or equal to the second threshold, judging that the text is machine-created; otherwise, entering the next step;
sorting the elements of the sentence perplexity set P in ascending order of perplexity and taking the elements from position 3T/4 to the end to form the set Pa = {p_{3T/4}, ..., p_T}; calculating the arithmetic mean of the set Pa and recording the perplexity P_d of the text paragraph as that mean.
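The perplexity step can be sketched as follows; `prob` is a placeholder for whatever statistical language model supplies P(w_i | w_1 ... w_{i-1}), and the top-quartile averaging follows the 3T/4 rule above:

```python
import math

def sentence_perplexity(tokens, prob):
    # PP(S) = (prod over i of 1 / P(w_i | history))^(1/N), computed in
    # log space for numerical stability. `prob(history, word)` is an
    # assumed language-model interface.
    n = len(tokens)
    log_sum = sum(math.log(prob(tokens[:i], tokens[i])) for i in range(n))
    return math.exp(-log_sum / n)

def paragraph_perplexity_feature(perplexities, second_threshold):
    # p_max <= threshold: judge the paragraph machine-created.
    if max(perplexities) <= second_threshold:
        return "machine", None
    # Otherwise P_d is the mean of the ascending-sorted tail from
    # position 3T/4 onward (the highest-perplexity quartile).
    p = sorted(perplexities)
    tail = p[(3 * len(p)) // 4:]
    return "continue", sum(tail) / len(tail)
```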
In this embodiment, the step of extracting a feature value from the sentence-length dispersion dimension for the input clause dataset includes:
for each clause s_i of the input clause dataset S_1, calculating its character length l_i to generate the clause length set L_s = {l_1, l_2, ..., l_T};
calculating the dispersion of the clause length set L_s as q_d = sqrt((1/T) * Σ_{i=1..T} (l_i - l_avg)^2), where l_avg is the average character length over all clauses in L_s;
recording the dispersion of the text paragraph as Q_d = q_d.
In this embodiment, the step of extracting a feature value from the knowledge span dimension for the input clause dataset includes:
using a pre-trained knowledge classification model to classify each clause s_i of the input clause dataset S_1, yielding the category set C_s = {c_1, c_2, ..., c_r}, where r ≤ T;
de-duplicating the elements of the category set C_s to obtain C_d = {c_1, c_2, ..., c_R}, where R ≤ r;
calculating the knowledge span of the text paragraph as K_d = card(C_d), where card(C_d) is the number of elements in the set C_d.
In this embodiment, the step of extracting a feature value from the topic cohesiveness dimension for the input clause dataset includes:
using a word2vec model to compute, for each clause s_i of the input clause dataset S_1, its semantic similarity to the other clauses;
grouping clauses whose semantic similarity exceeds a third threshold into the same topic;
counting the number k of topics in the input clause dataset S_1;
calculating the topic cohesiveness of the text paragraph as T_d = e^(1-k), where k > 0.
In this embodiment, the step of extracting a feature value from the emotion richness dimension for the input clause dataset includes:
calculating character richness: denoting the set of all characters other than Chinese characters and English letters as F_1, counting the characters of the target paragraph d_i that belong to F_1, and recording the count as N_f1;
calculating modal-particle richness: defining a modal-particle set F_2 of common Chinese sentence-final particles such as 啊, 哦, 啦, 哇, 吧, 嘛, 哈 and 呢, counting the modal particles of the target paragraph d_i that belong to F_2, and recording the count as N_f2;
calculating the emotion richness of the text paragraph as E_d = αN_f1 + βN_f2, where α and β are weighting factors.
The step of extracting a feature value from the non-civilized expression dimension for the input clause dataset includes:
using a rule engine and a knowledge base to test each clause s_i of the input clause dataset S_1 against non-civilized-language rules; if any clause matches a rule, the non-civilized value of the target paragraph d_i is recorded as f_1 = 1, otherwise f_1 = 0;
using the rule engine and knowledge base to test each clause s_i of the input clause dataset S_1 for smearing or misquoting of positive figures or events; if any clause matches a rule-engine definition, the value is recorded as f_2 = 1, otherwise f_2 = 0;
recording the non-civilized expression of the text paragraph as D_d, the logical OR of f_1 and f_2: if either f_1 or f_2 is 1, then D_d = 1, otherwise D_d = 0.
In this embodiment, performing data normalization on the feature values extracted in each dimension and mapping them into the common interval (0, 1) to form the normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6) includes:
collecting the feature values extracted from the input clause dataset S_1 in the dimensions of sentence perplexity (Ppl), sentence-length dispersion (QTd), knowledge span (KnowledgeSpan), topic cohesiveness (Theme), emotion richness (Emotion) and non-civilized expression (deviation from social core values) to form the feature vector X' = (P_d, Q_d, K_d, T_d, E_d, D_d);
normalizing each feature value with the formula x_norm = (x - x_min) / (x_max - x_min), where x_min is the minimum value in the corresponding feature set, x_max is the maximum value, x is the current feature value, and x_norm is its normalized value;
after processing, each dimension's feature value is mapped into the interval (0, 1), forming the normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6).
For specific limitations on the steps implemented by the computer program when executed by the processor, reference may be made to the above description of the method for reverse identification of AI intelligent writing, which is not repeated here.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may perform the steps of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any combination of them that contains no contradiction should be considered within the scope of this description.
The above examples merely represent a few embodiments of the present application; although described in some detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art may make various modifications and improvements without departing from the concept of the present application, and such modifications fall within its scope of protection. Accordingly, the scope of protection of the present application is determined by the appended claims.

Claims (10)

1. A method for reverse identification of AI intelligent writing, comprising:
segmenting the text to form an input paragraph dataset D = {d_1, d_2, d_3, ..., d_i, ..., d_m} composed of multiple paragraphs;
initializing the input paragraph dataset D and splitting each paragraph into clauses according to punctuation marks; when the number t of clauses of a target paragraph d_i is greater than or equal to a first threshold, removing common high-frequency sentences to form an input clause dataset S_1 = {s_1, s_2, s_3, ..., s_i, ..., s_T}, where T ≤ t;
extracting feature values from the input clause dataset S_1 in each of the dimensions of sentence perplexity, sentence-length dispersion, knowledge span, topic cohesiveness, emotion richness and non-civilized expression;
performing data normalization on the feature values extracted in each dimension and mapping them into the common interval (0, 1) to form a normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6);
optimizing a support vector machine classifier with the normalized feature group X and finding the most suitable kernel function through multiple rounds of training to obtain a classification model;
inputting the text to be identified into the classification model and classifying each paragraph of the text as human-authored or machine-authored;
wherein the step of extracting a feature value from the sentence perplexity dimension for the input clause dataset comprises:
using the jieba Chinese word segmenter to tokenize each sentence of each text d_i in the input paragraph dataset D, generating a word sequence; the length of the word sequence is recorded as N, a token as WD, and the current sentence as S = {WD_1, WD_2, ..., WD_N};
calculating the perplexity of each sentence with a statistical language-model perplexity measure PP(S), computed as PP(S) = P(w_1 w_2 ... w_N)^(-1/N) = (∏_{i=1..N} 1 / P(w_i | w_1 w_2 ... w_{i-1}))^(1/N), where S is the current sentence, N is the word-sequence length, P(w_i) is the probability of the i-th word, and P(w_i | w_1 w_2 ... w_{i-1}) is the probability of the i-th word given the previous i-1 words;
calculating with PP(S) the perplexity p_i of each sentence s_i of the input clause dataset S_1, generating the sentence perplexity set P = {p_1, p_2, p_3, ..., p_T};
obtaining the maximum perplexity p_max in the sentence perplexity set P; if p_max is less than or equal to the second threshold, judging that the text is machine-created; otherwise, entering the next step;
sorting the elements of the sentence perplexity set P in ascending order of perplexity and taking the elements from position 3T/4 to the end to form the set P_a = {p_{3T/4}, ..., p_T}; calculating the arithmetic mean of the set P_a and recording the perplexity P_d of the text paragraph as that mean;
wherein the step of extracting a feature value from the sentence-length dispersion dimension for the input clause dataset comprises: for each clause s_i of the input clause dataset S_1, calculating its character length l_i to generate the clause length set L_s = {l_1, l_2, ..., l_T};
calculating the dispersion of the clause length set L_s as q_d = sqrt((1/T) * Σ_{i=1..T} (l_i - l_avg)^2), where l_avg is the average character length over all clauses in L_s;
recording the dispersion of the text paragraph as Q_d = q_d.
2. The method for reverse identification of AI intelligent writing according to claim 1, further comprising, after the initializing of the input paragraph dataset and the splitting of each paragraph into clauses according to punctuation marks: when the number of clauses of the target paragraph is smaller than the first threshold, making no judgment on whether the target paragraph is machine-created; otherwise, executing the next step.
3. The method for reverse identification of AI intelligent writing according to claim 1, wherein extracting a feature value from the knowledge span dimension for the input clause dataset comprises:
using a pre-trained knowledge classification model to classify each clause s_i of the input clause dataset S_1, yielding the category set C_s = {c_1, c_2, ..., c_r}, where r ≤ T;
de-duplicating the elements of the category set C_s to obtain C_d = {c_1, c_2, ..., c_R}, where R ≤ r;
calculating the knowledge span of the text paragraph as K_d = card(C_d), where card(C_d) is the number of elements in the set C_d.
4. The method for reverse identification of AI intelligent writing according to claim 1, wherein extracting a feature value from the topic cohesiveness dimension for the input clause dataset comprises:
using a word2vec model to compute, for each clause s_i of the input clause dataset S_1, its semantic similarity to the other clauses;
grouping clauses whose semantic similarity exceeds a third threshold into the same topic;
counting the number k of topics in the input clause dataset S_1;
calculating the topic cohesiveness of the text paragraph as T_d = e^(1-k), where k > 0.
5. The method for reverse identification of AI intelligent writing according to claim 1, wherein extracting a feature value from the emotion richness dimension for the input clause dataset comprises:
calculating character richness: denoting the set of all characters other than Chinese characters and English letters as F_1, counting the characters of the target paragraph d_i that belong to F_1, and recording the count as N_f1;
calculating modal-particle richness: defining a modal-particle set F_2 of common Chinese sentence-final particles such as 啊, 哦, 啦, 哇, 吧, 嘛, 哈 and 呢, counting the modal particles of the target paragraph d_i that belong to F_2, and recording the count as N_f2;
calculating the emotion richness of the text paragraph as E_d = αN_f1 + βN_f2, where α and β are weighting factors.
6. The method for reverse identification of AI intelligent writing according to claim 1, wherein extracting a feature value from the non-civilized expression dimension for the input clause dataset comprises:
using a rule engine and a knowledge base to test each clause s_i of the input clause dataset S_1 against non-civilized-language rules; if any clause matches a rule, the non-civilized value of the target paragraph d_i is recorded as f_1 = 1, otherwise f_1 = 0;
using the rule engine and knowledge base to test each clause s_i of the input clause dataset S_1 for smearing or misquoting of positive figures or events; if any clause matches a rule-engine definition, the value is recorded as f_2 = 1, otherwise f_2 = 0;
recording the non-civilized expression of the text paragraph as D_d, the logical OR of f_1 and f_2: if either f_1 or f_2 is 1, then D_d = 1, otherwise D_d = 0.
7. The method for reverse identification of AI intelligent writing according to claim 1, wherein performing data normalization on the feature values extracted in each dimension and mapping them into the common interval (0, 1) to form the normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6) comprises:
collecting the feature values extracted from the input clause dataset S_1 in the dimensions of sentence perplexity, sentence-length dispersion, knowledge span, topic cohesiveness, emotion richness and non-civilized expression to form the feature vector X' = (P_d, Q_d, K_d, T_d, E_d, D_d);
normalizing each feature value with the formula x_norm = (x - x_min) / (x_max - x_min), where x_min is the minimum value in the corresponding feature set, x_max is the maximum value, x is the current feature value, and x_norm is its normalized value;
after processing, each dimension's feature value is mapped into the interval (0, 1), forming the normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6).
8. A device for reverse identification of AI intelligent writing, comprising:
a segmentation processing module, configured to segment the text to form an input paragraph dataset D = {d_1, d_2, d_3, ..., d_i, ..., d_m} composed of multiple paragraphs;
a clause processing module, configured to initialize the input paragraph dataset D and split each paragraph into clauses according to punctuation marks; when the number t of clauses of a target paragraph d_i is greater than or equal to a first threshold, common high-frequency sentences are removed to form an input clause dataset S_1 = {s_1, s_2, s_3, ..., s_i, ..., s_T}, where T ≤ t;
a feature value extraction module, configured to extract feature values from the input clause dataset S_1 in each of the dimensions of sentence perplexity, sentence-length dispersion, knowledge span, topic cohesiveness, emotion richness and non-civilized expression; wherein the step of extracting a feature value from the sentence perplexity dimension for the input clause dataset comprises: using the jieba Chinese word segmenter to tokenize each sentence of each text d_i in the input paragraph dataset D, generating a word sequence, the length of the word sequence being recorded as N, a token as WD, and the current sentence as S = {WD_1, WD_2, ..., WD_N}; calculating the perplexity of each sentence with a statistical language-model perplexity measure PP(S), computed as PP(S) = P(w_1 w_2 ... w_N)^(-1/N) = (∏_{i=1..N} 1 / P(w_i | w_1 w_2 ... w_{i-1}))^(1/N), where S is the current sentence, N is the word-sequence length, P(w_i) is the probability of the i-th word, and P(w_i | w_1 w_2 ... w_{i-1}) is the probability of the i-th word given the previous i-1 words; calculating with PP(S) the perplexity p_i of each sentence s_i of the input clause dataset S_1, generating the sentence perplexity set P = {p_1, p_2, p_3, ..., p_T}; obtaining the maximum perplexity p_max in the set P, and if p_max is less than or equal to the second threshold, judging that the text is machine-created, otherwise entering the next step; sorting the elements of P in ascending order of perplexity and taking the elements from position 3T/4 to the end to form the set P_a = {p_{3T/4}, ..., p_T}; calculating the arithmetic mean of the set P_a and recording the perplexity P_d of the text paragraph as that mean; and wherein the step of extracting a feature value from the sentence-length dispersion dimension for the input clause dataset comprises: for each clause s_i of the input clause dataset S_1, calculating its character length l_i to generate the clause length set L_s = {l_1, l_2, ..., l_T}; calculating the dispersion of the clause length set L_s as q_d = sqrt((1/T) * Σ_{i=1..T} (l_i - l_avg)^2), where l_avg is the average character length over all clauses in L_s; and recording the dispersion of the text paragraph as Q_d = q_d;
a data normalization processing module, configured to perform data normalization on the feature values extracted in each dimension and map them into the common interval (0, 1) to form a normalized feature group X = (x_1, x_2, x_3, x_4, x_5, x_6);
a classification model acquisition module, configured to optimize a support vector machine classifier with the normalized feature group X and find the most suitable kernel function through multiple rounds of training to obtain a classification model; and
a text recognition module, configured to input the text to be identified into the classification model and classify each paragraph of the text as human-authored or machine-authored.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN202310108714.3A 2023-02-14 2023-02-14 Method, device, equipment and medium for reverse identification AI intelligent writing Active CN116384388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310108714.3A CN116384388B (en) 2023-02-14 2023-02-14 Method, device, equipment and medium for reverse identification AI intelligent writing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310108714.3A CN116384388B (en) 2023-02-14 2023-02-14 Method, device, equipment and medium for reverse identification AI intelligent writing

Publications (2)

Publication Number Publication Date
CN116384388A CN116384388A (en) 2023-07-04
CN116384388B true CN116384388B (en) 2024-02-02

Family

ID=86973883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310108714.3A Active CN116384388B (en) 2023-02-14 2023-02-14 Method, device, equipment and medium for reverse identification AI intelligent writing

Country Status (1)

Country Link
CN (1) CN116384388B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117113977B (en) * 2023-10-09 2024-04-16 北京信诺软通信息技术有限公司 Method, medium and system for identifying text generated by AI contained in test paper

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107153664A (en) * 2016-03-04 2017-09-12 同方知网(北京)技术有限公司 A kind of method flow that research conclusion is simplified based on the scientific and technical literature mark that assemblage characteristic is weighted
CN109992769A (en) * 2018-12-06 2019-07-09 平安科技(深圳)有限公司 Sentence reasonability judgment method, device, computer equipment based on semanteme parsing
CN111737475A (en) * 2020-07-21 2020-10-02 南京擎盾信息科技有限公司 Unsupervised network public opinion spam long text recognition method
CN112580351A (en) * 2020-12-31 2021-03-30 成都信息工程大学 Machine-generated text detection method based on self-information loss compensation
CN115081437A (en) * 2022-07-20 2022-09-20 中国电子科技集团公司第三十研究所 Machine-generated text detection method and system based on linguistic feature contrast learning
CN115630640A (en) * 2022-12-23 2023-01-20 苏州浪潮智能科技有限公司 Intelligent writing method, device, equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009049262A1 (en) * 2007-10-11 2009-04-16 Honda Motor Co., Ltd. Text categorization with knowledge transfer from heterogeneous datasets

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107153664A (en) * 2016-03-04 2017-09-12 同方知网(北京)技术有限公司 A kind of method flow that research conclusion is simplified based on the scientific and technical literature mark that assemblage characteristic is weighted
CN109992769A (en) * 2018-12-06 2019-07-09 平安科技(深圳)有限公司 Sentence reasonability judgment method, device, computer equipment based on semanteme parsing
CN111737475A (en) * 2020-07-21 2020-10-02 南京擎盾信息科技有限公司 Unsupervised network public opinion spam long text recognition method
CN112580351A (en) * 2020-12-31 2021-03-30 成都信息工程大学 Machine-generated text detection method based on self-information loss compensation
CN115081437A (en) * 2022-07-20 2022-09-20 中国电子科技集团公司第三十研究所 Machine-generated text detection method and system based on linguistic feature contrast learning
CN115630640A (en) * 2022-12-23 2023-01-20 苏州浪潮智能科技有限公司 Intelligent writing method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on the application of topic identification oriented to word weighting; Ruan Guangce et al.; Information Studies: Theory & Application; Vol. 42, No. 12; pp. 144-149 *

Also Published As

Publication number Publication date
CN116384388A (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN112163426B (en) Relationship extraction method based on combination of attention mechanism and graph long-time memory neural network
CN108399158B (en) Attribute emotion classification method based on dependency tree and attention mechanism
Boenninghoff et al. Similarity learning for authorship verification in social media
CN111914558A (en) Course knowledge relation extraction method and system based on sentence bag attention remote supervision
CN110263325B (en) Chinese word segmentation system
CN106991085B (en) Entity abbreviation generation method and device
CN110502742B (en) Complex entity extraction method, device, medium and system
CN112711953A (en) Text multi-label classification method and system based on attention mechanism and GCN
CN110097096B (en) Text classification method based on TF-IDF matrix and capsule network
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN109446423B (en) System and method for judging sentiment of news and texts
CN116384388B (en) Method, device, equipment and medium for reverse identification AI intelligent writing
CN112417153A (en) Text classification method and device, terminal equipment and readable storage medium
CN112100377A (en) Text classification method and device, computer equipment and storage medium
CN111507093A (en) Text attack method and device based on similar dictionary and storage medium
CN112380346B (en) Financial news emotion analysis method and device, computer equipment and storage medium
CN111680132B (en) Noise filtering and automatic classifying method for Internet text information
CN111611796A (en) Hypernym determination method and device for hyponym, electronic device and storage medium
CN116257601A (en) Illegal word stock construction method and system based on deep learning
CN116245107A (en) Electric power audit text entity identification method, device, equipment and storage medium
CN111191033A (en) Open set classification method based on classification utility
CN116089605A (en) Text emotion analysis method based on transfer learning and improved word bag model
CN115599915A (en) Long text classification method based on TextRank and attention mechanism
CN114648029A (en) Electric power field named entity identification method based on BiLSTM-CRF model
CN111199154B (en) Fault-tolerant rough set-based polysemous word expression method, system and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240111

Address after: Room W-3984, No. 559 Yueluo Road, Baoshan District, Shanghai, 2019

Applicant after: Shanghai Xijin Information Technology Co.,Ltd.

Address before: E1-031, artificial intelligence Industrial Park, No. 88, Jinjihu Avenue, Suzhou Industrial Park, Jiangsu 215000

Applicant before: Suzhou Xinsi Technology Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant