CN111144709B - Method and device for determining novelty of machine-generated text

Info

Publication number: CN111144709B
Application number: CN201911244272.5A
Authority: CN
Inventors: 张熙; 靳凯夫; 李小勇; 方滨兴
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2023-04-18
Anticipated expiration: 2039-12-06
Also published as: CN111144709A

Abstract

The embodiment of the invention provides a method and a device for determining novelty of a machine-generated text, wherein the method comprises the following steps: acquiring a machine-generated text and a plurality of reference texts corresponding to the machine-generated text; determining an overlapping factor of the machine-generated text according to the words included in the machine-generated text and the words included in the plurality of reference texts; determining a repeated penalty factor of the machine-generated text according to the short sentence included in the machine-generated text; determining a length penalty factor of the machine-generated text according to the text length of the machine-generated text, the average text length of the plurality of reference texts and the minimum text length of the plurality of reference texts; and determining the novelty of the machine-generated text according to the overlapping factor, the repetition penalty factor and the length penalty factor of the machine-generated text. The overlapping degree of the machine-generated text and the reference text, the repetition degree of the machine-generated text and the length factors of the machine-generated text and the reference text are comprehensively considered, and the novelty of the machine-generated text is more effectively measured.

Description

Method and device for determining novelty of machine-generated text

Technical Field

The invention relates to the technical field of machine learning, in particular to a method and a device for determining novelty of a machine-generated text.

Background

With the development of artificial intelligence technology, the quality requirements of some natural language generation tasks on machine-generated texts are continuously improved. For example, in the fields of machine translation, human-computer conversation, and the like, higher quality requirements are placed on machine-generated texts.

The criteria for measuring the quality of machine-generated text mainly cover three aspects: relevance, language quality and novelty. Relevance represents how closely the machine-generated text matches the reference text, such as how closely a machine translation result matches an expert translation result in a machine translation task; language quality represents how well the machine-generated text conforms to the norms of sentence structure and grammar; novelty expresses how distinct the machine-generated text is from the reference text or from other machine-generated text.

At present, well-performing methods exist for determining the relevance and language quality of a machine-generated text, but no method exists for determining its novelty. Consequently, the novelty of the machine-generated text cannot be accurately determined, and so the quality of the machine-generated text cannot be accurately measured.

Disclosure of Invention

The embodiment of the invention aims to provide a method and a device for determining the novelty of a machine-generated text, so as to more accurately determine the novelty of the machine-generated text and further accurately measure the quality of the machine-generated text. The specific technical scheme is as follows:

in order to achieve the above object, an embodiment of the present invention provides a method for determining novelty of a machine-generated text, where the method includes:

acquiring a machine-generated text and a plurality of reference texts corresponding to the machine-generated text;

determining an overlapping factor of the machine-generated text according to words included in the machine-generated text and words included in the multiple reference texts, wherein the words are words obtained by segmenting the text according to a preset segmentation length;

determining a repetition penalty factor of the machine-generated text according to the short sentences included in the machine-generated text, wherein a short sentence is a sentence obtained by segmenting the text according to a preset separator;

determining a length penalty factor of the machine-generated text according to the text length of the machine-generated text, the average text length of the plurality of reference texts and the minimum text length of the plurality of reference texts;

and determining the novelty of the machine-generated text according to the overlapping factor, the repetition penalty factor and the length penalty factor of the machine-generated text.

Optionally, the step of determining an overlap factor of the machine-generated text according to the words included in the machine-generated text and the words included in the multiple reference texts includes:

for each preset segmentation length, determining an overlap factor corresponding to the preset segmentation length according to words corresponding to the preset segmentation length included in the machine-generated text and words corresponding to the preset segmentation length included in the multiple reference texts, wherein the words corresponding to the preset segmentation length are words obtained by segmenting the text according to the preset segmentation length;

and carrying out weighted summation on the overlapping factors corresponding to each preset segmentation length based on the preset weight of each preset segmentation length to obtain the overlapping factors of the machine-generated text.

Optionally, there are multiple machine-generated texts;

the step of determining the overlap factor corresponding to each preset segmentation length according to the word corresponding to the preset segmentation length included in the machine-generated text and the word corresponding to the preset segmentation length included in the multiple reference texts includes:

counting a first number of words corresponding to each preset segmentation length included in each machine-generated text and a second number of words corresponding to each preset segmentation length included in the plurality of reference texts corresponding to the machine-generated text, aiming at each preset segmentation length;

and determining an overlapping factor corresponding to each preset segmentation length based on the preset parameters and the first number and the second number of the words corresponding to each preset segmentation length.

Optionally, the step of determining the overlap factor corresponding to each preset segmentation length based on the preset parameter, the first number and the second number of the words corresponding to each preset segmentation length includes:

aiming at each preset segmentation length, calculating an overlapping factor corresponding to the preset segmentation length according to the following formula:

wherein n represents a preset segmentation length, candidates represents the multiple machine-generated texts, references represents the multiple reference texts of a machine-generated text c, r represents one reference text of the multiple reference texts, n-gram represents a word with a preset segmentation length of n, c represents a machine-generated text, λ represents the preset parameter, Count_c(n-gram) represents the first number of words corresponding to the preset segmentation length n in the machine-generated text c, Count_{c-ref}(n-gram) represents the number of words corresponding to the preset segmentation length n in a reference text corresponding to the machine-generated text c, δ represents the second number of words corresponding to the preset segmentation length n in the plurality of reference texts corresponding to the machine-generated text c, and P_n represents the overlap factor corresponding to the preset segmentation length n.

Optionally, the step of performing weighted summation on the overlap factor corresponding to each preset segmentation length based on the preset weight of each preset segmentation length to obtain the overlap factor of the machine-generated text includes:

calculating an overlap factor for the machine-generated text according to the following formula:

wherein P_avg represents the overlap factor of the machine-generated text, P_n represents the overlap factor corresponding to the preset segmentation length n, w_n represents the preset weight of the preset segmentation length n, and N represents the total number of preset segmentation lengths.

Optionally, the step of determining a repetition penalty factor of the machine-generated text according to the short sentences included in the machine-generated text includes:

determining short sentences contained in the machine-generated text;

and calculating the similarity between short sentences contained in the machine-generated text, and determining a repetition penalty factor of the machine-generated text based on the similarity between the short sentences.

Optionally, the step of determining a length penalty factor of the machine-generated text according to the text length of the machine-generated text, the average text length of the multiple reference texts, and the minimum text length of the multiple reference texts includes:

acquiring the text length of the machine-generated text, the average text length of the plurality of reference texts and the minimum text length of the plurality of reference texts;

determining a length penalty factor for the machine-generated text according to the following formula:

where c represents the machine-generated text, l_c represents the text length of the machine-generated text c, the average text length and the minimum text length are those of the plurality of reference texts of the machine-generated text c, and φ(c) represents the length penalty factor of the machine-generated text c.

Optionally, the step of determining the novelty of the machine-generated text according to the overlap factor, the repetition penalty factor and the length penalty factor of the machine-generated text includes:

and multiplying the overlapping factor, the repeated penalty factor and the length penalty factor in sequence to obtain the novelty of the machine-generated text.

Optionally, the method further includes:

when the determined novelty of the machine-generated text is greater than a preset novelty threshold, determining the machine-generated text as recommendable machine-generated text.

To achieve the above object, an embodiment of the present invention provides a novelty determining apparatus for machine-generated text, including:

the acquisition module is used for acquiring a machine-generated text and a plurality of reference texts corresponding to the machine-generated text;

the first determining module is used for determining an overlapping factor of the machine-generated text according to words included in the machine-generated text and words included in the reference texts, wherein the words are words obtained by segmenting the text according to a preset segmentation length;

the second determining module is used for determining a repetition penalty factor of the machine-generated text according to the short sentences included in the machine-generated text, wherein a short sentence is a sentence obtained by segmenting the text according to a preset separator;

a third determining module, configured to determine a length penalty factor of the machine-generated text according to a text length of the machine-generated text, an average text length of the multiple reference texts, and a minimum text length of the multiple reference texts;

and the fourth determining module is used for determining the novelty of the machine-generated text according to the overlapping factor, the repetition penalty factor and the length penalty factor of the machine-generated text.

Optionally, there are multiple preset segmentation lengths, and the first determining module is specifically configured to:

for each preset segmentation length, determining an overlapping factor corresponding to the preset segmentation length according to words corresponding to the preset segmentation length included in the machine-generated text and words corresponding to the preset segmentation length included in the multiple reference texts, wherein the words corresponding to the preset segmentation length are words obtained by segmenting the text according to the preset segmentation length;

and carrying out weighted summation on the overlapping factors corresponding to each preset segmentation length based on the preset weight of each preset segmentation length to obtain the overlapping factors of the text generated by the machine.

Optionally, the number of the machine-generated texts is multiple, and the first determining module is specifically configured to:

Optionally, the first determining module is specifically configured to:

Optionally, the first determining module is specifically configured to:

Optionally, the second determining module is specifically configured to:

determining short sentences contained in the machine-generated text;

Optionally, the third determining module is specifically configured to:

wherein c represents the machine-generated text, and l_c represents the text length of the machine-generated text c,

Optionally, the fourth determining module is specifically configured to:

Optionally, the apparatus further comprises: a fifth determination module to:

In order to achieve the above object, an embodiment of the present invention further provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;

a memory for storing a computer program;

and the processor is used for realizing the steps of the novelty determination method of any machine-generated text when executing the program stored in the memory.

In order to achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements any of the above method steps.

By applying the method and the device for determining the novelty of the machine-generated text, the machine-generated text and a plurality of reference texts corresponding to the machine-generated text are obtained; determining an overlapping factor of the machine-generated text according to words included in the machine-generated text and words included in the reference texts, wherein the words are words obtained by segmenting the text according to a preset segmentation length; determining a repeated punishment factor of the machine-generated text according to a short sentence included in the machine-generated text, wherein the short sentence is a sentence obtained by segmenting the text according to a preset separator; determining a length penalty factor of the machine-generated text according to the text length of the machine-generated text, the average text length of the plurality of reference texts and the minimum text length of the plurality of reference texts; and determining the novelty of the machine-generated text according to the overlapping factor, the repetition penalty factor and the length penalty factor of the machine-generated text. Therefore, factors such as the overlapping degree of the machine-generated text and the reference text, the repetition degree of the machine-generated text, the lengths of the machine-generated text and the reference text and the like are comprehensively considered, and the novelty of the machine-generated text can be more effectively measured.

Of course, it is not necessary for any product or method of practicing the invention to achieve all of the above-described advantages at the same time.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a method for determining novelty of a machine-generated text according to an embodiment of the present invention;

FIG. 2 is a partial flow diagram of a method for determining novelty of machine-generated text in accordance with an embodiment of the present invention;

FIG. 3 is a schematic flow chart of another part of a method for determining novelty of machine-generated text according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a novelty determining apparatus for machine-generated text according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to more accurately determine the novelty of a machine-generated text and further accurately measure the quality of the machine-generated text, the embodiment of the invention provides a method and a device for determining the novelty of the machine-generated text, an electronic device and a computer-readable storage medium.

Referring to fig. 1, fig. 1 is a schematic flowchart of a method for determining novelty of a machine-generated text according to an embodiment of the present invention, where the method includes the following steps:

S101: acquiring a machine-generated text and a plurality of reference texts corresponding to the machine-generated text.

In an embodiment of the present invention, the machine-generated text may be a text generated in a natural language generation task, for example, a machine-generated text in a machine translation and human-computer conversation scenario.

In order to measure the novelty of the machine-generated text, a corresponding reference text may be acquired, and for one machine-generated text, a plurality of reference texts may be acquired. The reference text can be obtained according to actual requirements, for example, in the field of machine translation, if the machine-generated text to be measured for novelty is the text generated by machine translation, the reference text can be an expert translation text.

S102: determining an overlapping factor of the machine-generated text according to words included in the machine-generated text and words included in the plurality of reference texts, wherein the words are words obtained by segmenting the text according to a preset segmentation length.

In the embodiment of the invention, the overlapping factor of the machine-generated text can be determined according to the words included in the machine-generated text and the words included in the multiple reference texts of the machine-generated text, wherein the words are words obtained by segmenting the text according to the preset segmentation length.

In the embodiment of the invention, the text is segmented word by word according to the preset segmentation length. As an example, if the text is 'utility model patent' and the preset segmentation length is 3, the words after segmentation may be 'utility model', 'novel special' and 'type patent', respectively.
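As a minimal sketch of this segmentation, assuming that "word by word" means sliding a window of the preset segmentation length over the text one character at a time (the function name segment is hypothetical and not taken from the embodiment):

```python
def segment(text: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in text, in order.

    Assumption: the text is segmented at character level with a step of 1.
    """
    if n <= 0:
        raise ValueError("the preset segmentation length must be positive")
    # A text shorter than n yields no words at all.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(segment("abcde", 3))  # ['abc', 'bcd', 'cde']
```

A text tokenized into a list of words could be handled in the same way by slicing the token list instead of the string.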

S103: determining a repetition penalty factor of the machine-generated text according to the short sentences included in the machine-generated text, wherein a short sentence is a sentence obtained by segmenting the text according to a preset separator.

In the embodiment of the invention, aiming at the machine-generated text, the repetition degree of the text can be calculated. The higher the degree of repetition, the lower the degree of novelty.

Specifically, the repetition penalty factor of the machine-generated text may be determined according to the short sentences included in the machine-generated text, where a short sentence is a sentence obtained by segmenting the text according to a preset delimiter. For example, commas and semicolons can be used as the preset delimiters.
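The splitting into short sentences might be sketched as follows. The delimiter set (ASCII and full-width commas and semicolons) and the name split_short_sentences are assumptions made for illustration only; the embodiment names commas and semicolons merely as examples.

```python
import re

# Assumed delimiter set: ASCII and full-width commas and semicolons.
DELIMITERS = ",;，；"


def split_short_sentences(text: str, delimiters: str = DELIMITERS) -> list[str]:
    """Split text at any of the given delimiter characters.

    Surrounding whitespace is stripped and empty pieces are dropped.
    """
    pattern = "[" + re.escape(delimiters) + "]"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


print(split_short_sentences("the cat sat, the dog ran; the end"))
# ['the cat sat', 'the dog ran', 'the end']
```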

S104: determining a length penalty factor of the machine-generated text according to the text length of the machine-generated text, the average text length of the plurality of reference texts and the minimum text length of the plurality of reference texts.

In the embodiment of the invention, the novelty of the machine-generated text is measured, and the text lengths of the machine-generated text and the reference text can be further considered.

Specifically, a length penalty factor of the machine-generated text is determined according to the text length of the machine-generated text, the average text length of the plurality of reference texts and the minimum text length of the plurality of reference texts.

S105: determining the novelty of the machine-generated text according to the overlapping factor, the repetition penalty factor and the length penalty factor of the machine-generated text.

In the embodiment of the invention, the novelty of the machine-generated text can be determined by integrating the overlapping factor, the repetition penalty factor and the length penalty factor of the machine-generated text.

By applying the method and the device for determining the novelty of the machine-generated text, the machine-generated text and a plurality of reference texts corresponding to the machine-generated text are obtained; determining an overlapping factor of the machine-generated text according to words included in the machine-generated text and words included in the plurality of reference texts, wherein the words are words obtained by segmenting the text according to a preset segmentation length; determining a repeated punishment factor of the machine-generated text according to a short sentence included in the machine-generated text and short sentences included in a plurality of reference texts, wherein the short sentence is a sentence obtained by segmenting the text according to a preset separator; determining a length penalty factor of the machine-generated text according to the text length of the machine-generated text, the average text length of the plurality of reference texts and the minimum text length of the plurality of reference texts; and determining the novelty of the machine-generated text according to the overlapping factor, the repetition penalty factor and the length penalty factor of the machine-generated text. Therefore, factors such as the overlapping degree of the machine-generated text and the reference text, the repetition degree of the machine-generated text, the lengths of the machine-generated text and the reference text and the like are comprehensively considered, and the novelty of the machine-generated text can be more effectively measured.

In an embodiment of the present invention, if there are multiple preset segmentation lengths, then referring to FIG. 2, step S102 may specifically include the following steps:

S21: for each preset segmentation length, determining an overlapping factor corresponding to the preset segmentation length according to the words corresponding to the preset segmentation length included in the machine-generated text and the words corresponding to the preset segmentation length included in the plurality of reference texts, wherein the words corresponding to the preset segmentation length are words obtained by segmenting the text according to the preset segmentation length.

In an embodiment of the present invention, for example, if there are four preset segmentation lengths, namely 1, 2, 3 and 4, a corresponding overlap factor is calculated for each of them.

Specifically, for a given preset segmentation length, the overlap factor corresponding to the preset segmentation length may be determined according to the number of words corresponding to the preset segmentation length included in the machine-generated text and the number of words corresponding to the preset segmentation length included in the plurality of reference texts.

Further, in an embodiment of the present invention, the number of the machine-generated texts to be measured may be multiple, and each machine-generated text corresponds to multiple reference texts.

As an example, the machine-generated texts may be represented as a list. Assuming that there are n machine-generated texts to be measured, they may be represented as A = [a1, a2, ..., ai, ..., an] (1 ≤ i ≤ n), where ai represents the i-th machine-generated text. Assuming that each machine-generated text has m reference texts, the m reference texts corresponding to ai may be represented as Bi = [bi1, bi2, ..., bim].

The step S21 may specifically include the following steps S211 to S212, see fig. 3.

S211: for each preset segmentation length, counting a first number of words corresponding to the preset segmentation length included in each machine-generated text and a second number of words corresponding to the preset segmentation length included in the plurality of reference texts corresponding to the machine-generated text.

In conjunction with the above example, the machine-generated texts are a1, a2, ..., ai, ..., an, and the m reference texts corresponding to ai are bi1, bi2, ..., bim.

Then, for each preset segmentation length, a first number of words corresponding to the preset segmentation length included in the machine-generated text a1 may be counted, together with a second number of words corresponding to the preset segmentation length included in the m reference texts corresponding to a1. Likewise, a first number of words corresponding to the preset segmentation length included in the machine-generated text a2 and a second number of words corresponding to the preset segmentation length included in the m reference texts corresponding to a2 are counted, and so on, until the first number for the machine-generated text an and the second number for the m reference texts corresponding to an have been counted.
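The counting step could be sketched as below. This is an illustration under stated assumptions, not the embodiment's own procedure: words are taken as character-level sliding windows, the counts are kept per distinct word, and the names ngram_counts and count_for_length are hypothetical.

```python
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count each character-level word of length n in text (assumed segmentation)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def count_for_length(candidate: str, references: list[str], n: int) -> tuple[Counter, Counter]:
    """Return (first, second) counts for one machine-generated text.

    first:  counts of words of length n in the machine-generated text
    second: counts of words of length n summed over all of its reference texts
    """
    first = ngram_counts(candidate, n)
    second = Counter()
    for ref in references:
        second.update(ngram_counts(ref, n))
    return first, second


# One machine-generated text with two reference texts, segmentation length 2.
first, second = count_for_length("abab", ["ab", "bab"], 2)
print(first)   # Counter({'ab': 2, 'ba': 1})
print(second)  # Counter({'ab': 2, 'ba': 1})
```

Repeating the call for every machine-generated text ai with its reference list Bi, and for every preset segmentation length, gives the inputs that step S212 works from.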

S212: determining an overlapping factor corresponding to each preset segmentation length based on the preset parameter and the first number and the second number of the words corresponding to each preset segmentation length.

In an embodiment of the present invention, after counting the first number and the second number of words corresponding to each preset segmentation length, the data may be combined to determine the overlap factor corresponding to each preset segmentation length.

In an embodiment of the present invention, for each preset segmentation length, an overlap factor corresponding to the preset segmentation length may be calculated according to the following formula:

wherein n represents a preset segmentation length, candidates represents the multiple machine-generated texts, references represents the multiple reference texts of a machine-generated text c, r represents one reference text of the multiple reference texts, n-gram represents a word with a preset segmentation length of n, c represents a machine-generated text, λ represents the preset parameter, which can be set according to actual conditions and may take a value between 0 and 1, Count_c(n-gram) represents the first number of words corresponding to the preset segmentation length n in the machine-generated text c, Count_{c-ref}(n-gram) represents the number of words corresponding to the preset segmentation length n in a reference text corresponding to the machine-generated text c, δ represents the second number of words corresponding to the preset segmentation length n in the plurality of reference texts corresponding to the machine-generated text c, and P_n represents the overlap factor corresponding to the preset segmentation length n.

S22: based on the preset weight of each preset segmentation length, carrying out weighted summation on the overlapping factors corresponding to each preset segmentation length to obtain the overlapping factor of the machine-generated text.

After the overlap factor corresponding to each preset segmentation length is determined, the overlap factor corresponding to each preset segmentation length can be weighted and summed based on the preset weight of each preset segmentation length, so that the overlap factor of the whole machine-generated text is obtained.

In one embodiment of the invention, the overlap factor for machine-generated text may be calculated as follows:

wherein P_avg represents the overlap factor of the machine-generated text, P_n represents the overlap factor corresponding to the preset segmentation length n, w_n represents the preset weight of the preset segmentation length n, and N represents the total number of preset segmentation lengths.

Therefore, in the embodiment of the invention, the overlapping factors corresponding to the preset segmentation lengths are integrated, and the overlapping factor of the whole text generated by the machine is calculated.
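Read literally, the weighted summation described for step S22 corresponds to an expression of the following form. This is a reconstruction from the wording "weighted summation" and the symbol definitions above, not a copy of the embodiment's formula, and it assumes that the preset segmentation lengths are 1 through N:

```latex
P_{avg} = \sum_{n=1}^{N} w_n \, P_n
```

For instance, with N = 4 and equal weights w_n = 0.25, P_avg would be the plain average of P_1 to P_4.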

In one embodiment of the present invention, step S103, determining a repetition penalty factor of the machine-generated text according to the short sentences included in the machine-generated text, may specifically include the following steps:

determining short sentences contained in the machine-generated text;

Specifically, the machine-generated text may be divided at delimiters such as commas and semicolons to obtain a plurality of short sentences; the similarity between every two short sentences is then calculated, and the average of these similarities is used as the repetition penalty factor of the machine-generated text.

As an example, if three short sentences a, b, and c are obtained after the machine-generated text is divided, the similarity of the short sentences a and b, the similarity of the short sentences a and c, and the similarity of the short sentences b and c may be calculated respectively, and then the three similarities are averaged to be used as the repetition penalty factor of the machine-generated text.

The process of calculating the similarity between short sentences can be found in the related art. For example, the calculation may be performed using the existing BLEU (Bilingual Evaluation Understudy) algorithm, which is not described in detail herein.
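The splitting and pairwise averaging described above can be sketched in Python as follows. The similarity function here is a simple token-set overlap (Jaccard), used only as a stand-in so that the example is self-contained; it is not BLEU. The delimiter set, the function names, and the convention of returning 0.0 when fewer than two short sentences exist are all assumptions.

```python
import re
from itertools import combinations


def split_short_sentences(text, delimiters=",;，；"):
    """Split a text into short sentences at the given delimiter characters."""
    parts = re.split("[" + re.escape(delimiters) + "]", text)
    return [p.strip() for p in parts if p.strip()]


def token_overlap(a, b):
    """Jaccard overlap of whitespace tokens; a stand-in for BLEU, not BLEU itself."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def mean_pairwise_similarity(text, similarity=token_overlap):
    """Average the similarity over every pair of short sentences in the text."""
    phrases = split_short_sentences(text)
    pairs = list(combinations(phrases, 2))
    if not pairs:
        return 0.0  # assumed convention: nothing to compare
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Three short sentences give three pairs, matching the a, b, c example above.
value = mean_pairwise_similarity("a b c, a b c; x y z")
```

Any pairwise similarity measure, including a BLEU implementation, could be passed through the `similarity` parameter.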

In one embodiment of the present invention, step S104: determining a length penalty factor of the machine-generated text according to the text length of the machine-generated text, the average text length of the plurality of reference texts and the minimum text length of the plurality of reference texts, which may specifically include the following steps:

acquiring the text length of a machine-generated text, the average text length of a plurality of reference texts and the minimum text length of the plurality of reference texts;

wherein c represents a machine-generated text, c being generic and able to denote any machine-generated text; l_c represents the text length of the machine-generated text c; the formula further uses the average text length of the plurality of reference texts of the machine-generated text c and the minimum text length of those reference texts; and φ(c) represents the length penalty factor of the machine-generated text c.
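The three quantities acquired for the length penalty factor can be gathered as in the following minimal sketch. It computes only the inputs (the text length of the machine-generated text, the average reference length, and the minimum reference length); the formula for φ(c) itself is not reproduced here. Measuring length in whitespace tokens and the function name `length_inputs` are assumptions.

```python
def length_inputs(candidate, references):
    """Return (candidate length, average reference length, minimum reference length)."""
    if not references:
        raise ValueError("at least one reference text is required")
    candidate_length = len(candidate.split())  # length in tokens (an assumption)
    reference_lengths = [len(r.split()) for r in references]
    average_length = sum(reference_lengths) / len(reference_lengths)
    return candidate_length, average_length, min(reference_lengths)


# Illustrative inputs only.
l_c, l_avg, l_min = length_inputs("a b c", ["a b", "a b c d"])
```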

In an embodiment of the present invention, step S105 may specifically include: and multiplying the overlapping factor, the repeated penalty factor and the length penalty factor in sequence to obtain the novelty of the machine-generated text.

Therefore, the novelty of the machine-generated text can be measured more effectively by comprehensively considering the overlapping degree of the machine-generated text and the reference text, the repetition degree of the machine-generated text, the lengths of the machine-generated text and the reference text and other factors.

In one embodiment of the invention, when the determined novelty of the machine-generated text is greater than a preset novelty threshold, the machine-generated text may be determined to be recommendable machine-generated text.

Specifically, if the novelty of the determined machine-generated text is greater than a preset novelty threshold, it indicates that the machine-generated text has a high use value, and may be determined as a recommendable machine-generated text, and the machine-generated text is recommended to the user in a specific scenario. For example, in a human-computer interaction scenario, if a certain machine-generated text has a high degree of novelty, it may be recorded so as to be recommended to the user in the corresponding scenario.
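The multiplication of the three factors and the comparison against a preset novelty threshold can be sketched as follows. The function names, the sample factor values, and the threshold value are hypothetical and chosen only for illustration.

```python
def novelty(overlap_factor, repetition_penalty, length_penalty):
    """Novelty as the product of the three factors, as described for step S105."""
    return overlap_factor * repetition_penalty * length_penalty


def is_recommendable(score, threshold):
    """A text is recommendable only when its novelty exceeds the preset threshold."""
    return score > threshold


# Illustrative values only.
score = novelty(0.5, 0.8, 0.9)
recommend = is_recommendable(score, threshold=0.3)
```

The comparison is strict, following the wording "greater than a preset novelty threshold".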

Corresponding to the method for determining the novelty of machine-generated text provided by the embodiment of the present invention, an embodiment of the present invention further provides a device for determining the novelty of machine-generated text. Referring to fig. 4, the device may include the following modules:

an obtaining module 401, configured to obtain a machine-generated text and a plurality of reference texts corresponding to the machine-generated text;

a first determining module 402, configured to determine an overlap factor of the machine-generated text according to a word included in the machine-generated text and a word included in the multiple reference texts, where the word is a word obtained by segmenting a text according to a preset segmentation length;

a second determining module 403, configured to determine a repetition penalty factor of the machine-generated text according to the short sentences included in the machine-generated text, where a short sentence is a sentence obtained by segmenting a text according to a preset delimiter;

a third determining module 404, configured to determine a length penalty factor of the machine-generated text according to the text length of the machine-generated text, the average text length of the multiple reference texts, and the minimum text length of the multiple reference texts;

a fourth determining module 405, configured to determine the novelty of the machine-generated text according to the overlap factor, the repetition penalty factor, and the length penalty factor of the machine-generated text.

In an embodiment of the present invention, the first determining module 402 may be specifically configured to:

In an embodiment of the present invention, the machine-generated text includes a plurality of texts, and the first determining module 402 may be specifically configured to:

wherein P_avg represents the overlap factor of the machine-generated text, P_n represents the overlap factor corresponding to the preset segmentation length n, w_n represents the preset weight of the preset segmentation length n, and N represents the total number of preset segmentation lengths.

In an embodiment of the present invention, the second determining module 403 may specifically be configured to:

determining short sentences contained in the machine-generated text;

In an embodiment of the present invention, the third determining module 404 may specifically be configured to:

where the formula uses the minimum text length of the plurality of reference texts of the machine-generated text c, and φ(c) represents the length penalty factor of the machine-generated text c.

In an embodiment of the present invention, the fourth determining module 405 may specifically be configured to:

In an embodiment of the present invention, on the basis of the apparatus shown in fig. 4, a fifth determining module may further be included, where the fifth determining module is configured to:

By applying the device for determining the novelty of machine-generated text provided by the embodiment of the present invention, a machine-generated text and a plurality of reference texts corresponding to the machine-generated text are acquired; an overlap factor of the machine-generated text is determined according to the words included in the machine-generated text and the words included in the plurality of reference texts, where a word is obtained by segmenting a text according to a preset segmentation length; a repetition penalty factor of the machine-generated text is determined according to the short sentences included in the machine-generated text, where a short sentence is obtained by segmenting a text according to a preset delimiter; a length penalty factor of the machine-generated text is determined according to the text length of the machine-generated text, the average text length of the plurality of reference texts and the minimum text length of the plurality of reference texts; and the novelty of the machine-generated text is determined according to the overlap factor, the repetition penalty factor and the length penalty factor of the machine-generated text. Because the degree of overlap between the machine-generated text and the reference texts, the degree of repetition within the machine-generated text, and the lengths of the machine-generated text and the reference texts are all taken into account, the novelty of the machine-generated text can be measured more effectively.

Corresponding to the embodiment of the method for determining the novelty of machine-generated text, an embodiment of the present invention further provides an electronic device, as shown in fig. 5, including a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 communicate with one another via the communication bus 504;

a memory 503 for storing a computer program;

the processor 501, when executing the program stored in the memory 503, implements the following steps:

determining an overlapping factor of the machine-generated text according to words included in the machine-generated text and words included in the reference texts, wherein the words are words obtained by segmenting the text according to a preset segmentation length;

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The processor may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

By applying the electronic device provided by the embodiment of the present invention, a machine-generated text and a plurality of reference texts corresponding to the machine-generated text are acquired; an overlap factor of the machine-generated text is determined according to the words included in the machine-generated text and the words included in the plurality of reference texts, where a word is obtained by segmenting a text according to a preset segmentation length; a repetition penalty factor of the machine-generated text is determined according to the short sentences included in the machine-generated text, where a short sentence is obtained by segmenting a text according to a preset delimiter; a length penalty factor of the machine-generated text is determined according to the text length of the machine-generated text, the average text length of the plurality of reference texts and the minimum text length of the plurality of reference texts; and the novelty of the machine-generated text is determined according to the overlap factor, the repetition penalty factor and the length penalty factor of the machine-generated text. Because the degree of overlap between the machine-generated text and the reference texts, the degree of repetition within the machine-generated text, and the lengths of the machine-generated text and the reference texts are all taken into account, the novelty of the machine-generated text can be measured more effectively.

An embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program implements any of the above method steps.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiments of the novelty determining apparatus, the electronic device, and the computer-readable storage medium of the machine-generated text, since they are substantially similar to the embodiments of the novelty determining method of the machine-generated text, the description is simple, and for the relevant points, refer to the partial description of the embodiments of the novelty determining method of the machine-generated text.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for novelty determination of machine-generated text, the method comprising:

determining the novelty of the machine-generated text according to the overlapping factor, the repetition penalty factor and the length penalty factor of the machine-generated text;

the step of determining a repetition penalty factor of the machine-generated text according to the short sentences included in the machine-generated text comprises:

determining short sentences contained in the machine-generated text;

calculating the similarity between the short sentences contained in the machine-generated text, and determining the repetition penalty factor of the machine-generated text based on the similarity between the short sentences;

the step of determining a length penalty factor for the machine-generated text based on the text length of the machine-generated text, the average text length of the plurality of reference texts, and the minimum text length of the plurality of reference texts comprises:

where c represents the machine-generated text, l_c represents the text length of the machine-generated text c,

2. The method of claim 1, wherein the preset segmentation length is multiple, and the step of determining the overlap factor of the machine-generated text according to the words included in the machine-generated text and the words included in the multiple reference texts comprises:

3. The method of claim 2, wherein the machine-generated text is plural;

the step of determining the overlap factor corresponding to each preset segmentation length according to the words corresponding to the preset segmentation length included in the machine-generated text and the words corresponding to the preset segmentation length included in the plurality of reference texts comprises, for each preset segmentation length:

4. The method according to claim 3, wherein the step of determining the overlap factor corresponding to each preset segmentation length based on the preset parameters, the first number and the second number of the words corresponding to each preset segmentation length comprises:

wherein n represents a preset segmentation length; candidates represents the plurality of machine-generated texts; references represents the plurality of reference texts of a machine-generated text c; r represents one reference text of the plurality of reference texts; n-gram represents a word of the preset segmentation length n; c represents the machine-generated text; λ represents the preset parameter; Count_c(n-gram) represents the first number of words corresponding to the preset segmentation length n in the machine-generated text c; Count_c-ref(n-gram) represents the number of words corresponding to the preset segmentation length n in the reference text ref corresponding to the machine-generated text c; Δ represents the second number of words corresponding to the preset segmentation length n in the plurality of reference texts corresponding to the machine-generated text c; and P_n represents the overlap factor corresponding to the preset segmentation length n.

5. The method according to claim 3, wherein the step of performing weighted summation on the overlap factor corresponding to each preset segmentation length based on the preset weight of each preset segmentation length to obtain the overlap factor of the machine-generated text comprises:

6. The method of claim 1, wherein the step of determining the novelty of the machine-generated text in accordance with the overlap factor, repetition penalty factor, and length penalty factor of the machine-generated text comprises:

7. The method of claim 1, further comprising:

8. An apparatus for machine-generated text novelty determination, the apparatus comprising:

the acquisition module is used for acquiring a machine-generated text and a plurality of reference texts corresponding to the machine-generated text;

the fourth determining module is used for determining the novelty of the machine-generated text according to the overlapping factor, the repetition penalty factor and the length penalty factor of the machine-generated text;

the second determining module is specifically configured to:

determining short sentences contained in the machine-generated text;

the third determining module is specifically configured to:

where the formula uses the average text length of the plurality of reference texts of the machine-generated text c and the minimum text length of the plurality of reference texts of the machine-generated text c, and φ(c) represents the length penalty factor of the machine-generated text c.