CN117807236A - Text detection method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117807236A
CN117807236A (application number CN202410121988.0A)
Authority
CN
China
Prior art keywords
text
original text
speech
sentence
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410121988.0A
Other languages
Chinese (zh)
Inventor
赖清泉
Current Assignee
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202410121988.0A
Publication of CN117807236A
Legal status: Pending


Abstract

The invention relates to a text detection method and apparatus, an electronic device, and a storage medium, and discloses: acquiring an original text to be detected; extracting sentences contained in the original text and detecting key features of the sentences; calculating language structure features of the original text by using the key features corresponding to each sentence; and classifying the original text based on the language structure features to obtain the text type corresponding to the original text. By extracting the key features of the sentences in the original text, important information in the text can be captured. The language structure features of the original text are then calculated from the key features corresponding to each sentence; through these features, the characteristics and composition of the original text can be further understood and analyzed. Finally, the original text is classified using the language structure features and accurately divided into a model-generated type and a human-written type, realizing the distinction between text automatically generated by a model and natural text written by a human.

Description

Text detection method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of text detection, and in particular, to a text detection method, apparatus, electronic device, and storage medium.
Background
With the development of computer technology, automatic text generation tools have gained widespread attention and application on the Internet. Such tools may employ models to automatically generate text content such as news stories, advertising copy, and the like. While they can improve production efficiency and save labor costs, they also bring potential risks and problems.
Thus, there is a need for a method to distinguish text automatically generated by a model from natural text written by a human.
Disclosure of Invention
In view of the above, embodiments of the present invention provide a text detection method, apparatus, electronic device, and storage medium, so as to solve the problem of automatically classifying and labeling text generated by an automatic text generation tool using a model versus natural text written manually.
In a first aspect, an embodiment of the present invention provides a text detection method, where the method includes:
acquiring an original text to be detected; extracting sentences contained in the original text, and detecting key features of the sentences; calculating language structure characteristics of the original text by utilizing key characteristics corresponding to each sentence; classifying the original text based on the language structure features to obtain text types corresponding to the original text, wherein the text types comprise model generation types and manual writing types.
In a second aspect, an embodiment of the present invention provides a text detection apparatus, including:
the acquisition module is used for acquiring the original text to be detected; the extraction module is used for extracting sentences contained in the original text and detecting key features of the sentences; the calculation module is used for calculating language structure characteristics of the original text by utilizing key characteristics corresponding to each sentence; and the processing module is used for classifying the original text based on the language structure characteristics to obtain a text type corresponding to the original text, wherein the text type comprises a model generation type and a manual writing type.
In a third aspect, an embodiment of the present invention provides an electronic device, including: the memory and the processor are in communication connection, the memory stores computer instructions, and the processor executes the computer instructions to perform the method of the first aspect or any implementation manner corresponding to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the first aspect or any of its corresponding embodiments.
According to the method, by extracting the key features of the sentences in the original text, important information in the text can be captured. The language structure features of the original text are then calculated from the key features corresponding to each sentence; these may include the confusion degree distribution, the part-of-speech distribution, the dependency syntax distribution, and the like. Through the language structure features, the characteristics and composition of the original text can be further understood and analyzed. Finally, the original text is classified using the language structure features and accurately divided into a model-generated type and a human-written type, realizing the distinction between text automatically generated by a model and natural text written by a human.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of a text detection method according to some embodiments of the invention;
FIG. 2 is a flow diagram of a text detection method according to some embodiments of the invention;
fig. 3 is a block diagram of a structure of a text detection apparatus according to an embodiment of the present invention;
fig. 4 is a schematic hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
According to embodiments of the present invention, there are provided a text detection method, apparatus, electronic device, and storage medium. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, such as by a set of computer-executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that herein.
In this embodiment, a text detection method is provided, fig. 1 is a flowchart of the text detection method according to an embodiment of the present invention, and as shown in fig. 1, the flowchart includes the following steps:
step S101, obtaining an original text to be detected.
In the embodiment of the application, the original text refers to text, obtained from various sources, that needs to be classified. In particular, the original text to be detected may be obtained in a variety of ways and from a variety of sources, such as from news websites, podcasts, advertisements, or other media channels. When acquiring the original text to be detected, data extraction and collection can be performed according to specific requirements and acquisition channels using web crawlers, API interfaces, content subscriptions, and the like.
Step S102, extracting sentences contained in the original text, and detecting key features of the sentences.
In the embodiment of the application, the original text is segmented with a sentence splitter, splitting the text into a plurality of sentences. Then, part-of-speech tagging is performed on each sentence, associating each word in the sentence with its corresponding part of speech. Part-of-speech tagging may use a pre-trained tagger, such as a statistics-based tagger or a deep-learning-based tagger. Common part-of-speech labels include nouns (NN), verbs (VB), adjectives (JJ), and the like.
After determining the parts of speech, the generation probability of each word, i.e. the probability of the word occurring in the corpus, is calculated. The generation probability may be estimated by a statistical language model, such as an n-gram model or a neural network language model. For large-scale corpora, words that do not appear in the corpus can be handled using smoothing techniques. In addition, the number of occurrences of each part of speech in the sentence must be counted: each sentence may be traversed, the parts of speech in the sentence counted, and the number of occurrences of each part of speech recorded. Finally, the occurrence count of each part of speech in the sentence and the generation probability of the words are used as key features.
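As a minimal sketch of this key-feature extraction, the following uses a toy corpus, a smoothed unigram model, and a hard-coded word-to-POS lookup as illustrative stand-ins for a real corpus, language model, and tagger:

```python
from collections import Counter

# Toy corpus used to estimate unigram generation probabilities (illustrative only).
corpus = "the cat sat on the mat the dog sat".split()
counts = Counter(corpus)
total = sum(counts.values())   # 9 tokens
vocab = len(counts)            # 6 distinct words

def generation_prob(word):
    """Unigram probability with add-one (Laplace) smoothing, so unseen words
    still receive a small non-zero probability."""
    return (counts[word] + 1) / (total + vocab)

# Hypothetical word -> part-of-speech lookup standing in for a real tagger.
pos_lookup = {"the": "DT", "cat": "NN", "dog": "NN", "sat": "VB", "on": "IN", "mat": "NN"}

def key_features(sentence_words):
    """Key features of one sentence: per-word generation probabilities
    and per-part-of-speech occurrence counts."""
    probs = [generation_prob(w) for w in sentence_words]
    pos_counts = Counter(pos_lookup.get(w, "UNK") for w in sentence_words)
    return probs, pos_counts

probs, pos_counts = key_features(["the", "cat", "sat"])
```

In practice the unigram model would be replaced by the n-gram or neural language model mentioned above, and the lookup by a trained tagger.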
Step S103, language structure features of the original text are calculated by utilizing the key features corresponding to each sentence.
In the embodiment of the application, the confusion degree (perplexity) distribution of the original text can be calculated using the generation probabilities of the words, and the part-of-speech distribution and the dependency syntax distribution of the original text can be calculated using the occurrence counts of the parts of speech. It will be appreciated that the confusion degree distribution reflects the complexity and uncertainty of the text's language. The part-of-speech distribution helps to understand the relative frequencies and distribution of the various parts of speech in the text. For example, the proportions of nouns, verbs, adjectives, and other parts of speech in the whole text can be calculated, and differences among different types of text can be observed, revealing the language style and characteristics of the text. The dependency syntax describes the grammatical relations between words. A dependency parser may be used to analyze the syntactic structure of each sentence and calculate the distribution of the different types of dependency relations. For example, the frequencies of subject-predicate relations, verb-object relations, modifier relations, and so on can be counted, revealing the usage and preference of different grammatical structures in the text.
By computing and analyzing these language structural features using the corresponding key features of the sentence, a deep understanding of the language features of the text can be obtained and reliable basis and clues can be provided for subsequent classification of the text.
Step S104, classifying the original text based on the language structure features to obtain text types corresponding to the original text, wherein the text types comprise model generation types and manual writing types.
Specifically, for an original text to be classified, firstly extracting language structural features of the original text, and then predicting the original text by using a trained classifier model to obtain a corresponding text type. The probability values predicted by the model may be used to measure the confidence of the classification.
The training process of the classifier model is as follows. First, an annotated original text dataset is collected, including samples of the model-generated type and samples of the human-written type. The samples in the original text dataset can come from a corpus, such as the CCL corpus, which contains real human language data from various fields and has a certain accuracy and representativeness. Meanwhile, existing open-source large models are used to rewrite the corpus or answer dialogue questions, generating rewritten text. In this way, automatically generated text covering different fields and different types of models can be obtained. By combining the corpus text and the rewritten text, more diverse and richer training samples can thus be obtained. This helps the classifier learn the characteristics of different text types, so as to better separate the model-generated type from the human-written type. In addition, using a large model to generate rewritten text may yield some negative examples. These negative examples can be used as counterexamples when training the model, helping it learn the characteristics of the model-generated type and distinguish them from the human-written type.
Secondly, preprocessing is carried out on the text, such as word segmentation, stop word removal, part-of-speech tagging, syntactic analysis and the like. Language structural features including confusion-degree distribution, part-of-speech distribution, and dependency syntax distribution are extracted from the preprocessed text. For the confusion distribution, the confusion of each sentence may be calculated and its distribution characteristics, such as average confusion, maximum or minimum confusion, etc., may be counted. For part-of-speech distribution, the frequency or proportion of occurrences of each part-of-speech in each sentence is counted. For dependency syntax distribution, the occurrence frequency or proportion of various dependencies in each statement is counted.
The extracted language structure features are then suitably represented, for example by converting the confusion degree distribution, the part-of-speech distribution, and the dependency syntax distribution into feature vectors. Vector representation methods may be used, such as stitching the individual features into one feature vector or using a sparse matrix representation. Using the annotated dataset, a classifier model is trained with the feature vectors and the corresponding text types as training samples. Common classifier models include naive Bayes, support vector machines, logistic regression, and the like.
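A toy sketch of the feature stitching and classification described above; the nearest-centroid classifier here is only an illustrative stand-in for a trained naive Bayes/SVM/logistic regression model, and all numbers are hypothetical:

```python
def build_feature_vector(ppl_dist, pos_dist, dep_dist):
    """Stitch the three distributions into one flat feature vector."""
    return list(ppl_dist) + list(pos_dist) + list(dep_dist)

def nearest_centroid_predict(vec, centroids):
    """Toy stand-in for a trained classifier: pick the label whose centroid
    (mean training feature vector) is closest in squared Euclidean distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(vec, centroids[label]))

# Hypothetical 3-dimensional centroids (real ones would span the full stitched vector).
centroids = {
    "model-generated": [0.1, 0.8, 0.1],
    "human-written": [0.4, 0.3, 0.3],
}
vec = build_feature_vector([0.35], [0.35], [0.3])
label = nearest_centroid_predict(vec, centroids)
```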
According to the method, by extracting the key features of the sentences in the original text, important information in the text can be captured. The language structure features of the original text are then calculated from the key features corresponding to each sentence; these may include the confusion degree distribution, the part-of-speech distribution, the dependency syntax distribution, and the like. Through the language structure features, the characteristics and composition of the original text can be further understood and analyzed. Finally, the original text is classified using the language structure features and accurately divided into a model-generated type and a human-written type, realizing the distinction between text automatically generated by a model and natural text written by a human.
In this embodiment, a text detection method is provided, fig. 2 is a flowchart of the text detection method according to an embodiment of the present invention, and as shown in fig. 2, the flowchart includes the following steps:
step S201, an original text to be detected is acquired. The detailed description refers to the corresponding related descriptions of the above embodiments, and will not be repeated here.
Step S202, extracting sentences contained in the original text, and detecting key features of the sentences.
Extracting the sentences contained in the original text may be done by splitting the original text into a plurality of sentences with a sentence-splitting method. Sentence-splitting methods include using punctuation marks (such as periods, question marks, and exclamation marks) as separators, or using natural language processing techniques to perform grammatical analysis and identify sentence boundaries.
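The punctuation-based variant can be sketched as follows; the particular punctuation set (Chinese and Western sentence-final marks) is an assumption:

```python
import re

def split_sentences(text):
    """Split text on sentence-final punctuation (Chinese and Western),
    dropping empty fragments and surrounding whitespace."""
    parts = re.split(r"[。！？.!?]+", text)
    return [p.strip() for p in parts if p.strip()]

sentences = split_sentences("今天天气很好。我们去公园吧！Really? Yes.")
```

A grammar-aware splitter would be needed for text where periods also mark abbreviations or decimals.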
The key features of the detection statement in step S202 include the following steps a1-a3:
step a1, forming a word sequence by utilizing words contained in the sentence, and sequentially calculating the generation probability corresponding to each word in the word sequence.
Specifically, the sentence whose generation probability is to be calculated is segmented to form a word sequence; a word segmentation tool may be used to split the sentence into discrete words. The generation probability is calculated as:
P(X) = P(W_1)·P(W_2 | W_1)·...·P(W_n | W_1, ..., W_{n-1}),
where P(X) represents the generation probability under the language model, n is the text length, and W_1, W_2, ..., W_n are the words in the word sequence.
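The chain-rule generation probability can be sketched as follows; the bigram conditional probabilities are hypothetical, standing in for a trained language model:

```python
import math

# Hypothetical conditional probabilities P(W_i | W_{i-1}) from some trained bigram model.
bigram_prob = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.4,
}

def sequence_prob(words):
    """P(X) as the product of per-word conditional probabilities (bigram
    approximation); accumulating in log space avoids underflow on long texts."""
    lp = 0.0
    prev = "<s>"
    for w in words:
        lp += math.log(bigram_prob[(prev, w)])
        prev = w
    return math.exp(lp)

p = sequence_prob(["the", "cat", "sat"])   # 0.5 * 0.2 * 0.4
```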
Step a2, detecting the part of speech corresponding to the words contained in the sentence, and determining the occurrence times of the part of speech in the sentence.
In the embodiment of the application, a corpus containing tagged parts of speech is obtained or an existing part of speech tagging dataset is used. A part-of-speech tagger is used or a part-of-speech tagger model is trained to assign a respective part-of-speech tag to each word in the sentence. The part-of-speech tagging model may be based on statistical methods or deep learning methods. Each word is associated with its corresponding part of speech. And traversing each sentence to count the parts of speech, and calculating the occurrence times of each part of speech in the sentences. By detecting the part of speech corresponding to the words contained in the sentence and determining the number of occurrences of the part of speech in the sentence, information about the use condition and distribution of the part of speech in the sentence can be conveniently obtained.
And a step a3, taking the generation probability corresponding to the words and the occurrence times of the parts of speech in the sentences as key features.
By using the word generation probability and the part-of-speech occurrence number as key features, the quality and the authenticity of the original text can be judged and evaluated. The probability of generation can reflect the potential predictive power of the language model on the words, and the number of occurrences of the parts of speech can reveal the grammatical features and structure of the text. These key features can be used for subsequent classification of automatically generated text to help identify non-conforming, spurious, or unreasonable content.
Step S203, language structure features of the original text are calculated by utilizing the key features corresponding to each sentence.
In the embodiment of the application, language structural features of the original text are calculated by utilizing key features corresponding to each sentence, and the method comprises the following steps b1-b3:
and b1, determining the confusion distribution of the original text according to the generation probability of the words contained in each sentence.
Specifically, determining the confusion degree distribution of the original text according to the generation probabilities of the words contained in each sentence includes the following steps: calculating the sentence confusion degree (perplexity) of each sentence from the generation probability of each word and the text length of the sentence; quantizing the sentence confusion degree to obtain a quantized confusion degree; determining the number of quantized confusion degrees falling into each preset quantization interval, and taking that count as the number of sentences corresponding to the preset quantization interval; and calculating the ratio of the number of sentences corresponding to each preset quantization interval to the total number of sentences of the original text, thereby obtaining the confusion degree distribution.
It can be appreciated that if the original text contains m sentences in total, the sentence confusion degrees after calculation are denoted PPL_1, PPL_2, ..., PPL_m. The sentence confusion degrees are then quantized with a step of 0.1, keeping only the first 1000 quantized values (i.e., original values greater than 100 are discarded). The proportion of sentences falling into each quantization interval within the original text is then calculated, giving the distribution PPP = (PPP_1, PPP_2, ..., PPP_1000).
For example, take three sentences A, B, and C with sentence confusion degrees PPL_A = 1.5, PPL_B = 12.3, and PPL_C = 1.39. Quantizing with a step of 0.1 (keeping only the first 1000 values and directly discarding values greater than 100) yields the quantized indices 15, 123, and 14, respectively. The proportion of each quantization interval over all sentences is then:
PPP = (PPP_1 = 0.0, ..., PPP_14 = 0.33, PPP_15 = 0.33, ..., PPP_123 = 0.33, ..., PPP_1000 = 0.0)
= (0.0, ..., 0.33, 0.33, ..., 0.33, ..., 0.0).
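A minimal sketch of this quantization of sentence confusion degrees into a 1000-interval distribution; rounding to the nearest 0.1 bin is an assumed reading of the quantization rule:

```python
def perplexity_distribution(ppls, step=0.1, num_bins=1000):
    """Quantize each sentence perplexity (confusion degree) into bins of width
    `step`, discard values above num_bins * step (i.e. 100), and return each
    bin's share of the text's total sentence count."""
    bins = [0] * (num_bins + 1)
    for ppl in ppls:
        idx = round(ppl / step)
        if idx <= num_bins:
            bins[idx] += 1
    total = len(ppls)
    return [c / total for c in bins]

ppp = perplexity_distribution([1.5, 12.3, 1.39])
```

Note that discarded sentences (perplexity above 100) still count toward the denominator, matching the "total number of sentences of the original text" wording.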
and b2, determining part-of-speech distribution of the original text according to the occurrence times of the parts-of-speech in each sentence so as to determine the dependency syntax distribution of the original text.
Specifically, determining the part-of-speech distribution of the original text according to the occurrence times of the part-of-speech in each sentence includes: acquiring the total number of words contained in the original text; and calculating the ratio of the occurrence times corresponding to each part of speech to the total number of words to obtain part of speech distribution.
It will be appreciated that the original text is broken down into individual words using a word segmentation tool; commonly used word segmentation tools include NLTK, jieba, and the like. Part-of-speech tagging is performed on the segmented words, associating each word with its corresponding part of speech. The total number of words after segmentation, i.e. the total number of words contained in the original text, is calculated. The number of occurrences of each part of speech is counted according to the tagging result, and the ratio of each part of speech's occurrence count to the total number of words gives the part-of-speech distribution.
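A minimal sketch of this part-of-speech distribution calculation; the tagged input is assumed to come from a tagger such as NLTK's `pos_tag` or jieba's `posseg`:

```python
from collections import Counter

def pos_distribution(tagged_words):
    """Ratio of each part-of-speech occurrence count to the total word count.
    `tagged_words` is a list of (word, pos) pairs produced by a tagger."""
    total = len(tagged_words)
    counts = Counter(pos for _, pos in tagged_words)
    return {pos: c / total for pos, c in counts.items()}

dist = pos_distribution([("the", "DT"), ("cat", "NN"), ("sat", "VB"), ("down", "RB")])
```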
According to the method and the device, through word segmentation and part-of-speech tagging, the original text can be accurately processed, the text is split into each independent word, and the part-of-speech of each word is associated. By calculating the part-of-speech distribution, the relative frequencies of different parts-of-speech in the text can be obtained, so that the characteristics and the content of the text are analyzed in a fine granularity.
Specifically, determining the dependency syntax distribution of the original text according to the occurrence times of the parts of speech in each sentence includes: determining the target number of times that each two parts of speech co-occur in the sentence based on the number of times that the parts of speech occur in the sentence; calculating posterior probability of co-occurrence of every two parts of speech in the sentence according to the target times; calculating the average posterior probability of each two parts of speech based on the posterior probability of the co-occurrence of each two parts of speech in each sentence; and constructing the dependency syntax distribution according to the part of speech and the average posterior probability.
It will be appreciated that for the sentences in the original text, part-of-speech tagging is performed first, and the number of occurrences of each part of speech in the sentence is counted. The target number of co-occurrences of each two parts of speech in the sentence is then determined from these occurrence counts, recorded as GWC = [GWC_{noun×noun}, GWC_{noun×adjective}, ...]. Based on the target number, the posterior probability of co-occurrence of each two parts of speech is calculated as
P(b | a) = GWC_{a×b} / C(a),
where C(a) denotes the number of occurrences of part of speech a in the sentence.
and then calculating the average posterior probability of each two parts of speech according to the posterior probability of the co-occurrence of each two parts of speech in each sentence. I.e. the posterior probability of each part-of-speech combination is averaged. And constructing the dependency syntax distribution according to the part of speech and the average posterior probability. A directed graph may be constructed to represent the dependency syntactic distribution between parts of speech using the parts of speech as nodes and the average posterior probability as the weight of the edges.
By considering the occurrence counts of the parts of speech and the target co-occurrence counts, the relations between parts of speech can be modeled more accurately than by simply counting the frequencies of parts of speech or of their combinations. The posterior probability calculation takes into account the occurrence of parts of speech in a particular context, thereby better reflecting the correlation between them. Averaging the posterior probabilities quantifies this correlation, which helps to understand the structure and meaning of the sentences in depth. Finally, the dependency syntax distribution constructed from the parts of speech and the average posterior probabilities can be used to analyze the grammatical structure of sentences, including grammatical relations and dependency relations.
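A sketch of the co-occurrence posterior calculation under one assumed reading of the patent's (unreproduced) formula, where GWC(a, b) is taken as C(a)·C(b), the number of ordered (a, b) word pairs in a sentence, and the posterior P(b | a) = GWC(a, b) / C(a) is averaged over the sentences in which the pair occurs:

```python
from collections import Counter
from itertools import product

def average_cooccurrence_posteriors(sentence_pos_counts):
    """Average per-sentence posteriors P(b | a) = GWC(a, b) / C(a) over all
    sentences containing both parts of speech. `sentence_pos_counts` is one
    Counter of part-of-speech occurrence counts per sentence."""
    sums, seen = Counter(), Counter()
    for counts in sentence_pos_counts:
        for a, b in product(counts, repeat=2):
            gwc = counts[a] * counts[b]          # assumed pairing rule
            sums[(a, b)] += gwc / counts[a]      # reduces to counts[b]
            seen[(a, b)] += 1
    return {pair: sums[pair] / seen[pair] for pair in sums}

avg = average_cooccurrence_posteriors([
    Counter({"NN": 2, "VB": 1}),
    Counter({"NN": 1, "VB": 1, "JJ": 1}),
])
```

The resulting dictionary of part-of-speech pairs and average posteriors corresponds to the weighted directed graph the patent describes (parts of speech as nodes, average posteriors as edge weights).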
And b3, taking the confusion degree distribution, the part-of-speech distribution and the dependency syntax distribution as language structure characteristics.
Specifically, by using the confusion degree distribution, the part-of-speech distribution and the dependency syntax distribution as language structure features, the structure and grammar of the original text can be more comprehensively analyzed and understood, thereby providing more accurate basis for subsequent automatic processing and text classification.
Step S204, classifying the original text based on the language structure features to obtain text types corresponding to the original text, wherein the text types comprise model generation types and manual writing types.
Specifically, classifying the original text based on the language structure features to obtain the text type corresponding to the original text, including the following steps c1-c4:
step c1, obtaining the theme type corresponding to the original text.
Topic classification or topic extraction is performed on the original text through text analysis, machine learning, and other methods to determine the topic type of the original text, where the topic types include news, blogs, comments, articles, and the like.
And c2, acquiring a weight value, a threshold range and a preset text type corresponding to the threshold range associated with the theme type.
The weight value of each topic type may be set in advance according to domain knowledge or prior experience, and represents the importance of that topic type. At the same time, one or more threshold ranges are set for each topic type, and a preset text type is associated with each threshold range. For example, when the topic type is news, the corresponding weight value is w1 and the threshold ranges are [a1, a2] and [a3, a4]; the preset text type corresponding to [a1, a2] is the human-written type, and the preset text type corresponding to [a3, a4] is the model-generated type, where a2 is smaller than a3.
And c3, performing a weighted calculation based on the confusion degree distribution, the part-of-speech distribution, the dependency syntax distribution, and the weight value to obtain a weighted value.
The confusion degree distribution (a measure of how difficult the sentences in the text are to predict), the part-of-speech distribution (the distribution of each part of speech in the text), and the dependency syntax distribution (the grammatical relations among the parts of speech) are each combined with the corresponding weight value to obtain the final weighted value.
And c4, determining a target threshold range in which the weighted value falls, and taking a preset text type corresponding to the threshold range as the text type of the original text.
The target threshold range within which the weighted value falls is determined according to the magnitude of the weighted value, and the preset text type corresponding to that threshold range is taken as the text type of the original text. For example, when the topic type is news and the threshold ranges are [a1, a2] and [a3, a4], with [a1, a2] corresponding to the human-written type and [a3, a4] corresponding to the model-generated type: if the weighted value falls within [a1, a2], the text type of the original text is the human-written type; if the weighted value falls within [a3, a4], the text type of the original text is the model-generated type.
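The threshold-range lookup in steps c2-c4 can be sketched as follows; the topic names, ranges, and labels are all hypothetical stand-ins for the configured w1/a1..a4 values:

```python
def classify_by_topic(topic, weighted_value, topic_rules):
    """Return the preset text type whose threshold range contains the weighted
    value for the given topic type; "unknown" if no configured range matches."""
    for (low, high), text_type in topic_rules[topic]:
        if low <= weighted_value <= high:
            return text_type
    return "unknown"

# Hypothetical configuration for the "news" topic:
# [a1, a2] -> human-written, [a3, a4] -> model-generated, with a2 < a3.
rules = {"news": [((0.0, 0.4), "human-written"), ((0.6, 1.0), "model-generated")]}
label = classify_by_topic("news", 0.75, rules)
```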
According to the method and the device, by performing topic classification or topic extraction on the original text, the topic type of the text can be accurately determined, facilitating further text understanding and processing. The weighted calculation over the confusion degree distribution, the part-of-speech distribution, the dependency syntax distribution, and the weight value comprehensively considers the influence of different factors on the text type, strengthening the overall analysis of the text. Combining the three distributions allows the text to be analyzed from different angles, enhancing the grasp of text characteristics and enabling a deeper understanding of text semantics and structure, thereby providing a reliable basis for effectively distinguishing model-generated text from human-written text.
This embodiment also provides a text detection device, which is used to implement the above embodiments and preferred implementations; details already described are not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a text detection apparatus, as shown in fig. 3, including:
an acquiring module 301, configured to acquire an original text to be detected;
the extracting module 302 is configured to extract a sentence contained in the original text, and detect a key feature of the sentence;
a calculation module 303, configured to calculate a language structure feature of the original text using the key feature corresponding to each sentence;
the processing module 304 is configured to classify the original text based on the language structure feature, and obtain a text type corresponding to the original text, where the text type includes a model generation type and a manual writing type.
In some optional embodiments, the extracting module 302 is configured to form a word sequence from the words contained in the sentence and sequentially calculate the generation probability corresponding to each word in the word sequence; detect the part of speech corresponding to each word contained in the sentence, and determine the number of occurrences of each part of speech in the sentence; and take the generation probability corresponding to each word and the number of occurrences of each part of speech in the sentence as the key features.
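A minimal sketch of this key-feature extraction follows. It is not the patent's actual model: generation probabilities are estimated here from bigram counts over a toy corpus (a real system would query a trained language model), and the `TOY_TAGS` lookup stands in for a real part-of-speech tagger; both are assumptions for illustration.

```python
from collections import Counter

# Toy corpus used only to estimate word generation probabilities.
CORPUS = ["the cat sat on the mat", "the dog sat on the rug"]

def bigram_probs(sentence, corpus=CORPUS):
    """Return the generation probability P(w_i | w_{i-1}) for each word."""
    bigrams, unigrams = Counter(), Counter()
    for line in corpus:
        words = ["<s>"] + line.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    probs, prev = [], "<s>"
    for w in sentence.split():
        # add-one smoothing keeps unseen bigrams from yielding zero
        p = (bigrams[(prev, w)] + 1) / (unigrams[prev] + len(unigrams))
        probs.append(p)
        prev = w
    return probs

# Hypothetical tag lookup; a real system would use a POS tagger.
TOY_TAGS = {"the": "DET", "cat": "NOUN", "dog": "NOUN",
            "sat": "VERB", "on": "ADP", "mat": "NOUN", "rug": "NOUN"}

def pos_counts(sentence):
    """Count the number of occurrences of each part of speech."""
    return Counter(TOY_TAGS.get(w, "X") for w in sentence.split())
```

The pair (per-word generation probabilities, per-sentence part-of-speech counts) is then the key feature of the sentence.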
In some alternative embodiments, the computing module 303 includes:
the first calculation unit is used for determining confusion distribution of the original text according to the generation probability of the words contained in each sentence;
a second calculation unit for determining the part-of-speech distribution of the original text according to the number of occurrences of each part of speech in each sentence, and for determining the dependency syntax distribution of the original text;
and the generating unit is used for taking the confusion degree distribution, the part-of-speech distribution and the dependency syntax distribution as language structure characteristics.
In some optional embodiments, the first calculating unit is configured to calculate a statement confusion degree of the statement according to a generation probability corresponding to each word and a text length of the statement; quantifying the sentence confusion degree to obtain a quantified confusion degree; determining the confusion degree quantity of the quantization confusion degree falling into each preset quantization interval, and determining the statement quantity corresponding to the preset quantization interval according to the confusion degree quantity; and calculating the ratio of the statement number corresponding to the preset quantization interval to the statement total number of the original text, and obtaining the confusion distribution.
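The confusion degree (perplexity) computation described above can be sketched as follows, under the standard definition of sentence perplexity; the bin boundaries passed as `bins` are the "preset quantization intervals" and are assumed example values.

```python
import math

def sentence_perplexity(word_probs):
    """PPL = exp(-(1/N) * sum(log p_i)), with N the sentence text length."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

def perplexity_distribution(sentences_probs, bins):
    """sentences_probs: per-sentence lists of word generation probabilities.
    bins: list of (low, high) preset quantization intervals.
    Returns the ratio of sentences falling into each interval to the
    total number of sentences in the original text."""
    ppls = [sentence_perplexity(p) for p in sentences_probs]
    total = len(ppls)
    return [sum(low <= x < high for x in ppls) / total for low, high in bins]
```

For instance, a sentence whose two words each have probability 0.5 has perplexity exactly 2.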
In some optional embodiments, the second calculating unit is configured to obtain a total number of words of the words included in the original text; and calculating the ratio of the occurrence times corresponding to each part of speech to the total number of words to obtain part of speech distribution.
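The part-of-speech distribution described here is a simple ratio, sketched below; the tag names are illustrative.

```python
from collections import Counter

def pos_distribution(tagged_sentences):
    """tagged_sentences: one list of POS tags per sentence of the text.
    Returns, for each part of speech, the ratio of its number of
    occurrences to the total number of words in the original text."""
    counts = Counter(tag for sent in tagged_sentences for tag in sent)
    total = sum(counts.values())
    return {tag: c / total for tag, c in counts.items()}
```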
In some alternative embodiments, the second calculating unit is configured to determine a target number of times each two parts of speech co-occur in the sentence based on the number of times the parts of speech occur in the sentence; calculating posterior probability of co-occurrence of every two parts of speech in the sentence according to the target times; calculating the average posterior probability of each two parts of speech based on the posterior probability of the co-occurrence of each two parts of speech in each sentence; and constructing the dependency syntax distribution according to the part of speech and the average posterior probability.
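One way to read the co-occurrence computation above is sketched below. The patent does not give the exact formula for the "posterior probability of co-occurrence", so estimating it per sentence as the second tag's share of the words accompanying the first tag is an assumption made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def dependency_distribution(tagged_sentences):
    """For every pair of parts of speech, average over the sentences in
    which the pair co-occurs an assumed per-sentence posterior probability:
    count(t2) / (sentence length - 1)."""
    pair_post = defaultdict(list)
    for tags in tagged_sentences:
        counts = Counter(tags)
        total = sum(counts.values())
        for t1, t2 in combinations(sorted(counts), 2):
            post = counts[t2] / (total - 1) if total > 1 else 0.0
            pair_post[(t1, t2)].append(post)
    # average posterior probability of each pair across sentences
    return {pair: sum(v) / len(v) for pair, v in pair_post.items()}
```

The resulting mapping from part-of-speech pairs to average posterior probabilities constitutes the dependency syntax distribution used as a language structure feature.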
In some optional embodiments, the processing module 304 is configured to obtain the topic type corresponding to the original text; obtain the weight values, threshold ranges and preset text types corresponding to the threshold ranges associated with the topic type; perform a weighted calculation based on the confusion degree distribution, the part-of-speech distribution, the dependency syntax distribution and the weight values to obtain a weighted value; and determine the target threshold range into which the weighted value falls, taking the preset text type corresponding to that threshold range as the text type of the original text.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an alternative embodiment of the present invention. As shown in fig. 4, the electronic device includes: one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system).
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.
The memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform the methods of the above embodiments.
The memory 20 may include a storage program area and a storage data area; the storage program area may store an operating system and at least one application program required for functions, and the storage data area may store data created during use of the electronic device, and the like. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, the memory 20 may optionally include memory located remotely from the processor 10, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The electronic device also includes a communication interface 30 for the electronic device to communicate with other devices or communication networks.
The embodiments of the present invention also provide a computer-readable storage medium. The method according to the above embodiments may be implemented in hardware or firmware, or as computer code recorded on a storage medium, or as computer code originally stored on a remote storage medium or a non-transitory machine-readable storage medium, downloaded through a network, and stored on a local storage medium, so that the method described herein may be executed from such a storage medium by a general-purpose computer, a special-purpose processor, or programmable or special-purpose hardware. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid-state disk, or the like; further, the storage medium may also comprise a combination of the above types of memories. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage element that can store or receive software or computer code which, when accessed and executed by the computer, processor or hardware, implements the methods illustrated by the above embodiments.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (10)

1. A method of text detection, the method comprising:
acquiring an original text to be detected;
extracting sentences contained in the original text, and detecting key features of the sentences;
calculating language structure characteristics of the original text by utilizing key characteristics corresponding to each sentence;
classifying the original text based on the language structure features to obtain text types corresponding to the original text, wherein the text types comprise model generation types and manual writing types.
2. The method of claim 1, wherein the detecting key features of the statement comprises:
forming a word sequence from the words contained in the sentence, and sequentially calculating the generation probability corresponding to each word in the word sequence;
detecting the part of speech corresponding to the words contained in the sentence, and determining the occurrence times of the part of speech in the sentence;
and taking the generation probability corresponding to the word and the occurrence frequency of the part of speech in the sentence as the key characteristic.
3. The method according to claim 2, wherein calculating the language structural feature of the original text using the key feature corresponding to each sentence comprises:
determining the confusion degree distribution of the original text according to the generation probability of the words contained in each sentence;
determining the part-of-speech distribution of the original text according to the number of occurrences of each part of speech in each sentence, and determining the dependency syntax distribution of the original text;
and taking the confusion degree distribution, the part-of-speech distribution and the dependency syntax distribution as the language structure features.
4. A method according to claim 3, wherein said determining a confusion distribution of said original text based on the probability of generation of words contained in each of said sentences comprises:
calculating statement confusion of the statement according to the generation probability corresponding to each word and the text length of the statement;
quantifying the statement confusion degree to obtain quantified confusion degree;
determining the confusion degree quantity of quantization confusion degrees falling into each preset quantization interval, and determining the statement quantity corresponding to the preset quantization interval according to the confusion degree quantity;
and calculating the ratio between the statement number corresponding to the preset quantization interval and the statement total number of the original text to obtain the confusion distribution.
5. A method according to claim 3, wherein said determining the part-of-speech distribution of the original text based on the number of occurrences of part-of-speech in each of the sentences comprises:
acquiring the total number of words contained in the original text;
and calculating the ratio of the occurrence times corresponding to each part of speech to the total number of words to obtain the part of speech distribution.
6. The method of claim 3, wherein said determining the dependency syntax distribution of the original text based on the number of occurrences of parts of speech in each of the sentences comprises:
determining the target number of times of co-occurrence of each two parts of speech in the sentence based on the number of times of occurrence of the parts of speech in the sentence;
calculating posterior probability of co-occurrence of every two parts of speech in the sentence according to the target times;
calculating the average posterior probability of each two parts of speech based on the posterior probability of the co-occurrence of each two parts of speech in each sentence;
and constructing the dependency syntax distribution according to the part of speech and the average posterior probability.
7. A method according to claim 3, wherein classifying the original text based on the language structural features to obtain a text type corresponding to the original text comprises:
obtaining a theme type corresponding to the original text;
acquiring a weight value, a threshold range and a preset text type corresponding to the threshold range associated with the theme type;
performing a weighted calculation based on the confusion degree distribution, the part-of-speech distribution, the dependency syntax distribution and the weight value to obtain a weighted value;
and determining a target threshold range within which the weighted value falls, and taking a preset text type corresponding to the threshold range as the text type of the original text.
8. A text detection device, the device comprising:
the acquisition module is used for acquiring an original text to be detected, wherein the original text is automatically generated by a text generation tool;
the extraction module is used for extracting sentences contained in the original text and detecting key features of the sentences;
the calculation module is used for calculating language structure characteristics of the original text by utilizing key characteristics corresponding to each sentence;
and the processing module is used for classifying the original text based on the language structure characteristics to obtain a text type corresponding to the original text, wherein the text type comprises a model generation type and a manual writing type.
9. An electronic device, comprising:
a memory and a processor in communication with each other, the memory having stored therein computer instructions which, upon execution, cause the processor to perform the method of any of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
CN202410121988.0A 2024-01-29 2024-01-29 Text detection method and device, electronic equipment and storage medium Pending CN117807236A (en)

Publication Number: CN117807236A; Publication Date: 2024-04-02