CN115796160B - Thesis redundant data cleaning method and device based on lexical affix and storage medium - Google Patents

Publication number
CN115796160B
Authority
CN
China
Prior art keywords
text
prefix
preset
module
cleaning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211586218.0A
Other languages
Chinese (zh)
Other versions
CN115796160A (en)
Inventor
郭东恩
曲凯扬
郭丰硕
吴泽琛
周卓柯
贾超鑫
黄晓红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanyang Institute of Technology
Original Assignee
Nanyang Institute of Technology
Application filed by Nanyang Institute of Technology
Priority to CN202211586218.0A
Publication of CN115796160A
Application granted
Publication of CN115796160B

Abstract

The embodiments of the application disclose a method, a device and a storage medium for cleaning paper redundant data based on lexical affix, belonging to the technical field of text classification preprocessing. The method comprises: performing virtual word cleaning on the text to be classified based on lexical characteristics in natural language processing, and obtaining the real word part as a real word text; performing affix cleaning on the real word text to obtain a cleaned text as an affix-free text; acquiring all domain terms from the affix-free text based on a preset domain term table, and forming a domain term set from the acquired domain terms; performing module identification on the affix-free text, removing interference text from the identified modules based on a preset interference list, and taking the text with the interference text removed as a metadata text; and classifying the metadata text and the elements of the domain term set as plain text. The method and the device help to reduce network resource consumption in text classification and save the cost of text classification to a certain extent.

Description

Thesis redundant data cleaning method and device based on lexical affix and storage medium
Technical Field
The application relates to the technical field of text classification preprocessing, in particular to a method, a device, equipment and a storage medium for cleaning paper redundant data based on lexical affix.
Background
The rapid development of the Internet has brought great convenience to people's lives, and resources of all kinds are growing exponentially. Scientific paper resources on the network have grown accordingly, which causes information overload when people try to acquire them. How to organize and manage scientific paper resources effectively is the key to solving this information overload problem.
At present, an effective way to manage scientific paper resources is to classify them. In the existing classification approach, however, several feasible labels are selected on the basis of the whole article: the full text is segmented into words and then classified. When many texts are classified, this consumes a large amount of network resources, much of it wasted on content that is useless for the classification result. The prior art therefore suffers from excessive consumption of network resources during text classification.
Disclosure of Invention
The embodiment of the application aims to provide a method, a device, equipment and a storage medium for cleaning paper redundant data based on lexical affix, which are used for cleaning redundant data of texts to be classified before classification so as to solve the problem of excessive network resource consumption in text classification in the prior art.
In order to solve the above technical problems, the embodiment of the present application provides a method for cleaning paper redundant data based on lexical affix, which adopts the following technical scheme:
a thesis redundant data cleaning method based on lexical affix includes:
acquiring a text to be classified;
performing the virtual word cleaning on the text to be classified based on the lexical characteristics in the natural language processing to obtain a real word part as a real word text, wherein the virtual word cleaning comprises the step of representing the virtual word part in the text to be classified by a preset specific symbol based on a real word and virtual word distinguishing rule in the lexical characteristics;
performing affix cleaning on the real word text based on a preset grammar prefix and suffix screening method to obtain a cleaned text as an affix-free text, wherein the affix cleaning comprises cleaning the real word text based on preset prefix unit bytes to generate a prefix-free text, and cleaning the prefix-free text based on preset suffix unit bytes;
acquiring all domain terms from the affix-free text based on a preset domain term table, and forming a domain term set from the acquired domain terms, wherein the domain term set can contain repeated elements;
performing module identification on the affix-free text based on a preset module mark list and a reference specification, removing interference text from the identified modules based on a preset interference list after the module identification is completed, and taking the text with the interference text removed as a metadata text;
the metadata text and the domain term set elements are classified as plain text.
Further, the step of representing the part of the virtual word in the text to be classified with a preset specific symbol based on the real word and virtual word distinguishing rule in the lexical property includes:
searching for all virtual words in the text to be classified based on a virtual word table preset according to the lexical characteristics, and replacing the text content corresponding to each virtual word with a preset identifier.
Further, the cleaning the real word text based on the preset prefix unit byte to generate a prefix-free text includes:
based on a preset prefix unit table, using the elements in the table as prefix unit bytes, screening the words with prefix unit bytes out of the real word text, and representing those words distinctively.
Further, the cleaning the prefix-free text based on the preset suffix unit byte includes:
based on a preset suffix unit table, using the elements in the table as suffix unit bytes, screening the words with suffix unit bytes out of the prefix-free text, and representing those words distinctively.
Further, the acquiring all domain terms from the affix-free text based on the preset domain term table, and forming the acquired domain terms into a domain term set includes:
screening domain term vocabulary out of the affix-free text based on a scientific text platform or a professional domain glossary constructed in advance.
Further, the performing module identification on the affix-free text based on the preset module mark list and the reference specification comprises:
based on the text information in the preset module mark list, identifying the positions at which that text information appears in the affix-free text, and taking the text fragments between different pieces of text information as conventional unit modules;
based on the reference specification, acquiring the text information in the affix-free text, including the font size and format of the text and the preset punctuation marks and special symbols between texts, screening out the text fragments delimited by differences in font size or format or by the preset punctuation marks and special symbols, and taking these text fragments as special unit modules.
Further, after the module identification is completed, performing interference text removal on the identified module based on the preset interference list includes:
screening the conventional unit modules and the special unit modules based on the interference information in the interference list, and, if a module contains interference information, deleting the text content corresponding to the interference information, wherein the interference information comprises preset interference words and interference symbols.
In order to solve the technical problems, the embodiment of the application also provides a thesis redundant data cleaning device based on lexical affix, which adopts the following technical scheme:
an apparatus for cleaning paper redundant data based on lexical affix, comprising:
a text acquisition module, configured to acquire a text to be classified;
a virtual word cleaning module, configured to perform virtual word cleaning on the text to be classified based on lexical characteristics in natural language processing to obtain the real word part as a real word text, wherein the virtual word cleaning comprises representing the virtual word part in the text to be classified with a preset specific symbol based on the real word and virtual word distinguishing rule in the lexical characteristics;
an affix cleaning module, configured to perform affix cleaning on the real word text based on a preset grammar prefix and suffix screening method to obtain a cleaned text as an affix-free text, wherein the affix cleaning comprises cleaning the real word text based on preset prefix unit bytes to generate a prefix-free text, and cleaning the prefix-free text based on preset suffix unit bytes;
a domain term acquisition module, configured to acquire all domain terms from the affix-free text based on a preset domain term table, and to form a domain term set from the acquired domain terms, wherein the domain term set may contain repeated elements;
an interference information removal module, configured to perform module identification on the affix-free text based on a preset module mark list and a reference specification, to remove interference text from the identified modules based on a preset interference list after the module identification is completed, and to take the text with the interference text removed as a metadata text;
a text classification module, configured to classify the metadata text and the elements of the domain term set as plain text.
In order to solve the above technical problems, the embodiments of the present application further provide a computer device, which adopts the following technical schemes:
a computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the lexical affix-based paper redundant data cleaning method of the embodiments of the application when executing the computer program.
In order to solve the above technical problems, embodiments of the present application further provide a non-volatile computer readable storage medium, which adopts the following technical solutions:
a non-volatile computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the lexical affix-based paper redundant data cleaning method of the embodiments of the application.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
The embodiments of the application disclose a method, a device, equipment and a storage medium for cleaning paper redundant data based on lexical affix. Performing virtual word cleaning and prefix and suffix cleaning on the text to be classified, based on word characteristics, effectively determines the substantive text range of the text to be classified, so that the excessive consumption of network resources that occurs when the search range is too large is avoided. Acquiring the domain terms in the text to be classified preliminarily determines part of the useful information in that text and improves the accuracy of search results. Finally, removing the interference information from the text to be classified further improves search accuracy and avoids unnecessary waste of network resources. The lexical affix-based paper redundant data cleaning method therefore helps to reduce network resource consumption during text classification and saves the cost of text classification to a certain extent.
Drawings
For a clearer description of the solution in the present application, a brief description will be given below of the drawings that are needed in the description of the embodiments of the present application, it being obvious that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is an exemplary system architecture diagram in which embodiments of the present application may be applied;
FIG. 2 is a flow chart of one embodiment of the lexical affix-based paper redundant data cleaning method in accordance with embodiments of the present application;
FIG. 3 is a diagram illustrating an embodiment of virtual word cleaning according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an embodiment of a lexical affix-based paper redundant data cleaning device according to the embodiments of the present application;
FIG. 5 is a schematic structural diagram of the affix cleaning module in the embodiment of the present application;
FIG. 6 is a schematic structural diagram of the interference information removal module in an embodiment of the present application;
FIG. 7 is a schematic diagram of one embodiment of a computer device in an embodiment of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description and claims of the present application and in the description of the figures above are intended to cover non-exclusive inclusions. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to better understand the technical solutions of the present application, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the method for cleaning paper redundant data based on the lexical affix provided in the embodiments of the present application is generally executed by a server/terminal device, and accordingly, the device for cleaning paper redundant data based on the lexical affix is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flowchart of one embodiment of the lexical affix-based paper redundant data cleaning method of the present application is shown. The method comprises the following steps:
Step 201, acquiring a text to be classified.
In this embodiment, the text to be classified may be obtained from a cache, may be captured from an online paper platform, or may be passed in as a new text awaiting classification.
Step 202, performing virtual word cleaning on the text to be classified based on lexical characteristics in natural language processing, and obtaining the real word part as a real word text, wherein the virtual word cleaning comprises representing the virtual word part in the text to be classified with a preset specific symbol based on the real word and virtual word distinguishing rule in the lexical characteristics.
In this embodiment, the real word and virtual word distinguishing rule is set as follows. One possible way: in Chinese, real words (content words) and virtual words (function words) are divided according to whether a word carries concrete meaning. Real words comprise six classes, namely nouns, verbs, adjectives, measure words, numerals and pronouns, which can accurately express specific meanings, such as "sun", "synthesis" and "upper face". Virtual words comprise six classes, namely adverbs, auxiliary words, conjunctions, prepositions, interjections and onomatopoeia, which cannot be used independently and only play an auxiliary role, such as "most", "therefore" and "so".
Another possible way: in English, real words comprise six classes, namely nouns, verbs, adjectives, adverbs, numerals and pronouns, which can accurately express specific meanings, such as "class", "she", "right", "one", "is" and "now". Virtual words comprise four classes, namely articles, conjunctions, prepositions and interjections, which cannot accurately express specific meanings, such as "an", "from", "but" and "oh". Different languages classify parts of speech differently, but in essence the distinction is always based on the meaning of the word.
In some embodiments of the present application, in step 202, representing the virtual word part in the text to be classified with a preset specific symbol based on the real word and virtual word distinguishing rule in the lexical characteristics includes: searching for all virtual words in the text to be classified based on a virtual word table preset according to the lexical characteristics, and replacing the text content corresponding to each virtual word with a preset identifier.
For example, if the virtual word table includes words such as "most" and "therefore", then "most" and "therefore" are each used as query fields to query the content of the text to be classified; wherever "most" or "therefore" is found, it is replaced with a space or another symbol such as "x". After the replacement, the text to be classified contains only real word content and the replacement symbols.
Referring specifically to FIG. 3, which illustrates an embodiment of virtual word cleaning in the embodiment of the present application: 301 shows part of a text fragment to be classified before virtual word cleaning, 302 shows some of the virtual words in the virtual word table, and 303 shows the content of the fragment after virtual word cleaning.
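The table lookup and replacement described for step 202 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `clean_virtual_words`, the three-entry table and the `*` placeholder are assumptions made for the example, and whole-word matching with `\b` only suits space-delimited languages such as English (Chinese text would first need word segmentation).

```python
import re

# Hypothetical virtual word table; the description only names examples
# such as "most", "therefore" and "so".
VIRTUAL_WORDS = ["most", "therefore", "so"]


def clean_virtual_words(text, virtual_words=VIRTUAL_WORDS, placeholder="*"):
    """Replace every whole-word occurrence of a virtual word with a placeholder."""
    if not virtual_words:
        return text
    # Longest entries first, so multi-word entries win over their parts.
    alternatives = sorted(map(re.escape, virtual_words), key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)
    return pattern.sub(placeholder, text)


real_word_text = clean_virtual_words("Therefore the most stable catalyst was selected")
print(real_word_text)
```

The result keeps the real words in place and leaves a placeholder where each virtual word stood, which mirrors the before and after fragments that FIG. 3 is described as showing.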
Step 203, performing affix cleaning on the real word text based on a preset grammar prefix and suffix screening method to obtain a cleaned text as an affix-free text, wherein the affix cleaning comprises cleaning the real word text based on preset prefix unit bytes to generate a prefix-free text, and cleaning the prefix-free text based on preset suffix unit bytes.
The preset grammar prefix and suffix screening method is implemented as follows. Some words are formed from preset prefixes, for example the prefix "upper" (as in "above", "upper side"), the prefix "lower" (as in "below", "underside"), the prefix "left" (as in "left side"), the prefix "little" (as in "Little Fang"), the prefix "old" (as in "teacher", "Old Wang") and the ordinal prefix (as in "first", "second"). Here "above" and "below" denote orientation when used, and "first" and "second" denote ranking; such words are referred to as words with a specific prefix in a Chinese corpus. Likewise, some words are classed as suffix words, such as "Mr. X" (Mr. Liu, Mr. Wang), "Miss X" (Miss Zhang, Miss Li), "XX-ized" (normalized, standardized, militarized) and "X-zi" (Laozi, Kongzi, Mozi). Here "Mr. Liu", "Mr. Wang", "Miss Zhang" and "Miss Li" denote forms of address when used, "Laozi", "Kongzi" and "Mozi" refer to names in a domain, and "normalized", "standardized" and "militarized" refer to certain specifications or criteria; such words are referred to as words with a specific suffix in a Chinese corpus.
Another possible implementation is as follows: English also distinguishes prefixes from suffixes. In English usage there is a common affix table containing prefixes and suffixes, also called root words, such as an-, ant-, pro-, pre-, dis- and un-, and the same root is often used across a class of words.
In some embodiments of the present application, in step 203, cleaning the real word text based on the preset prefix unit bytes to generate the prefix-free text includes: based on a preset prefix unit table, using the elements in the table as prefix unit bytes, screening the words with prefix unit bytes out of the real word text, and representing those words distinctively.
In some embodiments of the present application, in step 203, cleaning the prefix-free text based on the preset suffix unit bytes includes: based on a preset suffix unit table, using the elements in the table as suffix unit bytes, screening the words with suffix unit bytes out of the prefix-free text, and representing those words distinctively.
When words with prefix or suffix unit bytes are represented distinctively, the specific way is as follows: the words that carry a prefix or suffix are first queried from the text, and the corresponding words are then either replaced with a specific character symbol or deleted directly.
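The two-pass affix cleaning of step 203 can be sketched as below, assuming the text has already been split into words. The unit tables, the function names `clean_by_units` and `affix_clean`, and the choice between deletion and a placeholder are illustrative assumptions; the description leaves the table contents and the replacement symbol open.

```python
# Hypothetical affix unit tables; the entries echo examples from the description.
PREFIX_UNITS = ["pre", "dis", "un"]
SUFFIX_UNITS = ["ization", "ized"]


def clean_by_units(words, units, at_start, placeholder=None):
    """Delete (or replace with a placeholder) every word carrying a listed unit.

    at_start=True tests the units as prefixes, at_start=False as suffixes.
    A word that consists of the unit alone is left untouched.
    """
    cleaned = []
    for word in words:
        lower = word.lower()
        hit = any(
            len(lower) > len(unit)
            and (lower.startswith(unit) if at_start else lower.endswith(unit))
            for unit in units
        )
        if not hit:
            cleaned.append(word)
        elif placeholder is not None:
            cleaned.append(placeholder)
    return cleaned


def affix_clean(words):
    """Prefix pass over the real word text, then suffix pass over the result."""
    prefix_free = clean_by_units(words, PREFIX_UNITS, at_start=True)
    return clean_by_units(prefix_free, SUFFIX_UNITS, at_start=False)


print(affix_clean(["unstable", "standardized", "catalyst", "preheating", "reaction"]))
```

Plain string matching of this kind over-matches (a table entry "pre" would also remove "present"), so a practical table would probably list whole words or be paired with a morphological check.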
Step 204, acquiring all domain terms from the affix-free text based on a preset domain term table, and forming a domain term set from the acquired domain terms, wherein the domain term set can contain repeated elements.
The preset domain glossary is a data table composed of terms specific to a domain. For example, the chemical field has molecular structure names, chemical formula names and the like, and a data table composed of such common names forms the glossary of the chemical field. Similarly, the software field has proprietary names: the Java language contains many different API classes, and these class names can be composed into a domain glossary. Essentially, a domain glossary is a list of the names of things proprietary to a particular domain. When terms from such a glossary appear in a scientific paper, they indicate that the paper at least studies the related terms or is closely related to them; the domain terms are therefore included as part of the final classification text.
In some embodiments of the present application, in step 204, acquiring all domain terms from the affix-free text based on the preset domain glossary and forming a domain term set from the acquired domain terms includes: screening domain term vocabulary out of the affix-free text based on a scientific text platform or a professional domain glossary constructed in advance.
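A minimal sketch of the glossary lookup in step 204 follows. Because the domain term set may contain repeated elements, the example returns a list and shows a `Counter` as one possible multiset view. The glossary contents and the name `collect_domain_terms` are assumptions for illustration; exact, case-sensitive matching is chosen because the example glossary holds Java class names.

```python
from collections import Counter

# Hypothetical glossary for the software domain: a few Java API class names.
JAVA_GLOSSARY = {"ArrayList", "HashMap", "StringBuilder"}


def collect_domain_terms(words, glossary):
    """Return glossary hits in order of appearance, duplicates included."""
    return [word for word in words if word in glossary]


words = "ArrayList wraps an array while HashMap and ArrayList differ".split()
domain_terms = collect_domain_terms(words, JAVA_GLOSSARY)
term_counts = Counter(domain_terms)  # repeated elements are kept, not collapsed
print(domain_terms, term_counts)
```

Keeping duplicates preserves how often a term occurs, which a downstream classifier could use as a weight; whether the original method uses the counts this way is not stated.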
Step 205, performing module identification on the affix-free text based on a preset module mark list and a reference specification, removing interference text from the identified modules based on a preset interference list after the module identification is completed, and taking the text with the interference text removed as a metadata text.
The main purpose of performing module identification on the affix-free text is to divide it into small unit modules, which facilitates the subsequent removal of interference text.
In some embodiments of the present application, in step 205, performing module identification on the affix-free text based on the preset module mark list and the reference specification includes: identifying, based on the text information in the preset module mark list, the positions at which that text information appears in the affix-free text, and taking the text fragments between different pieces of text information as conventional unit modules; and, based on the reference specification, acquiring the text information in the affix-free text, including the font size and format of the text and the preset punctuation marks and special symbols between texts, screening out the text fragments delimited by differences in font size or format or by the preset punctuation marks and special symbols, and taking these text fragments as special unit modules.
Module identification of the affix-free text based on the preset module mark list is implemented as follows. A module mark list is obtained from the sections common to paper texts, such as the title, abstract, keywords, introduction, body, references and appendix. When the title is used as a module mark, it is an element of the module mark list; for example, in a crawler-based approach, the text between the <h1> and </h1> tags of the paper text is crawled as the title. When the title is found, the corresponding element of the module mark list is matched, the crawler mechanism is triggered, and the content belonging to the title is taken directly as the title module. The other elements of the module mark list are handled in the same manner, with the corresponding text content taken directly as separate modules. Modules obtained in this way are referred to as conventional unit modules.
Module identification of the affix-free text based on the reference specification is implemented as follows. The entire text is searched for a particular punctuation mark; for example, the text between two occurrences of a preset punctuation mark is taken as one module part, and the modules obtained by this splitting are then named. Modules obtained in this way are referred to as special unit modules. Another implementation of module identification is as follows: font characterization information, such as the font size and format of different texts, is obtained from the affix-free text, and texts with the same font size and the same format are taken as one module.
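The marker-based splitting into conventional unit modules can be sketched as follows. The sketch assumes plain text in which each marker appears literally as a heading word; the marker list, the name `split_into_modules` and the stripping of a trailing colon are illustrative choices, and the crawler-based `<h1>` extraction and the font-based variant are not covered.

```python
import re

# Hypothetical module mark list built from common paper sections.
MODULE_MARKERS = ["Abstract", "Keywords", "Introduction", "References"]


def split_into_modules(text, markers=MODULE_MARKERS):
    """Locate each marker and map it to the text that runs up to the next marker."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, markers)) + r")\b")
    hits = list(pattern.finditer(text))
    modules = {}
    for i, hit in enumerate(hits):
        end = hits[i + 1].start() if i + 1 < len(hits) else len(text)
        # Keep the first occurrence of a marker; later repeats are ignored.
        modules.setdefault(hit.group(1), text[hit.end():end].strip(" :\n"))
    return modules


paper = ("Abstract: We study catalysts. Keywords: catalyst; zeolite "
         "Introduction: Zeolites are porous.")
print(split_into_modules(paper))
```

Matching is case-sensitive, so a lower-case "abstract" inside a sentence is not treated as a marker; a capitalised marker word in running text would still split a module, which is a limitation of this simple approach.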
In some embodiments of the present application, in step 205, removing interference text from the identified modules based on the preset interference list after the module identification is completed includes: screening the conventional unit modules and the special unit modules based on the interference information in the interference list and, if a module contains interference information, deleting the text content corresponding to the interference information, wherein the interference information comprises preset interference words and interference symbols.
The interference information comprises frequently occurring interference words, such as "abstract" and "keywords", as well as punctuation marks. The interference information in each module's text is removed according to the interference list, which ensures the purity of the text.
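Removing interference words and symbols from a module can be sketched as below. The interference list contents and the name `remove_interference` are assumptions for the example; the description only states that the list holds preset interference words and interference symbols.

```python
# Hypothetical interference list: section labels plus punctuation marks.
INTERFERENCE_WORDS = ["Abstract", "Keywords"]
INTERFERENCE_SYMBOLS = ";:,.()[]"


def remove_interference(module_text, words=INTERFERENCE_WORDS,
                        symbols=INTERFERENCE_SYMBOLS):
    """Delete listed words and symbols, then collapse the leftover whitespace."""
    for word in words:
        module_text = module_text.replace(word, " ")
    # Map every interference symbol to a space in a single pass.
    module_text = module_text.translate(str.maketrans({s: " " for s in symbols}))
    return " ".join(module_text.split())


print(remove_interference("Keywords: catalyst; zeolite."))
```

Replacing with a space rather than an empty string keeps neighbouring words from being fused together when a symbol sat between them.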
Step 206, classifying the metadata text and the elements of the domain term set as plain text.
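How the two outputs might be combined into a single plain-text input for a classifier is sketched below. The description does not specify the classifier or the joining format, so the function name `build_classifier_input` and the space-separated concatenation are assumptions.

```python
def build_classifier_input(metadata_text, domain_terms):
    """Join the metadata text and the domain terms (duplicates kept) into one string."""
    return " ".join([metadata_text, *domain_terms]).strip()


plain_text = build_classifier_input("catalyst zeolite", ["ArrayList", "ArrayList"])
print(plain_text)
```

Appending the terms with their repetitions lets a bag-of-words style classifier see term frequency without a separate feature channel.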
According to the lexical affix-based paper redundant data cleaning method of this embodiment, performing virtual word cleaning and prefix and suffix cleaning on the text to be classified, based on word characteristics, effectively determines the substantive text range of the text to be classified, so that the excessive consumption of network resources that occurs when the search range is too large is avoided. Acquiring the domain terms in the text to be classified preliminarily determines part of the useful information in that text and improves the accuracy of search results. Finally, removing the interference information from the text to be classified further improves search accuracy and avoids unnecessary waste of network resources. The method therefore helps to reduce network resource consumption during text classification and saves the cost of text classification to a certain extent.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a computer-readable storage medium, and that the program, when executed, may include the steps of the method embodiments described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (Read-Only Memory, ROM), or it may be a random access memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution may not necessarily be sequential, but may be performed in rotation or alternatively with at least some of the other steps or stages.
With further reference to fig. 4, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a lexical affix-based paper redundant data cleaning apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 4, the lexical affix-based paper redundant data cleaning apparatus according to the present embodiment includes: a text acquisition module 401, a virtual word cleaning module 402, an affix cleaning module 403, a domain term acquisition module 404, an interference information removal module 405 and a text classification module 406. Wherein:
the text acquisition module 401 is configured to acquire a text to be classified;
the virtual word cleaning module 402 is configured to perform virtual word cleaning on the text to be classified based on lexical characteristics in natural language processing, and obtain a real word part as a real word text, where the virtual word cleaning includes representing the virtual word part in the text to be classified with a preset specific symbol based on a real word and virtual word distinguishing rule in the lexical characteristics;
the affix cleaning module 403 is configured to perform prefix cleaning on the real word text based on a preset grammatical prefix and suffix screening method, obtain the cleaned text, and generate a prefix-free text, where the prefix cleaning includes cleaning the real word text based on preset prefix unit bytes to generate a prefix-free text, and cleaning the prefix-free text based on preset suffix unit bytes;
the domain term acquisition module 404 is configured to acquire all domain terms from the prefix-free text based on a preset domain term table, and form a domain term set from the acquired domain terms, where the domain term set may include repeated elements;
the interference information removal module 405 is configured to perform module identification on the prefix-free text based on a preset module tag list and a reference specification, and, after the module identification is completed, remove interference text from the identified modules based on a preset interference list, and use the text from which the interference text has been removed as a metadata text;
the text classification module 406 is configured to classify the metadata text and the elements of the domain term set as plain text.
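As a non-limiting illustration of how modules 402 to 405 hand their outputs to one another, the data flow can be sketched as a function that chains four stage callables. The function name `run_cleaning_pipeline` and its parameter names are hypothetical; the stage implementations are supplied by the caller and are not specified here.

```python
from typing import Callable, List, Tuple


def run_cleaning_pipeline(
    text: str,
    clean_virtual_words: Callable[[str], str],
    clean_affixes: Callable[[str], str],
    get_domain_terms: Callable[[str], List[str]],
    remove_interference: Callable[[str], str],
) -> Tuple[str, List[str]]:
    """Chain the stages in the order of modules 402, 403, 404 and 405."""
    real_word_text = clean_virtual_words(text)             # module 402
    prefix_free_text = clean_affixes(real_word_text)       # module 403
    domain_terms = get_domain_terms(prefix_free_text)      # module 404
    metadata_text = remove_interference(prefix_free_text)  # module 405
    # Both results are then passed to the text classification module 406.
    return metadata_text, domain_terms
```

Note that, in this sketch, the domain terms and the metadata text are both derived from the same prefix-free text, which matches the order in which the modules are described.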
In some embodiments of the present application, when representing the virtual word part in the text to be classified with a preset specific symbol based on the real word and virtual word distinguishing rule in the lexical characteristics, the virtual word cleaning module 402 is specifically configured to find all virtual words in the text to be classified based on a virtual word list preset according to the lexical characteristics, and to replace the text content corresponding to each virtual word with a preset identifier.
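As a non-limiting illustration, the replacement of virtual words by a preset identifier can be sketched as follows. The word list, the identifier `<VW>`, the English example words and the simple regular-expression tokenizer are all assumptions for this example; the application leaves the list and the identifier to be preset.

```python
import re
from typing import List

# Hypothetical virtual word list and identifier (not taken from the application).
VIRTUAL_WORDS = {"the", "of", "and", "in", "a"}
IDENTIFIER = "<VW>"


def mark_virtual_words(text: str) -> List[str]:
    """Tokenize the text and replace every listed virtual word with the identifier."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [IDENTIFIER if tok.lower() in VIRTUAL_WORDS else tok for tok in tokens]


def real_word_text(text: str) -> str:
    """Keep only the tokens that were not replaced, i.e. the real word part."""
    return " ".join(tok for tok in mark_virtual_words(text) if tok != IDENTIFIER)
```

Because the tokenizer splits punctuation into single characters, a literal `<VW>` in the input can never survive as one token, so the identifier cannot collide with input text in this sketch.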
In some embodiments of the present application, as shown in fig. 5, which is a schematic structural diagram of the affix cleaning module 403 in an embodiment of the present application, the affix cleaning module 403 includes a prefix cleaning unit 403a and a suffix cleaning unit 403b.
In some embodiments of the present application, the prefix cleaning unit 403a is configured to, based on a preset prefix unit table and taking the elements in the table as prefix unit bytes, screen out text containing prefix unit bytes from the prefix-free text and represent that text in a distinguishing manner.
In some embodiments of the present application, the suffix cleaning unit 403b is configured to, based on a preset suffix unit table and taking the elements in the table as suffix unit bytes, screen out text containing suffix unit bytes from the prefix-free text and represent that text in a distinguishing manner.
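As a non-limiting illustration, screening words against prefix and suffix unit tables can be sketched as follows. The table contents are hypothetical English affixes, and the "distinguishing representation" is interpreted here, as an assumption, as recording which unit matched alongside each word; the application does not specify the representation.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical unit tables (the application only says they are preset).
PREFIX_UNITS = ("pre", "un", "re")
SUFFIX_UNITS = ("ing", "tion", "ly")


def match_unit(word: str, units: Sequence[str], at_start: bool) -> Optional[str]:
    """Return the longest unit that the word starts or ends with, if any."""
    for unit in sorted(units, key=len, reverse=True):
        hit = word.startswith(unit) if at_start else word.endswith(unit)
        # Require the word to be longer than the unit so a bare unit is not flagged.
        if hit and len(word) > len(unit):
            return unit
    return None


def screen_affixes(words: List[str]) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Tag each word with the prefix unit and suffix unit it contains, if any."""
    tagged = []
    for word in words:
        lowered = word.lower()
        tagged.append(
            (
                word,
                match_unit(lowered, PREFIX_UNITS, at_start=True),
                match_unit(lowered, SUFFIX_UNITS, at_start=False),
            )
        )
    return tagged
```

A purely string-based match of this kind will also flag words that merely begin or end with the same letters as a unit, so a practical table would need to be curated for the target language.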
In some embodiments of the present application, when acquiring all domain terms from the prefix-free text based on a preset domain term table and forming the acquired domain terms into a domain term set, the domain term acquisition module 404 is specifically configured to screen domain term vocabulary from the prefix-free text based on a scientific text platform or a professional domain term table constructed in advance.
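As a non-limiting illustration, collecting domain terms into a set that may contain repeated elements can be sketched with a list, so that every occurrence is retained. The term table below is hypothetical; in practice it would come from the scientific text platform or the pre-built professional domain term table mentioned above.

```python
import re
from collections import Counter
from typing import List

# Hypothetical domain term table (illustrative entries only).
DOMAIN_TERMS = ["text classification", "neural network"]


def collect_domain_terms(text: str) -> List[str]:
    """Return one list entry per occurrence of each domain term in the text."""
    lowered = text.lower()
    found: List[str] = []
    for term in DOMAIN_TERMS:
        pattern = rf"\b{re.escape(term)}\b"
        found.extend(term for _ in re.finditer(pattern, lowered))
    return found


def term_frequencies(text: str) -> Counter:
    """Summarize the repeated elements as occurrence counts."""
    return Counter(collect_domain_terms(text))
```

Keeping duplicates preserves term frequency, which a downstream classifier may use as a weight; whether the application uses the repeats this way is not stated.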
In some embodiments of the present application, as shown in fig. 6, fig. 6 is a schematic structural diagram of an interference information removal module 405 in an embodiment of the present application, where the interference information removal module 405 includes a conventional unit module acquiring unit 405a, a special unit module acquiring unit 405b, and an interference text removing unit 405c.
In some embodiments of the present application, the conventional unit module acquiring unit 405a is configured to identify, based on the text information in a preset module tag list, the positions where that text information appears in the prefix-free text, and to take the text segments between different pieces of text information as conventional unit modules.
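As a non-limiting illustration, splitting a text into conventional unit modules at the positions of module tags can be sketched as follows. The tag list is hypothetical, and treating the text after the last tag as a module of its own is an assumption of this sketch.

```python
import re
from typing import List, Tuple

# Hypothetical module tag list (the application only says it is preset).
MODULE_TAGS = ["Abstract", "Keywords", "Introduction", "References"]


def split_conventional_modules(text: str) -> List[Tuple[str, str]]:
    """Locate each tag and return (tag, text up to the next tag) pairs."""
    pattern = "|".join(re.escape(tag) for tag in MODULE_TAGS)
    hits = list(re.finditer(pattern, text))
    modules = []
    for current, following in zip(hits, hits[1:] + [None]):
        # A module ends where the next tag starts, or at the end of the text.
        end = following.start() if following is not None else len(text)
        modules.append((current.group(), text[current.end():end].strip()))
    return modules
```

This sketch matches tags wherever they occur, so a tag word that also appears in running text would cause an unwanted split; a practical implementation would likely restrict matches to heading positions.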
In some embodiments of the present application, the special unit module acquiring unit 405b is configured to acquire, based on the reference specification, the text information in the prefix-free text, including the font size and format of the text and the preset punctuation marks and special symbol information between texts; to screen out the text fragments lying between texts of different font sizes and formats and between the preset punctuation marks and special symbol information; and to take those text fragments as special unit modules.
In some embodiments of the present application, the interference text removing unit 405c is configured to screen the conventional unit module and the special unit module based on the interference information in the interference list, and if the interference information exists in the conventional unit module and the special unit module, delete text content corresponding to the interference information, where the interference information includes preset interference text and interference symbols.
According to the lexical affix-based paper redundant data cleaning apparatus, performing virtual word cleaning and prefix and suffix cleaning on the text to be classified based on lexical characteristics effectively determines the substantive text range of the text to be classified, which avoids the excessive consumption of network resources that occurs when the search range is too large. Acquiring the domain terms in the text to be classified preliminarily determines part of the relevant useful information of the text and improves the accuracy of search results. Finally, eliminating the interference information of the text to be classified both improves search accuracy and avoids unnecessary waste of part of the network resources. The apparatus therefore helps to reasonably reduce network resource consumption during text classification and saves the cost of text classification to a certain extent.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 7, fig. 7 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 7 comprises a memory 7a, a processor 7b and a network interface 7c communicatively connected to each other via a system bus. It should be noted that only a computer device 7 having components 7a-7c is shown in the figure, but it should be understood that not all of the illustrated components need to be implemented, and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device herein is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a digital signal processor (Digital Signal Processor, DSP), an embedded device, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 7a includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 7a may be an internal storage unit of the computer device 7, such as a hard disk or a memory of the computer device 7. In other embodiments, the memory 7a may also be an external storage device of the computer device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (Flash Card) or the like provided on the computer device 7. Of course, the memory 7a may also comprise both an internal storage unit of the computer device 7 and an external storage device. In this embodiment, the memory 7a is generally used for storing an operating system and various application software installed on the computer device 7, such as the program code of the lexical affix-based paper redundant data cleaning method. Further, the memory 7a may be used to temporarily store various types of data that have been output or are to be output.
The processor 7b may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 7b is typically used to control the overall operation of the computer device 7. In this embodiment, the processor 7b is configured to execute the program code stored in the memory 7a or process data, for example, execute the program code of the lexical affix-based method for cleaning redundant paper data.
The network interface 7c may comprise a wireless network interface or a wired network interface, which network interface 7c is typically used for establishing a communication connection between the computer device 7 and other electronic devices.
The present application further provides another embodiment, namely, provides a non-volatile computer readable storage medium, where a lexical affix-based paper redundancy data cleaning program is stored, where the lexical affix-based paper redundancy data cleaning program may be executed by at least one processor, so that the at least one processor performs the steps of the lexical affix-based paper redundancy data cleaning method as described above.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.
It is apparent that the embodiments described above are only some, not all, of the embodiments of the present application. The drawings show preferred embodiments of the present application but do not limit its patent scope. The present application may be embodied in many different forms; these embodiments are provided so that the present disclosure will be understood more thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may still be made to the embodiments described in the foregoing, or equivalents may be substituted for elements thereof. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise fall within the protection scope of the present application.

Claims (6)

1. A lexical affix-based paper redundant data cleaning method, characterized by comprising the following steps:
acquiring a text to be classified;
performing virtual word cleaning on the text to be classified based on lexical characteristics in natural language processing to obtain a real word part as a real word text, wherein the virtual word cleaning comprises representing the virtual word part in the text to be classified with a preset specific symbol based on a real word and virtual word distinguishing rule in the lexical characteristics; the representing the virtual word part in the text to be classified with a preset specific symbol comprises: finding all virtual words in the text to be classified based on a virtual word list preset according to the lexical characteristics, and replacing the text content corresponding to the virtual words with a preset identifier;
performing prefix cleaning on the real word text based on a preset grammatical prefix and suffix screening method, obtaining the cleaned text, and generating a prefix-free text, wherein the prefix cleaning comprises cleaning the real word text based on preset prefix unit bytes to generate a prefix-free text, and cleaning the prefix-free text based on preset suffix unit bytes; the cleaning the real word text based on preset prefix unit bytes to generate the prefix-free text comprises: based on a preset prefix unit table, taking the elements in the table as prefix unit bytes, screening out text containing prefix unit bytes from the prefix-free text, and representing the text containing prefix unit bytes in a distinguishing manner; the cleaning the prefix-free text based on preset suffix unit bytes comprises: based on a preset suffix unit table, taking the elements in the table as suffix unit bytes, screening out text containing suffix unit bytes from the real word text, and representing the text containing suffix unit bytes in a distinguishing manner;
acquiring all domain terms from the prefix-free text based on a preset domain term table, and forming a domain term set from the acquired domain terms, wherein the domain term set may contain repeated elements;
performing module identification on the prefix-free text based on a preset module tag list and a reference specification, removing interference text from the identified modules based on a preset interference list after the module identification is completed, and taking the text from which the interference text has been removed as a metadata text; the performing module identification on the prefix-free text based on the preset module tag list and the reference specification comprises: identifying, based on the text information in the preset module tag list, the positions where the text information appears in the prefix-free text, and taking the text fragments between different pieces of text information as conventional unit modules; acquiring, based on the reference specification, the text information in the prefix-free text, including the font size and format of the text and the preset punctuation marks and special symbol information between texts, screening out the text fragments between texts of different font sizes and formats and between the preset punctuation marks and special symbol information, and taking those text fragments as special unit modules;
classifying the metadata text and the elements of the domain term set as plain text.
2. The lexical affix-based paper redundant data cleaning method according to claim 1, wherein the acquiring all domain terms from the prefix-free text based on the preset domain term table and forming a domain term set from the acquired domain terms comprises:
screening domain term vocabulary from the prefix-free text based on a scientific text platform or a professional domain term table constructed in advance.
3. The lexical affix-based paper redundant data cleaning method according to claim 1, wherein the removing interference text from the identified modules based on the preset interference list after the module identification is completed comprises:
screening the conventional unit modules and the special unit modules based on the interference information in the interference list, and deleting the text content corresponding to the interference information if the interference information exists in the conventional unit modules and the special unit modules, wherein the interference information comprises preset interference text and interference symbols.
4. A lexical affix-based paper redundant data cleaning apparatus, characterized by comprising:
a text acquisition module, configured to acquire a text to be classified;
a virtual word cleaning module, configured to perform virtual word cleaning on the text to be classified based on lexical characteristics in natural language processing to obtain a real word part as a real word text, wherein the virtual word cleaning comprises representing the virtual word part in the text to be classified with a preset specific symbol based on a real word and virtual word distinguishing rule in the lexical characteristics; the representing the virtual word part in the text to be classified with a preset specific symbol comprises: finding all virtual words in the text to be classified based on a virtual word list preset according to the lexical characteristics, and replacing the text content corresponding to the virtual words with a preset identifier;
an affix cleaning module, configured to perform prefix cleaning on the real word text based on a preset grammatical prefix and suffix screening method, obtain the cleaned text, and generate a prefix-free text, wherein the prefix cleaning comprises cleaning the real word text based on preset prefix unit bytes to generate a prefix-free text, and cleaning the prefix-free text based on preset suffix unit bytes; the cleaning the real word text based on preset prefix unit bytes to generate the prefix-free text comprises: based on a preset prefix unit table, taking the elements in the table as prefix unit bytes, screening out text containing prefix unit bytes from the prefix-free text, and representing the text containing prefix unit bytes in a distinguishing manner; the cleaning the prefix-free text based on preset suffix unit bytes comprises: based on a preset suffix unit table, taking the elements in the table as suffix unit bytes, screening out text containing suffix unit bytes from the real word text, and representing the text containing suffix unit bytes in a distinguishing manner;
a domain term acquisition module, configured to acquire all domain terms from the prefix-free text based on a preset domain term table, and form a domain term set from the acquired domain terms, wherein the domain term set may contain repeated elements;
an interference information removal module, configured to perform module identification on the prefix-free text based on a preset module tag list and a reference specification, remove interference text from the identified modules based on a preset interference list after the module identification is completed, and take the text from which the interference text has been removed as a metadata text; the performing module identification on the prefix-free text based on the preset module tag list and the reference specification comprises: identifying, based on the text information in the preset module tag list, the positions where the text information appears in the prefix-free text, and taking the text fragments between different pieces of text information as conventional unit modules; acquiring, based on the reference specification, the text information in the prefix-free text, including the font size and format of the text and the preset punctuation marks and special symbol information between texts, screening out the text fragments between texts of different font sizes and formats and between the preset punctuation marks and special symbol information, and taking those text fragments as special unit modules;
a text classification module, configured to classify the metadata text and the elements of the domain term set as plain text.
5. A computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor, when executing the computer program, implementing the steps of the lexical affix-based paper redundancy data cleaning method of any one of claims 1 to 3.
6. A non-transitory computer readable storage medium, wherein a computer program is stored on the non-transitory computer readable storage medium, which when executed by a processor, implements the steps of the lexical affix-based paper redundancy data cleaning method of any one of claims 1 to 3.
CN202211586218.0A 2022-12-09 2022-12-09 Thesis redundant data cleaning method and device based on lexical affix and storage medium Active CN115796160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211586218.0A CN115796160B (en) 2022-12-09 2022-12-09 Thesis redundant data cleaning method and device based on lexical affix and storage medium

Publications (2)

Publication Number Publication Date
CN115796160A CN115796160A (en) 2023-03-14
CN115796160B true CN115796160B (en) 2024-04-09

Family

ID=85418659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211586218.0A Active CN115796160B (en) 2022-12-09 2022-12-09 Thesis redundant data cleaning method and device based on lexical affix and storage medium

Country Status (1)

Country Link
CN (1) CN115796160B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6757692B1 (en) * 2000-06-09 2004-06-29 Northrop Grumman Corporation Systems and methods for structured vocabulary search and classification
CN106815293A (en) * 2016-12-08 2017-06-09 中国电子科技集团公司第三十二研究所 System and method for constructing knowledge graph for information analysis
CN112036120A (en) * 2020-08-31 2020-12-04 上海硕恩网络科技股份有限公司 Skill phrase extraction method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6978275B2 (en) * 2001-08-31 2005-12-20 Hewlett-Packard Development Company, L.P. Method and system for mining a document containing dirty text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
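Wu Xuling. Dual-mode deep learning text classification method based on hybrid optimization. Journal of Southwest University (Natural Science Edition). 2022, Vol. 44 (No. 11), pp. 235-241. *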

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant