CN111209737B - Method for screening out noise document and computer readable storage medium - Google Patents

Info

Publication number
CN111209737B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911398056.6A
Other languages
Chinese (zh)
Other versions
CN111209737A (en
Inventor
王子玥
章正道
栾江霞
徐晓文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meiya Pico Information Co Ltd
Original Assignee
Xiamen Meiya Pico Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meiya Pico Information Co Ltd filed Critical Xiamen Meiya Pico Information Co Ltd
Priority to CN201911398056.6A priority Critical patent/CN111209737B/en
Publication of CN111209737A publication Critical patent/CN111209737A/en
Application granted granted Critical
Publication of CN111209737B publication Critical patent/CN111209737B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a screening method of noise documents and a computer readable storage medium, wherein the method comprises the following steps: according to the seed word set, retrieving to obtain an original corpus; extracting effective texts from the original corpus; dividing sentences of the effective texts, and performing data cleaning; obtaining key words in the co-occurrence sentences to obtain a keyword set; obtaining a related class keyword list according to the seed word set, the keyword set and a preset related class high-frequency word set; respectively calculating the appearance proportion of each relevant word in the relevant class keyword table in the effective text as a key syntactic component to obtain the keyword weight of each relevant word; respectively calculating the weight of the key words of each irrelevant word; acquiring related words and irrelevant words in the effective text, and calculating the score of the effective text according to the corresponding weight of the keywords; and if the score is smaller than a preset threshold value, judging the text to be the noise text. The invention can eliminate irrelevant texts and improve the corpus quality of the search results.

Description

Method for screening out noise document and computer readable storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method for screening out noise documents and a computer-readable storage medium.
Background
With the rapid expansion and growth of network data, data search is a key way for people to extract required information from massive data. Related knowledge of news, comments and the like of the events of the objects of interest can be acquired by effectively setting search conditions and key fields. Meanwhile, the establishment of each closed-loop big data center also generally requires a data search service separated from the internet environment. Accurate data search based on semantic information can help people to obtain required information in a local environment, the requirement of data closed loop is guaranteed, meanwhile, high-quality search results are provided, and convenience is brought to data governance.
In the prior art, optimization of database retrieval results follows two main directions. The first is online optimization, which relies mainly on the topological structure of webpage links, as in PageRank. The second is optimization of offline search results, which usually relies on machine learning with labeled training data: the data are divided into related samples and noise samples, and a classifier such as a support vector machine or a Bayesian method is trained on them. However, online algorithms depend on the links among contents and on the browsing tracks of internet users, which are features that do not exist or cannot be obtained in an offline database. Machine-learning classification, in turn, consumes a large amount of labor and generalizes poorly. Requiring the organization's staff or the searcher to label data before searching reduces how often the database can be searched and thus lowers human-machine efficiency.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: a method for screening out noise documents and a computer readable storage medium are provided, which can effectively remove noise corpora irrelevant to a target in a search result and retain corpora relevant to the search target.
In order to solve the technical problems, the invention adopts the technical scheme that: a method of screening out noisy documents, comprising:
retrieving according to a preset seed word set to obtain an original corpus;
extracting effective texts from the original corpus according to the format of the original corpus;
the effective texts are divided into sentences, and data cleaning is carried out on the effective texts;
performing word segmentation on the effective text, and performing part-of-speech recognition and syntactic analysis on each word obtained by word segmentation to obtain part-of-speech and syntactic components of each word;
acquiring a co-occurrence sentence containing at least one seed word from each clause of the effective text;
acquiring key words in the co-occurrence sentences according to preset key syntax components and key parts of speech to obtain a key word set;
obtaining a related class keyword list according to the seed word set, the keyword set and a preset related class high-frequency word set;
respectively calculating the appearance proportion of each relevant word in the relevant class keyword table as a key syntactic component in the effective text to obtain the keyword weight of each relevant word, wherein the keyword weight of each relevant word is a positive value;
obtaining an irrelevant keyword list according to a preset irrelevant high-frequency word set;
respectively calculating the appearance proportion of each irrelevant word in the irrelevant key word list as a key syntactic component in the effective text to obtain the keyword weight of each irrelevant word, wherein the keyword weight of each irrelevant word is a negative value;
acquiring related words and irrelevant words in the effective text according to the related keyword table and the irrelevant keyword table, and calculating the score of the effective text according to the corresponding keyword weight;
and if the score of the effective text is smaller than a preset threshold value, judging that the effective text is a noise text.
The invention also proposes a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps as described above.
The invention has the beneficial effects that: removing character-level noise information and language segments with less semantics or incomplete semantics by cleaning data; by carrying out sentence segmentation and word segmentation on the effective text, the subsequent analysis of each sentence and the matching of each word are facilitated; the relevance of key words is ensured by acquiring the key words from the co-occurrence sentences; the method comprises the steps that a seed word set, a keyword set and a preset related high-frequency word set are combined to obtain a related keyword list, and a group of related words of a basic coverage event can be formed; obtaining the keyword weights of the related words and the irrelevant words by calculating the appearance proportion of the related words and the irrelevant words in the effective text as key syntactic components, so that the weights are evaluated by scoring different positions, wherein part of the positions are higher, and the other positions are lower or zero; and calculating the score of the effective text according to the number of relevant words and irrelevant words hit by the effective text and the keyword weight of the relevant words and the irrelevant words, and finally judging whether the effective text is the noise text according to the score.
The method can be used for solving the problems of poor search results and more noise corpora in the database search according to the seed words; the keyword list is semantically expanded, so that coarse search data can be screened, irrelevant texts are removed, the corpus quality of search results is improved, and convenience is provided for data center management data.
Drawings
FIG. 1 is a flow chart of a method for screening out noise documents according to the present invention;
FIG. 2 is a flowchart of the method according to the first embodiment of the invention.
Detailed Description
In order to explain technical contents, objects and effects of the present invention in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
The most key concept of the invention is as follows: extracting more keywords based on seed words by combining two-dimensional semantic features of part of speech information and syntax information and feature weights; extracting related high-frequency words and unrelated high-frequency words according to the related samples and the unrelated samples; taking the appearance proportion of the relevant words and the irrelevant words as key syntactic components as the corresponding keyword weight; and calculating text scores according to the relevant words and irrelevant words hit by the text and the weight of the keywords of the relevant words and irrelevant words, and judging the text type according to the scores.
Referring to fig. 1, a method for screening out a noise document includes:
retrieving according to a preset seed word set to obtain an original corpus;
extracting effective texts from the original corpus according to the format of the original corpus;
the effective texts are divided into sentences, and data cleaning is carried out on the effective texts;
performing word segmentation on the effective text, and performing part-of-speech recognition and syntactic analysis on each word obtained by word segmentation to obtain part-of-speech and syntactic components of each word;
acquiring a co-occurrence sentence containing at least one seed word from each clause of the effective text;
acquiring key words in the co-occurrence sentences according to preset key syntax components and key parts of speech to obtain a key word set;
obtaining a related class keyword list according to the seed word set, the keyword set and a preset related class high-frequency word set;
respectively calculating the appearance proportion of each relevant word in the relevant class keyword table as a key syntactic component in the effective text to obtain the keyword weight of each relevant word, wherein the keyword weight of each relevant word is a positive value;
obtaining an irrelevant keyword list according to a preset irrelevant high-frequency word set;
respectively calculating the appearance proportion of each irrelevant word in the irrelevant keyword list as a key syntactic component in the effective text to obtain the keyword weight of each irrelevant word, wherein the keyword weight of each irrelevant word is a negative value;
acquiring related words and irrelevant words in the effective text according to the related key word table and the irrelevant key word table, and calculating the score of the effective text according to the corresponding weight of the key words;
and if the score of the effective text is smaller than a preset threshold value, judging that the effective text is a noise text.
From the above description, the beneficial effects of the present invention are: the method can be used for solving the problems of poor search results and more noise corpora in the database search according to the seed words.
Further, if the score of the valid text is smaller than a preset threshold, after determining that the valid text is a noise text, the method further includes:
and deleting the noise text.
From the above description, by deleting the noise text in the search result, the accuracy of the search result is improved.
Further, the sentence dividing and data cleaning of the effective text specifically includes:
according to a preset sentence break symbol, the effective text is divided into sentences;
filtering characters in the effective text according to a preset character blacklist, wherein the character blacklist comprises English symbols, English letters and Chinese symbols except for sentence break symbols;
and filtering the clauses in the effective text according to the preset length of the language segment.
From the above description, it can be known that the noise information at the character level in the effective text can be filtered, and meanwhile, the speech segments with less semantics or missing semantics can be filtered.
Further, the obtaining of the key words in the co-occurrence sentence according to preset key syntax components and key parts of speech specifically includes:
if the part of speech of a word in the co-occurrence sentence belongs to a preset key part of speech and the syntactic component of the word belongs to a preset key syntactic component, taking the word as a key word;
and obtaining key words in the co-occurrence sentences to obtain a keyword set.
Further, before obtaining the related class keyword list according to the seed word set, the keyword set and the preset related class high-frequency word set, the method further includes:
acquiring a preset first sample and a preset second sample, wherein the first sample is a sample related to the expected search content, and the second sample is a sample unrelated to the expected search content;
performing word segmentation on the first sample and the second sample respectively to obtain a second word set and a third word set;
and respectively obtaining the words with the highest word frequency in the second word set and the third word set and obtaining a related high-frequency word set and an unrelated high-frequency word set.
According to the description, the relevant high-frequency words and the irrelevant high-frequency words are extracted from the relevant samples and the irrelevant samples, so that the screening quality and manual controllability can be ensured.
Further, after the performing word segmentation on the first sample and the second sample to obtain a second word set and a third word set, the method further includes:
and deleting stop words in the second word set and the third word set or deleting stop words in the related high-frequency word set and the unrelated high-frequency word set according to a preset stop word list.
Further, after obtaining the related class keyword list according to the seed word set, the keyword set and the preset related class high-frequency word set, the method further includes:
and deleting the stop words in the related keyword list according to a preset stop word list.
According to the description, the stop words are deleted, so that the influence of the stop words on the recognition result of the subsequent text can be avoided, and the recognition accuracy is ensured.
Further, the obtaining of the related terms and the unrelated terms in the effective text according to the related keyword table and the unrelated keyword table, and calculating the score of the effective text according to the corresponding keyword weight specifically include:
respectively acquiring related words and irrelevant words in each clause of an effective text according to the related keyword list and the irrelevant keyword list;
calculating the score of each clause according to the weight of the keywords of the related words and the weight of the keywords of the unrelated words in each clause;
and calculating the score of the effective text according to the proportion of each clause in the effective text and the score of each clause.
As can be seen from the above description, the more related words contained in the valid text, the higher the score, and the more unrelated words contained, the lower the score.
Further, before calculating the score of the valid text according to the specific gravity of each clause in the valid text and the score of each clause, the method further includes:
respectively counting the number of characters of each clause to obtain the length of each clause;
counting the total number of words of the effective text to obtain the length of the effective text;
and respectively calculating the proportion of the length of each clause in the length of the effective text to obtain the proportion of each clause in the effective text.
As can be seen from the above description, the specific gravity is the ratio of the length of a single clause to the total length of the text.
The invention also proposes a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps as described above.
Example one
Referring to fig. 2, a first embodiment of the present invention is: a method for screening out noise document can be used for screening out noise document in Internet information, and is suitable for providing search optimization in computer database system. The method comprises the following steps:
S1: Retrieve from a database according to a preset seed word set to obtain an original corpus; specifically, the seed words are used as search words, and the corpus information that matches successfully is obtained from a local database.
The original corpus is obtained from the database by logical matching of seed words: a group of seed words is connected by the logical operators "AND" and "OR" to select texts. For example, "nation A OR city C" denotes the texts in which "nation A" or "city C" appears.
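The seed-word logical matching of step S1 can be sketched as follows. This is a minimal illustration that assumes the corpus is an in-memory list of strings and that matching is plain substring search; the function names `matches_any`, `matches_all` and `retrieve` are hypothetical and do not come from the patent.

```python
from typing import Iterable, List


def matches_any(text: str, seed_words: Iterable[str]) -> bool:
    """'OR' logic: True if at least one seed word occurs in the text."""
    return any(word in text for word in seed_words)


def matches_all(text: str, seed_words: Iterable[str]) -> bool:
    """'AND' logic: True only if every seed word occurs in the text."""
    return all(word in text for word in seed_words)


def retrieve(corpus: List[str], seed_words: List[str], mode: str = "or") -> List[str]:
    """Return the texts in `corpus` that satisfy the seed-word condition."""
    predicate = matches_any if mode == "or" else matches_all
    return [text for text in corpus if predicate(text, seed_words)]


corpus = [
    "nation A announced a new policy",
    "city C hosted a summit attended by nation A",
    "the weather was fine today",
]
hits_or = retrieve(corpus, ["nation A", "city C"], mode="or")
hits_and = retrieve(corpus, ["nation A", "city C"], mode="and")
```

A real deployment would presumably push the same condition into the local database's query language instead of scanning texts in memory.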
In this embodiment, internet information is taken as an example, and the obtained corpus is mainly news, an online report and related comment information. Due to the fact that local database searching is conducted, different texts are obtained according to seed word matching and are irrelevant to pages, urls and the like where original data are located.
S2: and extracting effective texts from the original corpus according to the format of the original corpus. Namely, the original corpus is sorted, and different format contents are extracted and stored aiming at different types of texts.
Because news reports, social comments, netizen posts and the like all have different text formats, fine-grained effective information fields, such as the news title, the body text, the content posted by a netizen, and the forwarded content, are extracted according to the respective format. This embodiment mainly addresses two data formats: long texts such as news and internet articles, and short microblog texts. News titles are obtained by a crawler configured for each news website, since page layouts differ from site to site. Short microblog texts are handled mainly by format matching: like posts on other social platforms, microblog posts carry strong format information in which different symbols have different meanings, so the posts are matched against these formats to extract the main sentence information of the text.
Specifically, the effective text extracted in this embodiment includes a body title, body content, a forwarding title, and forwarding content. First, the effective text other than the body content is extracted from the original corpus according to preset regular expressions. For example, topic content is extracted according to "#? # "; the title of a forwarded article is extracted according to "[ words? "; and the forwarding account content is extracted according to "/@? ". The body content is then obtained by deleting these extracted parts from the original corpus.
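A possible shape for this extraction is sketched below. The regular expressions quoted in the text are garbled, so the three patterns here are assumptions based on common microblog conventions (a topic between two `#`, a title in `【】` brackets, a forwarding account after `//@`), not the patent's actual expressions; `extract_fields` is likewise a hypothetical name.

```python
import re
from typing import Dict, List, Union

# ASSUMED patterns: illustrative guesses, not the expressions used in the patent.
TOPIC_RE = re.compile(r"#(.+?)#")          # topic between two '#'
TITLE_RE = re.compile(r"【(.+?)】")         # forwarded-article title in brackets
FORWARD_RE = re.compile(r"//@(.+?)[:：]")   # forwarding account before a colon


def extract_fields(raw: str) -> Dict[str, Union[str, List[str]]]:
    """Pull out topic, title and forwarding-account fields; the rest is the body."""
    fields: Dict[str, Union[str, List[str]]] = {
        "topics": TOPIC_RE.findall(raw),
        "titles": TITLE_RE.findall(raw),
        "accounts": FORWARD_RE.findall(raw),
    }
    body = raw
    for pattern in (TOPIC_RE, TITLE_RE, FORWARD_RE):
        body = pattern.sub("", body)  # delete the extracted parts from the raw text
    fields["body"] = body.strip()
    return fields


post = "#flood# 【City C issues alert】 Heavy rain continues tonight //@user1: stay safe"
fields = extract_fields(post)
```

Deleting the matched spans and keeping what remains mirrors the description that the body content is obtained by removing the other extracted parts.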
S3: and carrying out sentence segmentation on the effective text, and carrying out data cleaning on the effective text. In this embodiment, the data cleaning mainly removes punctuation and removes semantic less language segments.
Firstly, sentence boundary definition is carried out on the effective text according to preset punctuation marks, wherein the punctuation marks comprise commas, semicolons, periods, exclamation marks, question marks and long spaces, and each clause of the effective text is obtained.
Since the valid text may contain character-level noise, such as non-Chinese characters, unrecognizable characters, and redundant punctuation, the characters in the valid text need to be filtered. Specifically, characters belonging to a preset character blacklist are removed from the valid text. The character blacklist includes English symbols (symbols after 0x21 in ASCII), English letters (uppercase and lowercase letters in ASCII), and Chinese symbols other than sentence break symbols, and may also include some unrecognizable characters. Removing these preset characters filters out the character-level noise in the valid text.
Meanwhile, the valid text contains some language segments that carry too little semantic information to form sentences, for example incomplete information forwarded by netizens from other people, or segments of one or two words that merely express emotion; these also need to be filtered. Specifically, the number of words in each clause of the valid text is counted as the length of that clause, and clauses whose length is smaller than a preset segment length (e.g., 4) are deleted. Language segments in preset formats, such as the salutation and signature of an email or the forwarding account name of a microblog text, are further deleted from the valid text.
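The sentence splitting and cleaning of step S3 might look like the sketch below. Several details are assumptions rather than statements from the patent: a "long space" is taken to be two or more whitespace characters, the list of Chinese symbols in the blacklist is only a small sample, digits are kept, and clause length is counted in characters.

```python
import re
from typing import List

# Sentence-break symbols named in the text (Chinese and ASCII forms), plus
# runs of two or more whitespace characters standing in for a "long space".
BREAK_RE = re.compile(r"[,，;；.。!！?？]|\s{2,}")

# Character blacklist: ASCII symbols and letters, plus a few Chinese symbols
# that are not sentence breaks (this Chinese symbol list is an assumption).
BLACKLIST_RE = re.compile(
    r"[\x21-\x2f\x3a-\x40\x5b-\x60\x7b-\x7eA-Za-z、“”‘’《》（）【】…—：]"
)

MIN_CLAUSE_LEN = 4  # example value given in the text


def split_and_clean(text: str, min_len: int = MIN_CLAUSE_LEN) -> List[str]:
    """Split into clauses, strip blacklisted characters, drop short fragments."""
    clauses = []
    for clause in BREAK_RE.split(text):
        cleaned = BLACKLIST_RE.sub("", clause).strip()
        if len(cleaned) >= min_len:  # too-short fragments carry little meaning
            clauses.append(cleaned)
    return clauses


sample = "甲国今日发布《新政策》！哈哈。丙市随后表示将积极配合abc，好的"
clauses = split_and_clean(sample)
```

Splitting happens before character filtering so that ASCII break symbols still delimit clauses even though they also fall inside the blacklisted range.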
S4: Perform word segmentation on the valid text, and perform part-of-speech recognition and syntactic analysis on each word obtained, to get the part of speech and the syntactic component of each word. The results of part-of-speech recognition and syntactic analysis are then stored as features together with the corresponding words of the valid text. In other words, word segmentation, part-of-speech recognition and syntactic analysis are performed on the sorted corpus data, and the semantic information (the results of these three analyses) is stored with the valid text in a standardized form.
S5: and acquiring a co-occurrence sentence containing at least one seed word from each clause of the effective text, namely if a clause in the effective text contains at least one seed word, the clause is a co-occurrence sentence.
For example, assuming that the seed words include "nation A" and "city C", if "nation A" or "city C" appears in a clause of the valid text, or both appear together, the clause is considered a co-occurrence sentence.
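The co-occurrence test of step S5 reduces to a simple filter. The sketch below assumes clauses are plain strings and uses substring matching; `cooccurrence_sentences` is a hypothetical name.

```python
from typing import Iterable, List


def cooccurrence_sentences(clauses: Iterable[str], seed_words: Iterable[str]) -> List[str]:
    """Keep the clauses that contain at least one seed word."""
    seeds = list(seed_words)
    return [clause for clause in clauses if any(seed in clause for seed in seeds)]


clauses = [
    "nation A imposed new tariffs",
    "city C and nation A signed a deal",
    "the weather was fine",
]
co_sentences = cooccurrence_sentences(clauses, ["nation A", "city C"])
```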
S6: Acquire key words in the co-occurrence sentences according to preset key syntactic components and key parts of speech to obtain a keyword set; that is, in each co-occurrence sentence, acquire the words whose part of speech belongs to the preset key parts of speech and whose syntactic component belongs to the preset key syntactic components, and add them to the keyword set.
The syntactic components comprise seven types: subjects, predicates, objects, attributives, adverbials, complements, and core verbs. The key syntactic components in this embodiment are the subject, predicate, object, and core verb. Parts of speech follow the national 863 part-of-speech tagging standard, and the key parts of speech in this embodiment comprise adjectives, morphemes, idioms, abbreviations, suffixes, numbers, general nouns, directional nouns, human names, organization names, place names, time nouns, other nouns, vocabularies and verbs.
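Given tokens that already carry a part of speech and a syntactic component (the output of step S4), the keyword selection of step S6 can be sketched as below. The `Token` type and the tag names in `KEY_POS` and `KEY_ROLES` are hypothetical placeholders; a real tagger and parser would use their own label sets, and `KEY_POS` lists only a few of the key parts of speech named in the text.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Token:
    word: str
    pos: str   # part-of-speech tag
    role: str  # syntactic component label


# Hypothetical label names, chosen for readability only.
KEY_POS = {"noun", "verb", "adjective", "place_name", "person_name", "org_name"}
KEY_ROLES = {"subject", "predicate", "object", "core_verb"}


def extract_keywords(co_sentences: List[List[Token]]) -> Set[str]:
    """Collect words whose POS and syntactic component are both 'key'."""
    keywords: Set[str] = set()
    for tokens in co_sentences:
        for token in tokens:
            if token.pos in KEY_POS and token.role in KEY_ROLES:
                keywords.add(token.word)
    return keywords


sentence = [
    Token("nation A", "place_name", "subject"),
    Token("announced", "verb", "core_verb"),
    Token("new", "adjective", "attributive"),  # key POS but non-key component
    Token("sanctions", "noun", "object"),
]
keywords = extract_keywords([sentence])
```

Both conditions must hold at once, which is why "new" is excluded here even though adjectives are a key part of speech.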
S7: and acquiring a related high-frequency word set and an irrelevant high-frequency word set.
Specifically, a preset first sample and a preset second sample are obtained, wherein the first sample is related to the expected search content and the second sample is unrelated to it. The first sample and the second sample are then segmented into words to obtain a second word set and a third word set, respectively. Finally, a preset number (e.g., 10) of the words with the highest word frequency are taken from the second word set and from the third word set to obtain the related class high-frequency word set and the unrelated class high-frequency word set, respectively.
The first sample and the second sample can be randomly obtained from an original corpus, and then are manually marked, and the marked content is whether related to the expected search content or not; the already labeled sample may also be obtained additionally.
Further, according to a preset stop word list, stop words are deleted either from the second word set and the third word set after these are obtained, or from the related class high-frequency word set and the unrelated class high-frequency word set after these are obtained.
By acquiring the related high-frequency word set and the irrelevant high-frequency word set from the related samples and the irrelevant samples, the subsequent screening quality can be ensured. By deleting stop words such as auxiliary words, prepositions and the like, the influence of the stop words on the recognition result of the subsequent text can be avoided.
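Step S7 amounts to a frequency count with stop-word removal. The sketch below assumes the samples are already segmented into word lists and removes stop words before counting (one of the two orders the text allows); `top_words` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, List, Set


def top_words(tokenized_samples: Iterable[List[str]], stop_words: Set[str], n: int = 10) -> List[str]:
    """Return the n most frequent non-stop words across the samples."""
    counts = Counter(
        word
        for sample in tokenized_samples
        for word in sample
        if word not in stop_words
    )
    return [word for word, _ in counts.most_common(n)]


related_samples = [
    ["nation", "the", "sanctions", "of"],
    ["sanctions", "nation", "trade"],
]
stop_words = {"the", "of"}
related_high_freq = top_words(related_samples, stop_words, n=2)
```

The same function would be called on the unrelated samples to obtain the unrelated class high-frequency word set.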
S8: obtaining a related class keyword list according to the seed word set, the keyword set and a preset related class high-frequency word set; the words in the three sets are collected to obtain a related class keyword list.
And further, deleting the stop words in the related keyword list according to a preset stop word list.
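Building the related class keyword list in step S8 is a set union followed by stop-word removal. A minimal sketch, with `build_related_table` as a hypothetical name and made-up example words:

```python
from typing import Iterable, Set


def build_related_table(
    seed_words: Iterable[str],
    keyword_set: Iterable[str],
    related_high_freq: Iterable[str],
    stop_words: Iterable[str],
) -> Set[str]:
    """Union of the three word sources, minus stop words."""
    table = set(seed_words) | set(keyword_set) | set(related_high_freq)
    return table - set(stop_words)


related_table = build_related_table(
    seed_words=["nation A", "city C"],
    keyword_set=["sanctions", "announced", "the"],
    related_high_freq=["sanctions", "trade"],
    stop_words=["the"],
)
```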
S9: and respectively calculating the appearance proportion of each relevant word in the relevant class keyword table as a key syntactic component in the effective text to obtain the keyword weight of each relevant word, wherein the keyword weight of each relevant word is a positive value.
For example, if the related word "nation A" appears 4 times in total in a valid text, once as the subject and the other 3 times as an attributive (which is not a key syntactic component), then the keyword weight of "nation A" is 1/4 = 0.25.
S10: obtaining an irrelevant keyword list according to a preset irrelevant high-frequency word set; i.e. the set of irrelevant class high-frequency words is used as the irrelevant class keyword list.
S11: and respectively calculating the appearance proportion of each irrelevant word in the irrelevant keyword list as a key syntactic component in the effective text to obtain the keyword weight of each irrelevant word, wherein the keyword weight of each irrelevant word is a negative value.
For example, if the irrelevant word "nation B" appears 4 times in total in a valid text, once as the subject and the other 3 times as an attributive (which is not a key syntactic component), then the keyword weight of "nation B" is -1/4 = -0.25.
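The weight computation of steps S9 and S11 can be sketched with a single helper that takes a sign. The input format, a sequence of (word, syntactic component) pairs, and the label names in `KEY_ROLES` are assumptions made for illustration; `keyword_weights` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

KEY_ROLES = {"subject", "predicate", "object", "core_verb"}  # hypothetical labels


def keyword_weights(
    tagged_words: Iterable[Tuple[str, str]],
    table: Set[str],
    sign: float = 1.0,
) -> Dict[str, float]:
    """Weight = sign * (occurrences as a key component) / (all occurrences)."""
    total: Dict[str, int] = defaultdict(int)
    as_key: Dict[str, int] = defaultdict(int)
    for word, role in tagged_words:
        if word in table:
            total[word] += 1
            if role in KEY_ROLES:
                as_key[word] += 1
    return {word: sign * as_key[word] / total[word] for word in total}


tagged = (
    [("nation A", "subject")] + [("nation A", "attributive")] * 3
    + [("nation B", "subject")] + [("nation B", "attributive")] * 3
)
related_weights = keyword_weights(tagged, {"nation A"}, sign=1.0)
irrelevant_weights = keyword_weights(tagged, {"nation B"}, sign=-1.0)
```

This reproduces the two worked examples in the text: 0.25 for the related word and -0.25 for the irrelevant word.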
S12: and acquiring related words and irrelevant words in the effective text according to the related keyword list and the irrelevant keyword list, and calculating the score of the effective text according to the corresponding keyword weight.
Specifically, related words and irrelevant words in each clause of an effective text are respectively obtained according to the related keyword list and the irrelevant keyword list; calculating the score of each clause according to the keyword weight of the related words and the keyword weight of the unrelated words in each clause; and calculating the score of the effective text according to the specific gravity of each clause in the effective text and the score of each clause. And calculating the proportion of the length (number of characters) of each clause in the total length of the effective text to obtain the proportion of each clause in the effective text.
This step can be calculated according to the following formula:
$$\mathrm{score}(S) = \sum_{i} \frac{|S_i|}{|S|} \sum_{k=1}^{j} p_k$$
wherein $S$ is the valid text, $|S|$ is the length of the valid text, $S_i$ is the $i$-th clause of $S$, $|S_i|$ is the length of the $i$-th clause, $j$ is the total number of related words and irrelevant words in $S_i$ (that is, $S_i$ hits $j$ keywords), and $p_k$ is the keyword weight of the $k$-th related or irrelevant word hit.
As can be seen from the above, the more relevant words a clause hits, the higher its score, and the more irrelevant words it hits, the lower its score. The clause scores, each weighted by the clause's proportion of the text, are summed to obtain the score of the whole sample. The more relevant words and the fewer irrelevant words the clauses of sample S hit, the higher the score of S and the more likely it is that S is not a noise text.
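The length-weighted scoring of step S12 can be sketched as follows. The sketch assumes that each clause is a list of words, that clause length is the number of characters in its words, that a clause's score is the plain sum of the weights of the keywords it hits, and that a repeated hit counts each time; none of these details is confirmed beyond the description above.

```python
from typing import Dict, List


def clause_score(words: List[str], weights: Dict[str, float]) -> float:
    """Sum of keyword weights over the words hit in one clause."""
    return sum(weights.get(word, 0.0) for word in words)


def text_score(clauses: List[List[str]], weights: Dict[str, float]) -> float:
    """Clause scores weighted by each clause's share of the text length."""
    lengths = [sum(len(word) for word in clause) for clause in clauses]
    total_length = sum(lengths)
    if total_length == 0:
        return 0.0
    return sum(
        (length / total_length) * clause_score(clause, weights)
        for length, clause in zip(lengths, clauses)
    )


# Positive weights for related words, negative for irrelevant ones.
weights = {"nation": 1.0, "sanctions": 0.5, "weather": -1.0}
clauses = [["nation", "sanctions"], ["weather", "fine"]]
score = text_score(clauses, weights)
```

Here the first clause has 15 characters and scores 1.5, the second has 11 characters and scores -1.0, so the text score is (15 × 1.5 − 11 × 1.0) / 26.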
S13: and judging whether the score of a valid text is smaller than a preset threshold value, if so, executing the step S14, and if not, regarding the valid text as a related sample. Wherein, the threshold value can be adjusted according to the actual situation.
When every clause hits at most 1 keyword, the score has an upper limit of 1 and a lower limit of -1. A preferred threshold is 0.5: roughly speaking, when on average more than half of the clauses hit a relevant word, the current sample is considered not to be a noise text.
S14: and judging the effective text as a noise text, and deleting the effective text. And deleting the noise text, wherein the remaining effective text is the related text, so that the problem of more noise linguistic data in the search result can be improved.
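Steps S13 and S14 then reduce to a threshold comparison. A minimal sketch, assuming each text has already been scored and using the preferred threshold of 0.5 as the default; `filter_noise` is a hypothetical name.

```python
from typing import List, Tuple


def filter_noise(
    scored_texts: List[Tuple[str, float]],
    threshold: float = 0.5,
) -> Tuple[List[str], List[str]]:
    """Split texts into (kept, noise): a score below the threshold means noise."""
    kept = [text for text, score in scored_texts if score >= threshold]
    noise = [text for text, score in scored_texts if score < threshold]
    return kept, noise


scored = [("text about nation A", 0.8), ("unrelated text", -0.3), ("borderline", 0.5)]
kept, noise = filter_noise(scored)
```

A text whose score equals the threshold is kept, since only scores strictly smaller than the threshold are judged to be noise.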
In this embodiment, data cleaning removes character-level noise and segments with little or incomplete semantics; sentence segmentation and word segmentation of the effective text facilitate the subsequent analysis of each clause and the matching of each word; acquiring the key words from the co-occurrence sentences ensures the relevance of the key words; extracting the related class high-frequency words and the irrelevant class high-frequency words from related and irrelevant samples ensures screening quality and manual controllability; combining the seed word set, the keyword set and the preset related class high-frequency word set into the related class keyword list yields a group of related words that basically covers the event; deleting the stop words prevents them from affecting the subsequent text recognition result and ensures recognition accuracy; using the proportion of occurrences in which a related or irrelevant word serves as a key syntactic component in the effective text as its keyword weight means that the weight reflects the syntactic position of the word, with some positions scoring higher and others lower or zero; finally, the score of the effective text is calculated from the number of related and irrelevant words it hits and their keyword weights, and whether the effective text is a noise text is determined from that score.
This embodiment removes from the search results the noise corpora irrelevant to the search target and retains the corpora relevant to it, thereby addressing the poor results and excessive noise corpora obtained when a database is searched by seed words. Semantically expanding the keyword list allows the coarse search data to be screened and irrelevant texts eliminated, which improves the corpus quality of the search results and facilitates data management for the data center.
Example two
The present embodiment is a computer-readable storage medium corresponding to the above embodiment, on which a computer program is stored; when executed by a processor, the program implements the following steps:
retrieving an original corpus according to a preset seed word set;
extracting effective text from the original corpus according to the format of the original corpus;
dividing the effective text into clauses and performing data cleaning on the effective text;
performing word segmentation on the effective text, and performing part-of-speech recognition and syntactic analysis on each word obtained by the word segmentation to obtain the part of speech and syntactic component of each word;
acquiring, from the clauses of the effective text, co-occurrence sentences each containing at least one seed word;
acquiring key words in the co-occurrence sentences according to preset key syntactic components and key parts of speech to obtain a keyword set;
obtaining a related class keyword list according to the seed word set, the keyword set and a preset related class high-frequency word set;
calculating, for each related word in the related class keyword list, the proportion of its occurrences in the effective text in which it serves as a key syntactic component, to obtain the keyword weight of that related word, wherein the keyword weight of each related word is a positive value;
obtaining an irrelevant class keyword list according to a preset irrelevant class high-frequency word set;
calculating, for each irrelevant word in the irrelevant class keyword list, the proportion of its occurrences in the effective text in which it serves as a key syntactic component, to obtain the keyword weight of that irrelevant word, wherein the keyword weight of each irrelevant word is a negative value;
acquiring the related words and irrelevant words in the effective text according to the related class keyword list and the irrelevant class keyword list, and calculating the score of the effective text according to the corresponding keyword weights;
and if the score of the effective text is smaller than a preset threshold, determining that the effective text is a noise text.
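The two keyword-weight steps above can be illustrated with a minimal Python sketch. It assumes one possible reading of the weight: the number of occurrences in which a word serves as a key syntactic component, divided by its total number of occurrences, with the sign flipped for irrelevant words. The function name `keyword_weights`, the `(word, syntactic_label)` input layout and the labels `SBV`, `VOB`, `HED` and `ATT` are hypothetical and are not taken from the source.

```python
from collections import Counter

def keyword_weights(clauses, keyword_list, key_components, sign):
    """Return {word: weight} for the listed words that occur in the text.

    clauses        -- list of clauses, each a list of (word, syntactic_label)
    keyword_list   -- set of related (or irrelevant) words
    key_components -- set of labels treated as key syntactic components
    sign           -- +1.0 for related words, -1.0 for irrelevant words
    """
    total = Counter()   # all occurrences of each listed word
    as_key = Counter()  # occurrences as a key syntactic component
    for clause in clauses:
        for word, label in clause:
            if word in keyword_list:
                total[word] += 1
                if label in key_components:
                    as_key[word] += 1
    # Assumed weight: signed share of occurrences in a key position.
    return {word: sign * as_key[word] / total[word] for word in total}

# Hypothetical parsed text: two clauses of (word, label) pairs.
parsed = [
    [("地震", "SBV"), ("发生", "HED")],
    [("地震", "ATT"), ("救援", "VOB")],
]
related = keyword_weights(parsed, {"地震", "救援"}, {"SBV", "VOB", "HED"}, 1.0)
irrelevant = keyword_weights([[("广告", "SBV")]], {"广告"}, {"SBV"}, -1.0)
```

Under this reading a word that never appears in a key position receives a weight of zero, so a real implementation may need a small floor value if a strictly positive or negative weight is required.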
Further, after determining that the effective text is a noise text when its score is smaller than the preset threshold, the method further includes:
deleting the noise text.
Further, dividing the effective text into clauses and performing data cleaning on the effective text specifically includes:
dividing the effective text into clauses according to preset sentence break symbols;
filtering characters in the effective text according to a preset character blacklist, wherein the character blacklist comprises English symbols, English letters, and Chinese symbols other than the sentence break symbols;
and filtering the clauses in the effective text according to a preset segment length.
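The clause division and cleaning steps above might look like the following Python sketch. The concrete sentence break symbols, the list of Chinese symbols, the minimum segment length of 5 characters and the function name `split_and_clean` are all assumed values chosen for illustration; the source only states that these are preset. Whitespace is also stripped here, which is an added assumption.

```python
import re
import string

# Assumed presets (illustrative values, not from the source).
SENTENCE_BREAKS = "。！？；"
CHINESE_SYMBOLS = "，、：“”‘’（）《》【】…"
MIN_SEGMENT_LENGTH = 5

# Blacklist: English letters, English symbols, Chinese symbols other than
# the sentence break symbols, plus whitespace.
BLACKLIST_RE = re.compile(
    "[" + re.escape(string.ascii_letters + string.punctuation + CHINESE_SYMBOLS) + r"\s]"
)
BREAK_RE = re.compile("[" + re.escape(SENTENCE_BREAKS) + "]")

def split_and_clean(text, min_length=MIN_SEGMENT_LENGTH):
    """Split text into clauses, drop blacklisted characters and short clauses."""
    clauses = []
    for raw in BREAK_RE.split(text):
        cleaned = BLACKLIST_RE.sub("", raw)
        if len(cleaned) >= min_length:
            clauses.append(cleaned)
    return clauses

sample = "今天天气很好，我们去公园散步吧。OK！这是一条关于天气的新闻abc。"
clauses = split_and_clean(sample)
```

In this example the fragment "OK" becomes empty after character filtering and is discarded by the length filter, which is the kind of segment with little semantics that the cleaning step is meant to remove.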
Further, acquiring the key words in the co-occurrence sentences according to the preset key syntactic components and key parts of speech specifically includes:
if the part of speech of a word in a co-occurrence sentence belongs to a preset key part of speech and the syntactic component of the word belongs to a preset key syntactic component, taking the word as a key word;
and collecting the key words of the co-occurrence sentences to obtain the keyword set.
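As a rough illustration of the co-occurrence and key-word selection steps, the sketch below assumes that a segmenter and parser have already produced `(word, part_of_speech, syntactic_label)` triples for each clause. The tag values (`n`, `v`, `p`, `SBV`, `VOB`, `HED`, `ADV`) and the function name `extract_keywords` are hypothetical placeholders for whatever presets and tools are actually used.

```python
# Assumed presets (illustrative tag names, not from the source).
KEY_POS = {"n", "v"}
KEY_COMPONENTS = {"SBV", "VOB", "HED"}

def extract_keywords(clauses, seed_words,
                     key_pos=KEY_POS, key_components=KEY_COMPONENTS):
    """Collect key words from clauses that contain at least one seed word.

    clauses -- list of clauses, each a list of (word, pos, label) triples
    """
    keywords = set()
    for clause in clauses:
        words_in_clause = {word for word, _, _ in clause}
        # A co-occurrence sentence contains at least one seed word.
        if not words_in_clause & set(seed_words):
            continue
        for word, pos, label in clause:
            # Both the part of speech and the syntactic component must match.
            if pos in key_pos and label in key_components:
                keywords.add(word)
    return keywords

parsed = [
    [("地震", "n", "SBV"), ("发生", "v", "HED"), ("在", "p", "ADV")],
    [("股票", "n", "SBV"), ("上涨", "v", "HED")],
]
keyword_set = extract_keywords(parsed, {"地震"})
```

The second clause contains no seed word, so its words are ignored even though they carry key tags.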
Further, before obtaining the related class keyword list according to the seed word set, the keyword set and the preset related class high-frequency word set, the method further includes:
acquiring a preset first sample and a preset second sample, wherein the first sample is a sample related to the expected search content, and the second sample is a sample unrelated to the expected search content;
performing word segmentation on the first sample and the second sample respectively to obtain a second word set and a third word set;
and obtaining the words with the highest word frequency in the second word set and in the third word set, respectively, to obtain the related class high-frequency word set and the irrelevant class high-frequency word set.
Further, after the word segmentation is performed on the first sample and the second sample respectively to obtain a second word set and a third word set, the method further includes:
and deleting, according to a preset stop word list, the stop words in the second word set and the third word set, or the stop words in the related class high-frequency word set and the irrelevant class high-frequency word set.
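The high-frequency word extraction with stop word removal can be sketched in Python with `collections.Counter`. The cutoff `top_n` and the function name `high_frequency_words` are assumptions; the source does not state how many of the most frequent words are kept.

```python
from collections import Counter

def high_frequency_words(tokens, stop_words, top_n):
    """Return the top_n most frequent tokens after removing stop words."""
    counts = Counter(token for token in tokens if token not in stop_words)
    return [word for word, _ in counts.most_common(top_n)]

# Hypothetical segmented sample related to the expected search content.
first_sample_tokens = ["的", "地震", "救援", "的", "地震", "伤亡", "地震", "救援"]
related_high_freq = high_frequency_words(first_sample_tokens, {"的"}, top_n=2)
```

The same function would be applied to the segmented second sample to obtain the irrelevant class high-frequency word set.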
Further, after obtaining the related class keyword list according to the seed word set, the keyword set and the preset related class high-frequency word set, the method further includes:
and deleting the stop words in the related class keyword list according to a preset stop word list.
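A minimal sketch of building the related class keyword list, assuming that "obtaining ... according to" the three sets means taking their union and then removing stop words. The function name `build_related_keyword_list` is hypothetical.

```python
def build_related_keyword_list(seed_words, keyword_set, related_high_freq, stop_words):
    """Merge the three word sources and drop stop words (assumed to be a union)."""
    merged = set(seed_words) | set(keyword_set) | set(related_high_freq)
    return merged - set(stop_words)

related_list = build_related_keyword_list(
    seed_words={"地震"},
    keyword_set={"发生", "的"},
    related_high_freq={"救援", "地震"},
    stop_words={"的"},
)
```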
Further, acquiring the related words and irrelevant words in the effective text according to the related class keyword list and the irrelevant class keyword list, and calculating the score of the effective text according to the corresponding keyword weights, specifically includes:
acquiring the related words and irrelevant words in each clause of the effective text according to the related class keyword list and the irrelevant class keyword list, respectively;
calculating the score of each clause according to the keyword weights of the related words and of the irrelevant words in that clause;
and calculating the score of the effective text according to the proportion of each clause in the effective text and the score of each clause.
Further, before calculating the score of the effective text according to the proportion of each clause in the effective text and the score of each clause, the method further includes:
counting the number of characters in each clause to obtain the length of each clause;
counting the total number of characters in the effective text to obtain the length of the effective text;
and calculating the proportion of the length of each clause to the length of the effective text to obtain the proportion of each clause in the effective text.
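The clause-weighted scoring described above can be sketched as follows. This is an assumption-laden illustration: the clause score is taken to be the sum of the keyword weights of every hit word in the clause, and the text score is the sum of clause scores weighted by each clause's share of the total character count. The function names `text_score` and `is_noise`, and the default threshold of 0.5 (the preferred value mentioned in the description), are choices made for this example.

```python
def text_score(clauses, weights):
    """Length-weighted sum of clause scores.

    clauses -- list of clauses, each a list of segmented words
    weights -- {word: keyword weight}, positive for related words and
               negative for irrelevant words
    """
    lengths = [sum(len(word) for word in clause) for clause in clauses]
    total_length = sum(lengths)
    if total_length == 0:
        return 0.0
    score = 0.0
    for clause, length in zip(clauses, lengths):
        # Assumed clause score: sum of the weights of all hit words.
        clause_score = sum(weights.get(word, 0.0) for word in clause)
        score += (length / total_length) * clause_score
    return score

def is_noise(score, threshold=0.5):
    """A text whose score is below the threshold is treated as noise."""
    return score < threshold

segmented = [["地震", "发生"], ["广告", "推广"]]
weights = {"地震": 1.0, "广告": -0.5}
score = text_score(segmented, weights)
noise = is_noise(score)
```

Here both clauses are four characters long, so each contributes half of its clause score: 0.5 × 1.0 + 0.5 × (−0.5) = 0.25, which falls below 0.5 and marks the text as noise. Because the clause proportions sum to 1, a text in which every clause hits at most one keyword with a weight between -1 and 1 stays within the score bounds of -1 and 1.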
In summary, the method for screening out noise documents and the computer-readable storage medium provided by the present invention remove from the search results the noise corpora irrelevant to the search target and retain the corpora relevant to it, thereby addressing the poor results and excessive noise corpora obtained when a database is searched by seed words. Semantically expanding the keyword list allows the coarse search data to be screened and irrelevant texts eliminated, which improves the corpus quality of the search results and facilitates data management for the data center.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to the related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method of screening noisy documents, comprising:
retrieving an original corpus according to a preset seed word set;
extracting effective text from the original corpus according to the format of the original corpus;
dividing the effective text into clauses and performing data cleaning on the effective text;
performing word segmentation on the effective text, and performing part-of-speech recognition and syntactic analysis on each word obtained by the word segmentation to obtain the part of speech and syntactic component of each word;
acquiring, from the clauses of the effective text, co-occurrence sentences each containing at least one seed word;
acquiring key words in the co-occurrence sentences according to preset key syntactic components and key parts of speech to obtain a keyword set;
obtaining a related class keyword list according to the seed word set, the keyword set and a preset related class high-frequency word set;
calculating, for each related word in the related class keyword list, the proportion of its occurrences in the effective text in which it serves as a key syntactic component, to obtain the keyword weight of that related word, wherein the keyword weight of each related word is a positive value;
obtaining an irrelevant class keyword list according to a preset irrelevant class high-frequency word set;
calculating, for each irrelevant word in the irrelevant class keyword list, the proportion of its occurrences in the effective text in which it serves as a key syntactic component, to obtain the keyword weight of that irrelevant word, wherein the keyword weight of each irrelevant word is a negative value;
acquiring the related words and irrelevant words in the effective text according to the related class keyword list and the irrelevant class keyword list, and calculating the score of the effective text according to the corresponding keyword weights;
and if the score of the effective text is smaller than a preset threshold, determining that the effective text is a noise text.
2. The method for screening out noise documents according to claim 1, wherein after determining that the effective text is a noise text when the score of the effective text is smaller than the preset threshold, the method further comprises:
deleting the noise text.
3. The method for screening out noise documents according to claim 1, wherein dividing the effective text into clauses and performing data cleaning on the effective text specifically comprises:
dividing the effective text into clauses according to preset sentence break symbols;
filtering characters in the effective text according to a preset character blacklist, wherein the character blacklist comprises English symbols, English letters, and Chinese symbols other than the sentence break symbols;
and filtering the clauses in the effective text according to a preset segment length.
4. The method for screening out noise documents according to claim 1, wherein acquiring key words in the co-occurrence sentences according to preset key syntactic components and key parts of speech to obtain a keyword set specifically comprises:
if the part of speech of a word in a co-occurrence sentence belongs to a preset key part of speech and the syntactic component of the word belongs to a preset key syntactic component, taking the word as a key word;
and collecting the key words of the co-occurrence sentences to obtain the keyword set.
5. The method for screening out noise documents according to claim 1, wherein before obtaining the related class keyword list according to the seed word set, the keyword set and the preset related class high-frequency word set, the method further comprises:
acquiring a preset first sample and a preset second sample, wherein the first sample is a sample related to the expected search content, and the second sample is a sample unrelated to the expected search content;
performing word segmentation on the first sample and the second sample respectively to obtain a second word set and a third word set;
and obtaining the words with the highest word frequency in the second word set and in the third word set, respectively, to obtain the related class high-frequency word set and the irrelevant class high-frequency word set.
6. The method for screening out noise documents according to claim 5, wherein after performing word segmentation on the first sample and the second sample respectively to obtain the second word set and the third word set, the method further comprises:
deleting, according to a preset stop word list, the stop words in the second word set and the third word set, or the stop words in the related class high-frequency word set and the irrelevant class high-frequency word set.
7. The method for screening out noise documents according to claim 1, wherein after obtaining the related class keyword list according to the seed word set, the keyword set and the preset related class high-frequency word set, the method further comprises:
deleting the stop words in the related class keyword list according to a preset stop word list.
8. The method for screening out noise documents according to claim 1, wherein acquiring the related words and irrelevant words in the effective text according to the related class keyword list and the irrelevant class keyword list, and calculating the score of the effective text according to the corresponding keyword weights, specifically comprises:
acquiring the related words and irrelevant words in each clause of the effective text according to the related class keyword list and the irrelevant class keyword list, respectively;
calculating the score of each clause according to the keyword weights of the related words and of the irrelevant words in that clause;
and calculating the score of the effective text according to the proportion of each clause in the effective text and the score of each clause.
9. The method for screening out noise documents according to claim 8, wherein before calculating the score of the effective text according to the proportion of each clause in the effective text and the score of each clause, the method further comprises:
counting the number of characters in each clause to obtain the length of each clause;
counting the total number of characters in the effective text to obtain the length of the effective text;
and calculating the proportion of the length of each clause to the length of the effective text to obtain the proportion of each clause in the effective text.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of any one of claims 1-9.
CN201911398056.6A 2019-12-30 2019-12-30 Method for screening out noise document and computer readable storage medium Active CN111209737B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911398056.6A CN111209737B (en) 2019-12-30 2019-12-30 Method for screening out noise document and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111209737A CN111209737A (en) 2020-05-29
CN111209737B true CN111209737B (en) 2022-09-13

Family

ID=70787737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911398056.6A Active CN111209737B (en) 2019-12-30 2019-12-30 Method for screening out noise document and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111209737B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380868B (en) * 2020-12-10 2024-02-13 广东泰迪智能科技股份有限公司 Multi-classification device and method for interview destination based on event triplets

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968432A (en) * 2012-09-19 2013-03-13 华东师范大学 Control method for verifying tuple on basis of degree of confidence
WO2017101342A1 (en) * 2015-12-15 2017-06-22 乐视控股(北京)有限公司 Sentiment classification method and apparatus
CN106970991A (en) * 2017-03-31 2017-07-21 北京奇虎科技有限公司 Recognition methods, device and the application searches of similar application recommend method, server
CN107305768A (en) * 2016-04-20 2017-10-31 上海交通大学 Easy wrongly written character calibration method in interactive voice
CN108959418A (en) * 2018-06-06 2018-12-07 中国人民解放军国防科技大学 Character relation extraction method and device, computer device and computer readable storage medium
CN109062895A (en) * 2018-07-23 2018-12-21 挖财网络技术有限公司 A kind of intelligent semantic processing method

Also Published As

Publication number Publication date
CN111209737A (en) 2020-05-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant