CN111209737B

CN111209737B - Method for screening out noise document and computer readable storage medium

Info

Publication number: CN111209737B
Application number: CN201911398056.6A
Authority: CN
Inventors: 王子玥; 章正道; 栾江霞; 徐晓文
Original assignee: Xiamen Meiya Pico Information Co Ltd
Current assignee: Xiamen Meiya Pico Information Co Ltd
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2022-09-13
Anticipated expiration: 2039-12-30
Also published as: CN111209737A

Abstract

The invention discloses a screening method of noise documents and a computer readable storage medium, wherein the method comprises the following steps: according to the seed word set, retrieving to obtain an original corpus; extracting effective texts from the original corpus; dividing sentences of the effective texts, and performing data cleaning; obtaining key words in the co-occurrence sentences to obtain a keyword set; obtaining a related class keyword list according to the seed word set, the keyword set and a preset related class high-frequency word set; respectively calculating the appearance proportion of each relevant word in the relevant class keyword table in the effective text as a key syntactic component to obtain the keyword weight of each relevant word; respectively calculating the weight of the key words of each irrelevant word; acquiring related words and irrelevant words in the effective text, and calculating the score of the effective text according to the corresponding weight of the keywords; and if the score is smaller than a preset threshold value, judging the text to be the noise text. The invention can eliminate irrelevant texts and improve the corpus quality of the search results.

Description

Method for screening out noise document and computer readable storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method for screening out noise documents and a computer-readable storage medium.

Background

With the rapid growth of network data, data search has become a key way for people to extract the information they need from massive data sets. By setting search conditions and key fields effectively, users can obtain news, comments and other knowledge about the objects and events they are interested in. Meanwhile, building a closed-loop big data center generally also requires a data search service that runs separately from the Internet environment. Accurate data search based on semantic information helps people obtain the required information in a local environment, which satisfies the closed-loop data requirement, provides high-quality search results, and makes data governance more convenient.

In the prior art, the optimization of database retrieval results follows two main directions. The first is online optimization, which is mainly based on the topological structure of web page links, as in PageRank. The second is the optimization of offline search results, which usually relies on machine learning with labeled training data: the data are divided into related samples and noise samples, and a classifier such as a support vector machine or a Bayesian method is trained on them. However, online algorithms depend on the links between contents and on the browsing tracks of Internet users, which are features that do not exist or cannot be obtained in an offline database. Classification by trained machine learning models mainly suffers from high labor cost and poor generalization. Requiring staff or the searcher to label data before searching reduces how often the database is searched and therefore lowers man-machine efficiency.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: a method for screening out noise documents and a computer readable storage medium are provided, which can effectively remove noise corpora irrelevant to a target in a search result and retain corpora relevant to the search target.

In order to solve the technical problems, the invention adopts the technical scheme that: a method of screening out noisy documents, comprising:

retrieving according to a preset seed word set to obtain an original corpus;

extracting effective texts from the original corpus according to the format of the original corpus;

the effective texts are divided into sentences, and data cleaning is carried out on the effective texts;

performing word segmentation on the effective text, and performing part-of-speech recognition and syntactic analysis on each word obtained by word segmentation to obtain part-of-speech and syntactic components of each word;

acquiring a co-occurrence sentence containing at least one seed word from each clause of the effective text;

acquiring key words in the co-occurrence sentences according to preset key syntax components and key parts of speech to obtain a key word set;

obtaining a related class keyword list according to the seed word set, the keyword set and a preset related class high-frequency word set;

respectively calculating the appearance proportion of each relevant word in the relevant class keyword table as a key syntactic component in the effective text to obtain the keyword weight of each relevant word, wherein the keyword weight of each relevant word is a positive value;

obtaining an irrelevant keyword list according to a preset irrelevant high-frequency word set;

respectively calculating the appearance proportion of each irrelevant word in the irrelevant key word list as a key syntactic component in the effective text to obtain the keyword weight of each irrelevant word, wherein the keyword weight of each irrelevant word is a negative value;

acquiring related words and irrelevant words in the effective text according to the related keyword table and the irrelevant keyword table, and calculating the score of the effective text according to the corresponding keyword weight;

and if the score of the effective text is smaller than a preset threshold value, judging that the effective text is a noise text.

The invention also proposes a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps as described above.

The invention has the beneficial effects that: removing character-level noise information and language segments with less semantics or incomplete semantics by cleaning data; by carrying out sentence segmentation and word segmentation on the effective text, the subsequent analysis of each sentence and the matching of each word are facilitated; the relevance of key words is ensured by acquiring the key words from the co-occurrence sentences; the method comprises the steps that a seed word set, a keyword set and a preset related high-frequency word set are combined to obtain a related keyword list, and a group of related words of a basic coverage event can be formed; obtaining the keyword weights of the related words and the irrelevant words by calculating the appearance proportion of the related words and the irrelevant words in the effective text as key syntactic components, so that the weights are evaluated by scoring different positions, wherein part of the positions are higher, and the other positions are lower or zero; and calculating the score of the effective text according to the number of relevant words and irrelevant words hit by the effective text and the keyword weight of the relevant words and the irrelevant words, and finally judging whether the effective text is the noise text according to the score.

The method addresses the poor search results and the large amount of noise corpora that arise when a database is searched by seed words. Because the keyword list is semantically expanded, the coarse search data can be screened and irrelevant texts removed, which improves the corpus quality of the search results and makes it easier for a data center to manage its data.

Drawings

FIG. 1 is a flow chart of a method for screening out noise documents according to the present invention;

FIG. 2 is a flowchart of the method according to the first embodiment of the invention.

Detailed Description

In order to explain technical contents, objects and effects of the present invention in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.

The most key concept of the invention is as follows: extracting more keywords based on seed words by combining two-dimensional semantic features of part of speech information and syntax information and feature weights; extracting related high-frequency words and unrelated high-frequency words according to the related samples and the unrelated samples; taking the appearance proportion of the relevant words and the irrelevant words as key syntactic components as the corresponding keyword weight; and calculating text scores according to the relevant words and irrelevant words hit by the text and the weight of the keywords of the relevant words and irrelevant words, and judging the text type according to the scores.

Referring to fig. 1, a method for screening out a noise document includes:

retrieving according to a preset seed word set to obtain an original corpus;

respectively calculating the appearance proportion of each irrelevant word in the irrelevant keyword list as a key syntactic component in the effective text to obtain the keyword weight of each irrelevant word, wherein the keyword weight of each irrelevant word is a negative value;

acquiring related words and irrelevant words in the effective text according to the related key word table and the irrelevant key word table, and calculating the score of the effective text according to the corresponding weight of the key words;

From the above description, the beneficial effects of the present invention are: the method can be used for solving the problems of poor search results and more noise corpora in the database search according to the seed words.

Further, if the score of the valid text is smaller than a preset threshold, after determining that the valid text is a noise text, the method further includes:

and deleting the noise text.

From the above description, by deleting the noise text in the search result, the accuracy of the search result is improved.

Further, the sentence dividing and data cleaning of the effective text specifically includes:

according to a preset sentence break symbol, the effective text is divided into sentences;

filtering characters in the effective text according to a preset character blacklist, wherein the character blacklist comprises English symbols, English letters and Chinese symbols except for sentence break symbols;

and filtering the clauses in the effective text according to the preset length of the language segment.

From the above description, it can be known that the noise information at the character level in the effective text can be filtered, and meanwhile, the speech segments with less semantics or missing semantics can be filtered.

Further, the obtaining of the key words in the co-occurrence sentence according to preset key syntax components and key parts of speech specifically includes:

if the part of speech of a word in the co-occurrence sentence belongs to a preset key part of speech and the syntactic component of the word belongs to a preset key syntactic component, taking the word as a key word;

and obtaining key words in the co-occurrence sentences to obtain a keyword set.

Further, before obtaining the related class keyword list according to the seed word set, the keyword set and the preset related class high-frequency word set, the method further includes:

acquiring a preset first sample and a preset second sample, wherein the first sample is a sample related to the expected search content, and the second sample is a sample unrelated to the expected search content;

performing word segmentation on the first sample and the second sample respectively to obtain a second word set and a third word set;

and respectively obtaining the words with the highest word frequency in the second word set and the third word set and obtaining a related high-frequency word set and an unrelated high-frequency word set.

According to the description, the relevant high-frequency words and the irrelevant high-frequency words are extracted from the relevant samples and the irrelevant samples, so that the screening quality and manual controllability can be ensured.

Further, after the performing word segmentation on the first sample and the second sample to obtain a second word set and a third word set, the method further includes:

and deleting stop words in the second word set and the third word set or deleting stop words in the related high-frequency word set and the unrelated high-frequency word set according to a preset stop word list.

Further, after obtaining the related class keyword list according to the seed word set, the keyword set and the preset related class high-frequency word set, the method further includes:

and deleting the stop words in the related keyword list according to a preset stop word list.

According to the description, the stop words are deleted, so that the influence of the stop words on the recognition result of the subsequent text can be avoided, and the recognition accuracy is ensured.

Further, the obtaining of the related terms and the unrelated terms in the effective text according to the related keyword table and the unrelated keyword table, and calculating the score of the effective text according to the corresponding keyword weight specifically include:

respectively acquiring related words and irrelevant words in each clause of an effective text according to the related keyword list and the irrelevant keyword list;

calculating the score of each clause according to the weight of the keywords of the related words and the weight of the keywords of the unrelated words in each clause;

and calculating the score of the effective text according to the proportion of each clause in the effective text and the score of each clause.

As can be seen from the above description, the more related words contained in the valid text, the higher the score, and the more unrelated words contained, the lower the score.

Further, before calculating the score of the valid text according to the specific gravity of each clause in the valid text and the score of each clause, the method further includes:

respectively counting the number of characters of each clause to obtain the length of each clause;

counting the total number of words of the effective text to obtain the length of the effective text;

and respectively calculating the proportion of the length of each clause in the length of the effective text to obtain the proportion of each clause in the effective text.

As can be seen from the above description, the specific gravity is the ratio of the length of a single clause to the total length of the text.

Example one

Referring to FIG. 2, the first embodiment of the present invention is a method for screening out noise documents. It can be used to screen out noise documents in Internet information and is suitable for providing search optimization in a computer database system. The method comprises the following steps:

S1: Retrieve the original corpus from a database according to a preset seed word set. Specifically, the seed words are used as search terms, and the corpus entries that match successfully are obtained from a local database.

The original corpus is obtained from the database by logical matching of seed words: a group of seed words is connected by the logical operators 'AND' and 'OR' to select texts. For example, 'nation A OR city C' selects the texts in which 'nation A' or 'city C' appears.
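The logical matching described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes plain substring matching over an in-memory list of texts, and the function names `matches_any`, `matches_all` and `retrieve` are hypothetical.

```python
from typing import Iterable, List


def matches_any(text: str, seed_words: Iterable[str]) -> bool:
    """OR logic: true if at least one seed word occurs in the text."""
    return any(word in text for word in seed_words)


def matches_all(text: str, seed_words: Iterable[str]) -> bool:
    """AND logic: true only if every seed word occurs in the text."""
    return all(word in text for word in seed_words)


def retrieve(corpus: List[str], seed_words: List[str], mode: str = "or") -> List[str]:
    """Return the texts of the corpus that satisfy the chosen logic."""
    matcher = matches_any if mode == "or" else matches_all
    return [text for text in corpus if matcher(text, seed_words)]


corpus = [
    "nation A announced a new policy.",
    "city C hosted talks with nation A.",
    "Weather report for today.",
]
or_hits = retrieve(corpus, ["nation A", "city C"], mode="or")
and_hits = retrieve(corpus, ["nation A", "city C"], mode="and")
```

With 'OR', the first two texts are selected; with 'AND', only the text that mentions both seed words is selected.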

This embodiment takes Internet information as an example; the corpus obtained consists mainly of news, online reports and related comments. Because the search runs on a local database, the texts are obtained purely by seed word matching and are independent of the pages, URLs and so on where the original data were located.

S2: Extract effective texts from the original corpus according to the format of the original corpus. That is, the original corpus is sorted, and for each type of text the content is extracted and stored according to its format.

Because news reports, social comments, netizen posts and the like all have different text formats, fine-grained effective information fields, such as the news title, the body text, the speech posted by a netizen and the forwarded part, are extracted according to the respective format. This embodiment mainly addresses the formats of long texts, such as news and Internet articles, and of microblog short texts. News titles are collected by a crawler configured for each news website, since page layouts differ from site to site. For microblog short texts the main task is format sorting: as on other social platforms, microblog posts carry strong format information in which different symbols have different meanings, so the posts are matched against the various formats to extract the main sentence information of the text.

Specifically, the effective text extracted in this embodiment includes a body title, body content, a forwarding title and forwarding content. First, the effective texts other than the body content are extracted from the original corpus according to preset regular expressions. For example, the topic content is extracted according to "#? # ", the title of a forwarded article according to "[ words? ", and the forwarding account according to "/@? ". The body content is then obtained by deleting these extracted parts from the original corpus.
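A minimal sketch of this kind of format-based extraction is given below. The three patterns are assumptions chosen for illustration (a topic wrapped in `#...#`, a forwarded title wrapped in `【...】`, and a forwarding account introduced by `//@` and closed by a colon); they are not the expressions of the embodiment, and `extract_fields` is a hypothetical helper.

```python
import re

# Assumed microblog conventions, for illustration only.
TOPIC = re.compile(r"#(.+?)#")
FORWARD_TITLE = re.compile(r"【(.+?)】")
FORWARD_ACCOUNT = re.compile(r"//@(.+?)[:：]")


def extract_fields(post: str) -> dict:
    """Extract the formatted fields, then treat what remains as body content."""
    fields = {
        "topics": TOPIC.findall(post),
        "forward_titles": FORWARD_TITLE.findall(post),
        "forward_accounts": FORWARD_ACCOUNT.findall(post),
    }
    body = post
    for pattern in (TOPIC, FORWARD_TITLE, FORWARD_ACCOUNT):
        body = pattern.sub(" ", body)
    # Collapse the whitespace left behind by the removed fields.
    fields["body"] = " ".join(body.split())
    return fields


post = "#Summit# 【City C hosts talks】 Delegates arrived today. //@user_a: Worth reading."
fields = extract_fields(post)
```

Non-greedy quantifiers are used so that each pattern stops at the nearest closing symbol rather than spanning several fields.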

S3: Divide the effective text into sentences and perform data cleaning on it. In this embodiment, data cleaning mainly removes punctuation and removes language segments that carry little semantic content.

Firstly, sentence boundary definition is carried out on the effective text according to preset punctuation marks, wherein the punctuation marks comprise commas, semicolons, periods, exclamation marks, question marks and long spaces, and each clause of the effective text is obtained.

Because the effective text may contain character-level noise, such as non-Chinese characters, unrecognizable characters and redundant punctuation, its characters need to be filtered. Specifically, the characters that belong to a preset character blacklist are removed from the effective text. The blacklist includes English symbols (the symbols after 0x21 in ASCII), English letters (the uppercase and lowercase letters in ASCII) and Chinese symbols other than the sentence break symbols, and may also include some unrecognizable characters. Removing these preset characters filters out the character-level noise in the effective text.

The effective text also contains segments that carry too little semantic information to form a sentence, for example incomplete information forwarded from other users, or segments of one or two words that merely express emotion; these also need to be filtered. Specifically, the number of words in each clause of the effective text is counted as the length of that clause, and clauses whose length is smaller than a preset segment length (e.g., 4) are deleted. Segments in preset formats are further deleted from the effective text, such as the salutation and sign-off of an email or the forwarding account name in a microblog text.
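The sentence division and cleaning of step S3 might look roughly like the sketch below. Several choices here are assumptions: both full-width and ASCII punctuation are treated as sentence break symbols, two or more whitespace characters stand for a "long space", the Chinese-symbol part of the blacklist is a small illustrative subset, and the clause length is measured in characters. The sample text is Chinese because the blacklist removes English letters.

```python
import re

# Assumed sentence break symbols: commas, semicolons, periods,
# exclamation marks, question marks and long spaces.
SENTENCE_BREAKS = re.compile(r"[，；。！？,;.!?]|\s{2,}")

# Assumed character blacklist: ASCII letters, ASCII punctuation,
# and a few Chinese symbols that are not sentence breaks.
BLACKLIST = re.compile(r"[A-Za-z!-/:-@\[-`{-~《》“”‘’（）、：]")

MIN_CLAUSE_LENGTH = 4  # preset segment length from the embodiment


def split_and_clean(text: str, min_length: int = MIN_CLAUSE_LENGTH) -> list:
    """Split into clauses, strip blacklisted characters, drop short clauses."""
    clauses = []
    for clause in SENTENCE_BREAKS.split(text):
        cleaned = BLACKLIST.sub("", clause).strip()
        if len(cleaned) >= min_length:
            clauses.append(cleaned)
    return clauses


text = "甲国总统今日访问丙市，好的！《新闻》报道称双方签署了协议。  哈哈ok"
clauses = split_and_clean(text)
```

In this example the two-character reaction and the trailing fragment are dropped, and the book-title marks are stripped from the remaining clause.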

S4: Perform word segmentation on the effective text, and perform part-of-speech recognition and syntactic analysis on each word obtained, to get the part of speech and the syntactic component of each word. Further, the results of part-of-speech recognition and syntactic analysis are stored as features together with the corresponding words of the effective text. In other words, word segmentation, part-of-speech recognition and syntactic analysis are performed on the sorted corpus data, and the semantic information, meaning the results of these three analyses, is stored with the effective text in a standardized form.

S5: Acquire the co-occurrence sentences, each containing at least one seed word, from the clauses of the effective text; that is, a clause of the effective text that contains at least one seed word is a co-occurrence sentence.

For example, assuming that the seed words include "nation A" and "city C", if "nation A" or "city C" appears in a sentence of the effective text, or both appear together, the sentence is considered a co-occurrence sentence.
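As a small illustration of step S5, the following sketch selects co-occurrence sentences by substring matching. The helper name `cooccurrence_sentences` is hypothetical, and matching on raw strings rather than on segmented words is an assumption made to keep the example short.

```python
from typing import List


def cooccurrence_sentences(clauses: List[str], seed_words: List[str]) -> List[str]:
    """Keep the clauses that contain at least one seed word."""
    return [c for c in clauses if any(seed in c for seed in seed_words)]


clauses = [
    "nation A signed the deal",
    "the weather was mild",
    "city C welcomed nation A",
]
hits = cooccurrence_sentences(clauses, ["nation A", "city C"])
```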

S6: Acquire the key words in the co-occurrence sentences according to preset key syntactic components and key parts of speech to obtain a keyword set. That is, in each co-occurrence sentence, the words whose part of speech belongs to the preset key parts of speech and whose syntactic component belongs to the preset key syntactic components are obtained and added to the keyword set.

The syntactic components comprise seven types: subject, predicate, object, attributive, adverbial, complement and core verb. The key syntactic components in this embodiment are the subject, the predicate, the object and the core verb. The parts of speech follow the national standard 863 part-of-speech tagging standard, and the key parts of speech in this embodiment include adjectives, morphemes, idioms, abbreviations, suffixes, numbers, general nouns, directional nouns, human names, organization names, place names, time nouns, other nouns, vocabularies and verbs.
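The two-condition filter of step S6 can be sketched as below. The sketch assumes that segmentation, part-of-speech recognition and syntactic analysis have already been done by some external tool; the `Token` structure and the label strings (for example `"core_verb"` or `"place_name"`) are hypothetical and do not correspond to any particular tagger's output, and `KEY_POS` is only a subset of the key parts of speech listed above.

```python
from typing import List, NamedTuple, Set


class Token(NamedTuple):
    word: str
    pos: str   # part of speech
    role: str  # syntactic component


KEY_ROLES = {"subject", "predicate", "object", "core_verb"}
KEY_POS = {"adjective", "noun", "verb", "place_name", "person_name", "organization_name"}


def extract_keywords(sentences: List[List[Token]]) -> Set[str]:
    """Collect words whose part of speech AND syntactic component are both key."""
    return {
        token.word
        for sentence in sentences
        for token in sentence
        if token.pos in KEY_POS and token.role in KEY_ROLES
    }


sentence = [
    Token("nation A", "place_name", "subject"),
    Token("recently", "adverb", "adverbial"),
    Token("signed", "verb", "core_verb"),
    Token("trade", "noun", "attributive"),
    Token("agreement", "noun", "object"),
]
keywords = extract_keywords([sentence])
```

Here "trade" is rejected although it is a noun, because it acts as an attributive, and "recently" is rejected on both conditions.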

S7: Acquire a related high-frequency word set and an irrelevant high-frequency word set.

Specifically, a preset first sample and a preset second sample are obtained, where the first sample is related to the expected search content and the second sample is unrelated to it. Word segmentation is then performed on the first sample and the second sample respectively to obtain a second word set and a third word set. Finally, a preset number (e.g., 10) of the words with the highest word frequency are taken from the second word set and from the third word set to obtain the related high-frequency word set and the irrelevant high-frequency word set, respectively.

The first sample and the second sample can be randomly obtained from an original corpus, and then are manually marked, and the marked content is whether related to the expected search content or not; the already labeled sample may also be obtained additionally.

And further, after the second word set and the third word set are obtained, or after the related class high-frequency word set and the unrelated class high-frequency word set are obtained, deleting the stop words in the related class high-frequency word set and the unrelated class high-frequency word set according to a preset stop word list.

By acquiring the related high-frequency word set and the irrelevant high-frequency word set from the related samples and the irrelevant samples, the subsequent screening quality can be ensured. By deleting stop words such as auxiliary words, prepositions and the like, the influence of the stop words on the recognition result of the subsequent text can be avoided.
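The extraction of high-frequency words in step S7 can be illustrated with a frequency counter. This sketch assumes the samples are already segmented into word lists and removes stop words before counting; the helper name `top_words` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def top_words(segmented_samples: Iterable[List[str]], stop_words: Set[str], top_n: int = 10) -> List[str]:
    """Return the top_n most frequent words across the samples, ignoring stop words."""
    counts = Counter(
        word
        for sample in segmented_samples
        for word in sample
        if word not in stop_words
    )
    return [word for word, _ in counts.most_common(top_n)]


related_samples = [
    ["summit", "of", "nation A", "summit"],
    ["nation A", "talks", "summit"],
]
related_high_freq = top_words(related_samples, stop_words={"of", "the"}, top_n=2)
```

The same function would be applied to the unrelated samples to obtain the irrelevant high-frequency word set.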

S8: Obtain a related class keyword list according to the seed word set, the keyword set and a preset related class high-frequency word set; that is, the words in the three sets are merged to obtain the related class keyword list.

And further, deleting the stop words in the related keyword list according to a preset stop word list.
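Step S8 and the stop word removal amount to a set union followed by a set difference, as in this short sketch (the function name `build_related_table` is hypothetical):

```python
from typing import Iterable, Set


def build_related_table(
    seed_words: Iterable[str],
    keyword_set: Iterable[str],
    related_high_freq: Iterable[str],
    stop_words: Iterable[str],
) -> Set[str]:
    """Merge the three word sets and remove stop words."""
    merged = set(seed_words) | set(keyword_set) | set(related_high_freq)
    return merged - set(stop_words)


related_table = build_related_table(
    seed_words=["nation A", "city C"],
    keyword_set={"signed", "agreement", "the"},
    related_high_freq=["summit", "nation A"],
    stop_words={"the"},
)
```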

S9: Calculate, for each related word in the related class keyword list, the proportion of its occurrences in the effective text in which it serves as a key syntactic component, to obtain the keyword weight of that related word; the keyword weight of each related word is a positive value.

For example, if the related word "nation A" appears 4 times in total in an effective text, 1 time as the subject and the other 3 times as an attributive, which is not a key syntactic component, then the keyword weight of "nation A" is 1/4 = 0.25.

S10: Obtain an irrelevant keyword list according to a preset irrelevant high-frequency word set; that is, the irrelevant high-frequency word set is used as the irrelevant keyword list.

S11: Calculate, for each irrelevant word in the irrelevant keyword list, the proportion of its occurrences in the effective text in which it serves as a key syntactic component, to obtain the keyword weight of that irrelevant word; the keyword weight of each irrelevant word is a negative value.

For example, if the irrelevant word "nation B" appears 4 times in total in an effective text, 1 time as the subject and the other 3 times as an attributive, which is not a key syntactic component, then the keyword weight of "nation B" is -1/4 = -0.25.
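The weight calculation of steps S9 and S11 can be sketched as follows, reproducing the two worked examples. The sketch assumes each occurrence of a word is available as a (word, syntactic component) pair; the role labels and the function name `keyword_weight` are hypothetical, and returning 0 for a word that never occurs is an assumption.

```python
from typing import Iterable, Set, Tuple

KEY_ROLES = {"subject", "predicate", "object", "core_verb"}


def keyword_weight(
    word: str,
    tokens: Iterable[Tuple[str, str]],
    key_roles: Set[str] = KEY_ROLES,
    sign: int = 1,
) -> float:
    """Share of the word's occurrences that fill a key syntactic component.

    sign=1 gives the positive weight of a related word,
    sign=-1 gives the negative weight of an irrelevant word.
    """
    roles = [role for w, role in tokens if w == word]
    if not roles:
        return 0.0  # assumption: a word that never occurs gets weight 0
    key_hits = sum(1 for role in roles if role in key_roles)
    return sign * key_hits / len(roles)


tokens = (
    [("nation A", "subject")] + [("nation A", "attributive")] * 3
    + [("nation B", "subject")] + [("nation B", "attributive")] * 3
)
weight_a = keyword_weight("nation A", tokens)           # related word
weight_b = keyword_weight("nation B", tokens, sign=-1)  # irrelevant word
```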

S12: Acquire the related words and irrelevant words in the effective text according to the related keyword list and the irrelevant keyword list, and calculate the score of the effective text according to the corresponding keyword weights.

Specifically, the related words and irrelevant words in each clause of an effective text are obtained according to the related keyword list and the irrelevant keyword list; the score of each clause is calculated from the keyword weights of the related words and irrelevant words in that clause; and the score of the effective text is calculated from the proportion of each clause in the effective text and the score of each clause. The proportion of a clause is the ratio of its length (number of characters) to the total length of the effective text.

This step can be calculated according to the following formula:

$$\mathrm{score}(S) = \sum_{i} \frac{|S_i|}{|S|} \sum_{k=1}^{j} p_k$$

where S is the effective text, |S| is the length of the effective text, S_i is the i-th clause of S, |S_i| is the length of the i-th clause, j is the total number of related words and irrelevant words in S_i (S_i is said to hit j keywords), and p_k is the keyword weight of the k-th related or irrelevant word hit.

As can be seen from the above description, in a clause, the more relevant words hit, the higher the score, and the more irrelevant words hit, the lower the score. And summing the scores of all the clauses to obtain the score of the whole sample. When more relevant words are hit in each clause of sample S and less irrelevant words are hit, the higher the score of sample S, the more likely it is not a noisy text.
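The length-weighted scoring described in this step can be sketched as below. Several points are assumptions: keywords are matched as substrings, each distinct keyword counts once per clause, and the text length is taken as the sum of the clause lengths in characters. The function names are hypothetical.

```python
from typing import Dict, List


def clause_score(clause: str, weights: Dict[str, float]) -> float:
    """Sum the weights of the related and irrelevant words hit by the clause."""
    return sum(weight for word, weight in weights.items() if word in clause)


def text_score(clauses: List[str], weights: Dict[str, float]) -> float:
    """Weight each clause score by the clause's share of the text length."""
    total_length = sum(len(clause) for clause in clauses)
    if total_length == 0:
        return 0.0
    return sum(
        len(clause) / total_length * clause_score(clause, weights)
        for clause in clauses
    )


# Positive weights for related words, negative weights for irrelevant words.
weights = {"nation A": 0.5, "nation B": -0.25}
clauses = ["nation A acts", "nation B acts"]
score = text_score(clauses, weights)
```

Both clauses have the same length, so each contributes half of its clause score: 0.5 × 0.5 + 0.5 × (−0.25) = 0.125.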

S13: Judge whether the score of an effective text is smaller than a preset threshold; if so, execute step S14; if not, regard the effective text as a related sample. The threshold can be adjusted according to the actual situation.

If every clause hits at most 1 keyword, the score has an upper limit of 1 and a lower limit of -1. A preferred threshold is 0.5; that is, when on average more than half of the clauses hit a related word, the current sample is generally considered not to be a noise text.

S14: Judge the effective text to be a noise text and delete it. After the noise texts are deleted, the remaining effective texts are the related texts, which mitigates the problem of excessive noise corpora in the search results.
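Steps S13 and S14 reduce to a comparison against the threshold. A minimal sketch, assuming the scores have already been computed and are held in a dictionary keyed by a hypothetical document identifier:

```python
from typing import Dict, List, Tuple


def screen_texts(scores: Dict[str, float], threshold: float = 0.5) -> Tuple[List[str], List[str]]:
    """Split texts into kept (score >= threshold) and noise (score < threshold)."""
    kept = [text_id for text_id, score in scores.items() if score >= threshold]
    noise = [text_id for text_id, score in scores.items() if score < threshold]
    return kept, noise


scores = {"doc_1": 0.8, "doc_2": 0.5, "doc_3": -0.2}
kept, noise = screen_texts(scores)
```

A text whose score equals the threshold is kept, since only scores strictly smaller than the threshold are judged to be noise.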

In the embodiment, through data cleaning, character-level noise information and language segments with less semantics or incomplete semantics are removed; by carrying out sentence segmentation and word segmentation on the effective text, the subsequent analysis of each sentence and the matching of each word are facilitated; the relevance of key words is ensured by acquiring the key words from the co-occurrence sentences; by extracting the related high-frequency words and the unrelated high-frequency words from the related samples and the unrelated samples, the screening quality and manual controllability can be ensured; the method comprises the steps that a seed word set, a keyword set and a preset related high-frequency word set are combined to obtain a related keyword list, and a group of related words of a basic coverage event can be formed; by deleting the stop words, the influence of the stop words on the recognition result of the subsequent text can be avoided, and the recognition accuracy is ensured; the related words and the irrelevant words are used as the appearance proportion of key syntactic components in the effective text and are used as the weight of the keywords of the related words and the irrelevant words, so that the weight is evaluated by scoring different positions, part of the positions are higher, and the other positions are lower or zero; and calculating the score of the effective text according to the number of relevant words and irrelevant words hit by the effective text and the keyword weight of the relevant words and the irrelevant words, and finally judging whether the effective text is a noise text or not according to the score.

This embodiment removes the noise corpora irrelevant to the target from the search results and retains the corpora relevant to the search target, which addresses the poor search results and the large amount of noise corpora that arise when a database is searched by seed words. Because the keyword list is semantically expanded, the coarse search data can be screened and irrelevant texts removed, which improves the corpus quality of the search results and makes it easier for a data center to manage its data.

Example two

The present embodiment is a computer-readable storage medium corresponding to the above-mentioned embodiments, on which a computer program is stored, the program, when executed by a processor, implementing the steps of:

retrieving according to a preset seed word set to obtain an original corpus;

Further, if the score of the valid text is smaller than a preset threshold, after the valid text is determined to be a noise text, the method further includes:

and deleting the noise text.

and obtaining key words in the co-occurrence sentences to obtain a keyword set.

Further, after the word segmentation is performed on the first sample and the second sample respectively to obtain a second word set and a third word set, the method further includes:

In summary, the method for screening out noise documents and the computer-readable storage medium provided by the present invention remove the noise corpora irrelevant to the target from the search results and retain the corpora relevant to the search target, which addresses the poor search results and the large amount of noise corpora that arise when a database is searched by seed words. Because the keyword list is semantically expanded, the coarse search data can be screened and irrelevant texts removed, which improves the corpus quality of the search results and makes it easier for a data center to manage its data.

The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to the related technical fields, are included in the scope of the present invention.

Claims

1. A method of screening noisy documents, comprising:

retrieving to obtain an original corpus according to a preset seed word set;

respectively calculating the appearance proportion of each relevant word in the relevant keyword list as a key syntactic component in the effective text to obtain the keyword weight of each relevant word, wherein the keyword weight of each relevant word is a positive value;

2. The method for screening out noisy documents according to claim 1, wherein if the score of the valid text is smaller than a preset threshold, the method further comprises, after determining that the valid text is a noisy text:

and deleting the noise text.

3. The method for screening out noisy documents according to claim 1, wherein the dividing the sentences into the valid texts and the data cleansing of the valid texts are specifically:

4. The method for screening out noise documents according to claim 1, wherein the key words in the co-occurrence sentences are obtained according to preset key syntax components and key parts of speech, and the obtained key word set specifically comprises:

and obtaining key words in the co-occurrence sentences to obtain a keyword set.

5. The method for screening out noisy documents according to claim 1, wherein before obtaining the related class keyword list according to the seed word set, the keyword set and the preset related class high frequency word set, further comprising:

6. The method for screening noisy documents according to claim 5, wherein said segmenting the first sample and the second sample into words to obtain a second word set and a third word set further comprises:

7. The method for screening out noisy documents according to claim 1, wherein after obtaining the related class keyword table according to the seed word set, the keyword set and the preset related class high frequency word set, further comprising:

and deleting the stop words in the related key word list according to a preset stop word list.

8. The method for screening out noisy documents according to claim 1, wherein said obtaining related terms and irrelevant terms in said valid text according to said related keyword table and irrelevant keyword table, and calculating a score of said valid text according to corresponding keyword weights specifically comprises:

calculating the score of each clause according to the keyword weight of the related words and the keyword weight of the unrelated words in each clause;

9. The method of claim 8, wherein before calculating the score of the valid text according to the specific gravity of each clause in the valid text and the score of each clause, the method further comprises:

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of any one of claims 1-9.