CN110110195B

CN110110195B - Impurity removal method and device

Info

Publication number: CN110110195B
Application number: CN201910374893.9A
Authority: CN
Inventors: 宋小亮; 鄢军
Original assignee: Puxin Hengye Technology Development Beijing Co ltd; Yiren Hengye Technology Development Beijing Co ltd
Current assignee: Puxin Hengye Technology Development Beijing Co ltd; Yiren Hengye Technology Development Beijing Co ltd
Priority date: 2019-05-07
Filing date: 2019-05-07
Publication date: 2022-05-17
Anticipated expiration: 2039-05-07
Also published as: CN110110195A

Abstract

In the impurity removal method and device, the data information to be processed is split into at least one paragraph; an index value of at least one predetermined index that characterizes the impurity degree of a paragraph is determined from the paragraph's features; whether the paragraph is an impurity paragraph is then determined from the at least one index value; and impurity paragraphs are removed from the data information based on that determination, yielding target data information that contains no impurity paragraphs. The application thus provides a technical scheme that identifies and removes impurity paragraphs based on the paragraph features of the data information. Because the structure of the data information does not need to be analyzed, upgrades or changes to that structure cannot cause the capture of valid information content to fail or become inaccurate, and no high labor cost is incurred, so impurities in the data information are removed efficiently, accurately, and at low cost.

Description

Impurity removal method and device

Technical Field

The invention belongs to the technical field of information extraction and filtration, and particularly relates to an impurity removing method and device.

Background

In the internet era of information explosion, web crawlers capture massive amounts of article data almost every day and provide information services through article publishing platforms. However, the quality of the captured articles is often uneven: they may be mixed with impurity information such as promotion information, operation advertisements, external links, useless links, and pictures, which greatly degrades the user's reading experience. The impurity information in web articles therefore needs to be removed to provide users with a comfortable reading environment and reading experience.

At present, impurities such as promotion information and operation advertisements in web articles are removed either by condition-based content selection or by subsequent manual intervention. In condition-based content selection, the article structure of the website to be captured is generally analyzed to determine an identifier (ID) or name (Name) that uniquely identifies the "target content" to be captured; a capture program then captures the target content corresponding to that identifier or name, so that impurity information such as promotion information and operation advertisements, which does not need to be captured, is excluded and filtered out. In subsequent manual intervention, impurity information in the captured articles must be identified and removed by manual review; this inevitably incurs high labor cost, is inefficient, and cannot meet the need to remove impurities quickly and efficiently in article capture-and-publish scenarios.

Disclosure of Invention

In view of the above, the present invention provides a method and an apparatus for removing impurities, so as to remove impurities, such as promotion information and operation advertisements, from data information, such as network articles, with high efficiency, high accuracy, and low cost.

Therefore, the invention discloses the following technical scheme:

an impurity removal method comprising:

acquiring data information to be processed;

splitting the data information into at least one paragraph;

determining an index value of at least one predetermined index of the paragraph based on the paragraph features of the paragraph, wherein the index value of each index of the paragraph can be used for characterizing the impurity degree of the paragraph;

determining whether the paragraph is an impurity paragraph according to the index value of the at least one preset index of the paragraph to obtain an impurity paragraph determination result;

and based on the impurity paragraph determination result, carrying out impurity paragraph clearing processing on the data information to obtain target data information which does not include impurity paragraphs.

The above method, preferably, the splitting the data information into at least one paragraph includes:

and splitting the data information by taking a label of a preset mark language in the data information as a paragraph splitting basis to obtain at least one paragraph.

The above method, preferably, the determining an index value of at least one predetermined index of the paragraph includes:

and determining at least one of length weight corresponding to the length of the paragraph, position weight corresponding to the position of the paragraph in the data information, and matching degree information of words in the paragraph and impurity feature words included in each preset impurity feature phrase template.

Preferably, the determining information of the matching degree between the words in the paragraphs and the impurity feature words included in the predetermined impurity feature phrase templates includes:

extracting participles and/or keywords in the paragraphs; and performs the following processing:

matching the participles in the paragraphs with the characteristic participles included in each impurity characteristic phrase template; calculating the feature word segmentation matching degree of the paragraph and each impurity feature phrase template based on the matching number of the feature words included in the paragraph and each impurity feature phrase template and the total number of the feature words in the impurity feature phrase template; selecting the characteristic word segmentation matching degree with the largest value as the impurity characteristic word segmentation weight of the paragraph;

and/or,

matching the keywords in the paragraphs with the feature keywords included in each impurity feature phrase template; calculating the matching degree of the characteristic keywords of the paragraph and each impurity characteristic phrase template based on the matching number of the paragraph and the characteristic keywords included in each impurity characteristic phrase template and the total number of the characteristic keywords in the impurity characteristic phrase template; and selecting the characteristic keyword matching degree with the largest value as the impurity characteristic keyword weight of the paragraph to obtain the matching degree information comprising the impurity characteristic word segmentation weight and/or the impurity characteristic keyword weight.

In the method, preferably, the participles or keywords in the paragraph are matched with the feature participles or feature keywords included in the impurity feature phrase template, and are any one of the following:

the words of the participles or the keywords in the paragraphs are matched with the words of the corresponding characteristic participles or the characteristic keywords in the impurity characteristic phrase template;

the participles or keywords in the paragraph are matched with the corresponding characteristic participles or characteristic keywords in the impurity characteristic phrase template in sequence.

The above method, preferably, the determining whether the paragraph is an impurity paragraph according to the index value of the at least one predetermined index of the paragraph includes:

judging whether the index value of the at least one preset index of the paragraph meets a preset impurity paragraph condition, if so, determining that the paragraph is an impurity paragraph, and if not, determining that the paragraph is not an impurity paragraph; the impurity paragraph condition comprises a value requirement for the at least one predetermined index;

or,

and calculating the impurity degree of the paragraph based on the index value of the at least one predetermined index of the paragraph, and judging whether the impurity degree of the paragraph reaches a predetermined impurity degree threshold value, wherein if the impurity degree of the paragraph reaches the predetermined impurity degree threshold value, the paragraph is an impurity paragraph, and if the impurity degree of the paragraph does not reach the predetermined impurity degree threshold value, the paragraph is not an impurity paragraph.

Preferably, the method, based on the determination result of the impurity section, of performing impurity section removal processing on the data information to obtain target data information not including an impurity section, includes:

and recombining each non-impurity paragraph in the data information to obtain the target data information without the impurity paragraph.

The above method, preferably, before determining an index value of at least one predetermined index of a paragraph based on the paragraph feature of the paragraph, further includes:

and filtering out any one or more items of information in numbers, letters and dynamic scripts in the paragraphs.

An impurity removing device comprising:

an acquisition unit, configured to acquire data information to be processed;

the splitting unit is used for splitting the data information into at least one paragraph;

a first determining unit, configured to determine an index value of at least one predetermined index of a paragraph based on a paragraph feature of the paragraph, where the index value of each index of the paragraph can be used to characterize a degree of impurity of the paragraph;

the second determining unit is used for determining whether the paragraph is an impurity paragraph according to the index value of the at least one preset index of the paragraph to obtain an impurity paragraph determining result;

and the clearing processing unit is used for carrying out clearing processing on the impurity paragraphs on the data information based on the impurity paragraph determination result so as to obtain target data information which does not include the impurity paragraphs.

The above apparatus, preferably, the first determining unit is specifically configured to:

In the above apparatus, preferably, the determining, by the first determining unit, matching degree information between a word in the paragraph and impurity feature words included in each predetermined impurity feature phrase template specifically includes:

and/or,

The above apparatus, preferably, the second determining unit is specifically configured to:

judging whether the index value of the at least one preset index of the paragraph meets a preset impurity paragraph condition, if so, judging that the paragraph is an impurity paragraph, and if not, judging that the paragraph is not an impurity paragraph; the impurity paragraph condition comprises a value requirement for the at least one predetermined index;

or,

The above apparatus, preferably, further comprises:

and the filtering unit is used for filtering any one or more items of information in numbers, letters and dynamic scripts in the paragraphs before determining the index value of at least one predetermined index of the paragraphs.

According to the above scheme, the impurity removing method and apparatus provided by the application divide the data information to be processed into at least one paragraph, determine at least one index value of a predetermined index, which can be used for characterizing the impurity degree of the paragraph, of the paragraph based on the paragraph features, then further determine whether the paragraph is an impurity paragraph according to the at least one index value, and perform impurity paragraph removing processing on the data information based on the determination result, so as to finally obtain the target data information without the impurity paragraph. Therefore, the technical scheme for determining whether the paragraph is the impurity paragraph or not based on the paragraph characteristics of the data information and further clearing the impurity paragraph is provided, the structure of the data information (such as the article structure of a website) does not need to be analyzed, the failure or inaccuracy of grabbing of the effective information content of the data information due to the structural upgrade or change of the data information is avoided, and higher labor cost is not needed, so that the impurities in the data information are cleared efficiently, with high accuracy and at low cost.
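By way of illustration only, the final clearing step (keeping the non-impurity paragraphs and recombining them) can be sketched as follows. The function name `remove_impurity_paragraphs`, the caller-supplied predicate `is_impurity`, and the separator `sep` are hypothetical names introduced for this sketch; how a paragraph is actually judged to be an impurity paragraph is left to the predicate.

```python
from typing import Callable, Iterable


def remove_impurity_paragraphs(
    paragraphs: Iterable[str],
    is_impurity: Callable[[str], bool],
    sep: str = "\n",
) -> str:
    """Recombine, in original order, the paragraphs not judged to be impurities.

    `is_impurity` is a hypothetical, caller-supplied decision function; this
    sketch does not prescribe how the decision is made.
    """
    kept = [p for p in paragraphs if not is_impurity(p)]
    return sep.join(kept)
```

For example, passing a predicate that flags any paragraph containing a known promotional phrase returns the article text with those paragraphs dropped.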

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic flow chart of an impurity removal method according to an embodiment of the present disclosure;

Fig. 2 is a logic diagram of a process for determining whether a paragraph is an impurity paragraph based on a predetermined impurity paragraph condition according to an embodiment of the present disclosure;

Fig. 3 is another schematic flow chart of an impurity removal method according to an embodiment of the present disclosure;

Fig. 4 is a logic diagram of a process of removing impurity information from a web article according to the scheme of the present application, provided in an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an impurity removing apparatus according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of another impurity removing apparatus according to an embodiment of the present application.

Detailed Description

For ease of reference and clarity, the technical terms and abbreviations used hereinafter are summarized as follows:

TF-IDF algorithm: used to evaluate how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in that document, but decreases in inverse proportion to the frequency with which it appears across the corpus. Various forms of TF-IDF weighting are often applied by search engines to measure or rate the relevance between a document and a user query. The corpus used by the algorithm is trained on resources such as the 2014 People's Daily corpus from Tsinghua University, the Chinese Academy of Sciences, and the like.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The application provides a method and a device for removing impurities, which aim to remove impurities in data information with high efficiency, high accuracy and low cost, are suitable for but not limited to scenes of identifying and filtering impurity information such as promotion information, operation advertisements, external links and the like in network articles when the network articles are captured, and are explained in detail through specific embodiments.

Referring to fig. 1, a flowchart of an impurity removing method according to an embodiment of the present application is provided, in this embodiment, as shown in fig. 1, the impurity removing method may include the following processing steps:

step 101, obtaining data information to be processed.

The data information to be processed may be, but is not limited to, a web article captured from the network using crawler technology. Originally captured data information such as web articles is often mixed with impurity information such as promotion information, operation advertisements, and external links, which degrades the user's reading experience.

Next, the present application will mainly take the data information as a web article as an example to describe a processing procedure of the scheme of the present application.

In step 101, the data information to be processed may be acquired in either of two ways in a specific implementation. A batch of data information may be acquired at one time, after which impurity removal is performed on each piece of data information in the batch through the subsequent processing of the present scheme. Alternatively, only one piece of data information may be acquired at a time and processed for impurity removal; after the current piece has been processed, the next piece is acquired and processed in turn. This embodiment does not limit the way in which data information is acquired and processed for impurity removal in a specific implementation.

The specific content of the piece of data information depends on a specific application scenario, taking a capture scenario of a network article as an example, the piece of data information refers to a network article which may carry impurity information and is captured from a network based on technologies such as a crawler and the like.

Step 102, splitting the data information into at least one paragraph.

The data information can be split by taking a label of a predetermined markup language in the data information as a paragraph splitting basis to obtain at least one paragraph.

More specifically, when the data information is a web article, the captured article is usually in HTML (HyperText Markup Language) format, so HTML tags in the article, such as <p>, <br>, and <div>, can be used as the basis for splitting the article into paragraphs. For convenience of subsequent processing, the paragraphs obtained by splitting are preferably stored in a data set in their original order in the web article, giving the paragraph set:

Paragraphs = {P_1, P_2, ..., P_N};

where P_k denotes the kth paragraph of the web article (1 ≤ k ≤ N, k is a natural number) and N denotes the number of paragraphs included in the web article. P_k may be a valid-information paragraph of the web article, i.e. a non-impurity paragraph, or it may be an impurity paragraph.
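As a minimal sketch of the splitting in step 102, assuming regular-expression handling is adequate for the captured HTML, the article can be cut at <p>, <br>, and <div> tags and the resulting fragments kept in their original order. The function name `split_paragraphs` and the two pattern names are hypothetical; a production system might instead use a full HTML parser.

```python
import re
from html import unescape

# Tags treated as paragraph boundaries, following the examples in the text.
_SPLIT_TAGS = re.compile(r"</?(?:p|br|div)\b[^>]*>", re.IGNORECASE)
# Any other tag is simply stripped from the paragraph text.
_ANY_TAG = re.compile(r"<[^>]+>")


def split_paragraphs(html: str) -> list[str]:
    """Split an HTML article into paragraphs P_1..P_N, preserving order."""
    paragraphs = []
    for chunk in _SPLIT_TAGS.split(html):
        text = unescape(_ANY_TAG.sub("", chunk)).strip()
        if text:  # drop empty fragments between adjacent boundary tags
            paragraphs.append(text)
    return paragraphs
```

The returned list plays the role of the paragraph set Paragraphs, with list index k - 1 corresponding to P_k.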

Step 103, determining an index value of at least one predetermined index of the paragraph based on the paragraph features of the paragraph, wherein the index value of each index of the paragraph can be used for characterizing the impurity degree of the paragraph.

In this embodiment, based on the paragraph feature of the paragraph, the index value of the at least one predetermined index of the determined paragraph may include, but is not limited to, any one or more of the following indexes:

(I) Length weight corresponding to the length of the paragraph

For a web article, the word count of a paragraph of valid information content usually falls within the interval [80, 300]. A paragraph with too few words cannot express clear information content, and promotion information, advertisement information, and the like usually contain few words. A word-count threshold can therefore be set, for example 80: a paragraph whose word count is below the threshold is regarded as a suspected impurity paragraph, whereas a paragraph whose word count is above the threshold is regarded as a suspected valid-information paragraph. A corresponding length weight can then be assigned to the paragraph based on the comparison between its word count and the threshold.

It should be noted that the threshold given in this embodiment is only an illustrative, non-limiting example of the word-count threshold. In a specific implementation, the threshold can be set flexibly by a technician based on actual statistics of the word counts of valid-information paragraphs and impurity paragraphs of web pages, and is not limited to the value given above.
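The threshold comparison described above can be sketched as follows. The text does not state which weight values are assigned, so the values 1.0 (suspected impurity) and 0.0 (suspected valid information) are assumptions made for this sketch, as are the name `length_weight` and the choice to count non-whitespace characters as "words".

```python
WORD_COUNT_THRESHOLD = 80  # illustrative value from the text; tune per corpus


def length_weight(paragraph: str, threshold: int = WORD_COUNT_THRESHOLD) -> float:
    """Return a hypothetical length weight for one paragraph.

    1.0 marks a suspected impurity paragraph (too short), 0.0 a suspected
    valid-information paragraph. Both values are assumptions of this sketch.
    """
    # Assumption: the word count is approximated by the number of
    # non-whitespace characters, which suits Chinese text.
    count = sum(1 for ch in paragraph if not ch.isspace())
    return 1.0 if count < threshold else 0.0
```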

(II) position weight corresponding to position of paragraph appearing in the data information

Depending on the article publishing platform, impurity information such as advertisements, promotion information, and external links may appear at any position in a web article, such as the beginning, the middle, or the end. Usually, however, such impurity information is concentrated in one or a few parts of the article, for example at the end. The position at which a paragraph appears in the web article can therefore also serve as an important factor in estimating whether the paragraph is an impurity paragraph.

In view of this, the present embodiment uses the position weight corresponding to the position at which a paragraph appears in the data information (such as a web article) as an index for identifying whether the paragraph is an impurity paragraph. Taking the case in which impurity paragraphs are located at the end of the article as an example, the position weight corresponding to the position of a paragraph in the web article may be calculated in the following manner:

where Position_i denotes the position weight of the ith paragraph in the web article, atl (absolute total length) denotes the total length (i.e. total number of words) of the web article, and pl is the length of the paragraph.
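The exact expression for the position weight is not reproduced in this text, so the following is only one possible sketch under a stated assumption: the weight of the ith paragraph is taken to be the cumulative length of paragraphs 1 through i divided by atl, so that paragraphs nearer the end of the article receive weights closer to 1. The function name `position_weights` and this particular formula are assumptions, not the formula of the embodiment.

```python
def position_weights(paragraphs: list[str]) -> list[float]:
    """Return one hypothetical position weight per paragraph.

    Assumed formula (not taken from the source text):
        Position_i = (pl_1 + ... + pl_i) / atl
    """
    lengths = [len(p) for p in paragraphs]  # pl for each paragraph
    atl = sum(lengths)                      # total length of the article
    if atl == 0:
        return [0.0] * len(paragraphs)
    weights = []
    seen = 0
    for pl in lengths:
        seen += pl
        weights.append(seen / atl)
    return weights
```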

(III) matching degree information of words in paragraphs and impurity feature words included in predetermined impurity feature phrase templates

The words in the paragraphs and the impurity feature words included in the impurity feature phrase templates may be, but are not limited to, participles (segmented words) and/or keywords.

Impurity information such as promotion information, operation advertisements, and external links often contains fairly distinctive impurity feature phrases, such as "more contents", "similar content viewing", "more links query", "I want to feed back", "view the whole text", "share", and "text related recommendations".

It should be noted that, in an actual application scenario, the content of a template does not necessarily appear only in impurity paragraphs; it may also appear in valid-information paragraphs of the article. For example, the "more content" appearing in an article may merely be part of a sentence in a valid-information paragraph rather than part of an impurity paragraph. In such cases, the indexes for the other factors provided by this embodiment (such as paragraph length and paragraph position) can be combined to identify accurately whether the paragraph is an impurity paragraph and to prevent misjudgment.

In a specific implementation, a number of web articles can be crawled in advance using crawler technology, and the impurity feature phrase templates can be formulated in advance from the feature phrases of the impurity paragraphs in those articles. Feature participles and/or feature keywords are then extracted from each impurity feature phrase template, yielding a feature participle and/or feature keyword sequence for each template. Taking the extraction of both feature participles and feature keywords from the impurity feature phrase templates as an example, the extraction results can be expressed as follows:

where dirty_words denotes the set of impurity feature phrase templates; n denotes the number of impurity feature phrase templates and is usually a natural number greater than 1; d_work_i denotes the feature participles of the ith impurity feature phrase template, and mi denotes the number of those feature participles; d_tag_i denotes the feature keywords of the ith impurity feature phrase template, and ti denotes the number of those feature keywords; 1 ≤ i ≤ n, and i is a natural number.

Given the preset impurity feature phrase templates, for data information to be processed (such as a web article), after the article has been split into paragraphs, participles and/or keywords can be further extracted from each paragraph. Optionally, the part of speech of each participle or keyword can be extracted at the same time. The extracted part of speech may be one of the 12 parts of speech defined for modern Chinese: 6 kinds of content words (nouns, verbs, adjectives, numerals, quantifiers, and pronouns) and 6 kinds of function words (adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia).

Specifically, a word segmentation tool may be used to extract participles from the paragraphs. For the kth paragraph, taking the simultaneous extraction of participles and their parts of speech as an example, the following participle and part-of-speech sequence is obtained:

words_kM1 = {(word_k1, pos_k1), (word_k2, pos_k2), ..., (word_kM1, pos_kM1)};

where words_kM1 denotes the participle (and part-of-speech) sequence of the kth paragraph, word_kj1 denotes the j1-th participle of the kth paragraph, pos_kj1 denotes the part of speech of the j1-th participle of the kth paragraph, M1 denotes the number of participles of the kth paragraph, 1 ≤ k ≤ N, 1 ≤ j1 ≤ M1, and k and j1 are natural numbers.

Keywords in a paragraph can be extracted using the TF-IDF algorithm (Term Frequency-Inverse Document Frequency, a weighting technique commonly used in information retrieval and data mining). For the kth paragraph, keyword extraction yields the keyword sequence:

tags_kM2 = {tag_k1, tag_k2, ..., tag_kM2};

where tags_kM2 denotes the keyword sequence of the kth paragraph, tag_kj2 denotes the j2-th keyword of the kth paragraph, M2 denotes the number of keywords of the kth paragraph, 1 ≤ k ≤ N, 1 ≤ j2 ≤ M2, and k and j2 are natural numbers.
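To illustrate how TF-IDF can rank candidate keywords, the following sketch scores the already-segmented tokens of one paragraph against a small reference corpus. It is not the specific tool or corpus of the embodiment: the function name `tfidf_keywords`, the smoothed IDF variant, and the alphabetical tie-breaking are all assumptions made for this example, and segmentation is assumed to have been done beforehand.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=3):
    """Rank the tokens of one paragraph by TF-IDF against a reference corpus.

    doc_tokens: list of tokens (participles) of the paragraph.
    corpus:     list of token lists, one per reference document.
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF, one common variant (an assumption of this sketch).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = (count / len(doc_tokens)) * idf
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]
```

Tokens that are frequent in the paragraph but rare in the corpus rank highest, which is why keyword extraction tends to be coarser than word segmentation.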

On the basis of extracting the participles and/or keywords of the paragraphs, matching degree information of the participles and/or keywords and other words in the paragraphs and impurity feature words (feature participles and/or feature keywords) included in predetermined impurity feature phrase templates can be further determined, wherein the matching degree information specifically includes:

matching the participles in the paragraphs with the characteristic participles included in each impurity characteristic phrase template; calculating the feature word segmentation matching degree of the paragraph and each impurity feature phrase template based on the matching number of the feature words included in the paragraph and each impurity feature phrase template and the total number of the feature words in the impurity feature phrase template; selecting the characteristic word segmentation matching degree with the largest value as the impurity characteristic word segmentation weight of the paragraph, namely:

word_weight = max(words_match / words_size);

where word_weight denotes the impurity feature participle weight of a paragraph, words_match denotes the number of feature participles of an impurity feature phrase template that are matched in the paragraph, and words_size denotes the number of feature participles included in that impurity feature phrase template;

and/or,

matching the keywords in the paragraphs with the feature keywords included in each impurity feature phrase template; calculating the matching degree of the characteristic key words of the paragraph and each impurity characteristic phrase template based on the matching number of the paragraph and the characteristic key words included in each impurity characteristic phrase template and the total number of the characteristic key words in the impurity characteristic phrase template; selecting the characteristic keyword matching degree with the largest value as the impurity characteristic keyword weight of the paragraph, namely:

tag_weight = max(tags_match / tags_size);

where tag_weight denotes the impurity feature keyword weight of a paragraph, tags_match denotes the number of feature keywords of an impurity feature phrase template that are matched in the paragraph, and tags_size denotes the number of feature keywords included in that impurity feature phrase template.
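The two weights above share the same form, a maximum over templates of the ratio of matched feature words to the template size, so a single helper can illustrate both. In this sketch the name `match_weight` is hypothetical, each template is represented simply as a list of its feature words, and matching is done on the words themselves only.

```python
def match_weight(paragraph_words, templates):
    """Return max over templates of (matched feature words / template size).

    paragraph_words: participles (for word_weight) or keywords (for
                     tag_weight) extracted from one paragraph.
    templates:       list of feature-word lists, one per impurity feature
                     phrase template.
    """
    present = set(paragraph_words)
    best = 0.0
    for feature_words in templates:
        if not feature_words:
            continue  # skip empty templates to avoid division by zero
        matched = sum(1 for w in feature_words if w in present)
        best = max(best, matched / len(feature_words))
    return best
```

Calling the helper with a paragraph's participles and the templates' feature participles gives word_weight; calling it with keywords and feature keywords gives tag_weight.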

The keyword weight reflects the degree to which feature keywords occur in a paragraph of the article. Both keywords and participles are extracted from the words in the paragraph; the difference is that keyword extraction is coarser-grained. For example, for the phrase "you may also be interested in" in a paragraph, the extracted keyword is "interest", while the extracted participles are "you", "may", "also", and "interested".

Further, the matching of the segmentation word or the keyword in the paragraph with the feature segmentation word and/or the feature keyword included in the impurity feature phrase template may be any one of the following:

1) The words themselves of the participles or keywords in the paragraph match the words themselves of the corresponding feature participles or feature keywords in the impurity feature phrase template.

In this case, only the words themselves of the participles or keywords in the paragraph are required to match those of the corresponding feature participles or feature keywords in the impurity feature phrase template; no match in terms of word order or part of speech is required.

2) The participles or keywords in the paragraph are matched with the corresponding characteristic participles or characteristic keywords in the impurity characteristic phrase template in sequence.

In this case, both the words themselves and their order must match; only when both requirements are satisfied are the participles or keywords in the paragraph considered to match the corresponding feature participles or feature keywords in the impurity feature phrase template. For example, suppose the impurity feature phrase template is "view full text information". If a paragraph of the web article reads "xxx is mentioned multiple times in the full text, and xxx is edited and viewed", then "full text" precedes "view" in the paragraph, which does not agree with the relative order of the participles "view" and "full text" in the template, so "full text" and "view" in the paragraph do not match the feature participles or feature keywords "view" and "full text" in the template. If instead a paragraph reads "view full text content.", then "view" and "full text" in that paragraph match "view" and "full text" in the template, respectively.

In addition, for matching manners 1) and 2) above, a part-of-speech requirement may also be imposed, that is, the part of speech of the participle or keyword in the paragraph is additionally required to match the part of speech of the corresponding feature participle or feature keyword in the impurity feature phrase template.

In a specific implementation, a technician may freely set the matching manner of the feature participles or feature keywords in each impurity feature phrase template according to actual requirements, for example, matching manner 1), or matching manner 2), or matching manner 1) or 2) combined with a part-of-speech matching requirement.
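The two matching manners can be sketched in Python as follows. This is a hedged illustration, not the implementation of the embodiment: the function name match_count is hypothetical, only the first occurrence of each word in the paragraph is considered, and the ordered manner is read as all-or-nothing (if the matched words appear in a different relative order than in the template, nothing is counted), which is one reading of the "view full text" example above.

```python
from typing import Sequence


def match_count(paragraph: Sequence[str], template: Sequence[str],
                ordered: bool = False) -> int:
    """Count template terms found in the paragraph.

    ordered=False: manner 1), only the words themselves must match.
    ordered=True:  manner 2), the matched words must also appear in the
                   paragraph in the same relative order as in the template;
                   otherwise nothing is counted (assumed all-or-nothing).
    """
    first_pos = {}
    for i, term in enumerate(paragraph):
        first_pos.setdefault(term, i)  # remember the first occurrence only
    positions = [first_pos[t] for t in template if t in first_pos]
    if ordered and any(a >= b for a, b in zip(positions, positions[1:])):
        return 0
    return len(positions)


template = ["view", "full text", "information"]
p1 = ["xxx", "mentioned", "full text", "xxx", "edited", "view"]
p2 = ["view", "full text", "content"]
```

With these inputs, p1 matches two words under manner 1) but none under manner 2), while p2 matches two words under both. A part-of-speech requirement could be layered on by comparing (word, part-of-speech) pairs instead of bare words, which is likewise an assumption of this sketch.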

Step 104: determining whether the paragraph is an impurity paragraph according to the index value of the at least one predetermined index of the paragraph, to obtain an impurity paragraph determination result.

After the index value of the at least one predetermined index of the paragraph is determined, it may be determined whether the paragraph is an impurity paragraph based on the index value of the at least one predetermined index.

As a possible implementation manner, an impurity paragraph condition (or a set of such conditions) may be preset, and the impurity paragraph condition may include a value requirement for the at least one predetermined index. On this basis, it is judged whether the index value of the at least one predetermined index of the paragraph meets the preset impurity paragraph condition; if so, the paragraph is an impurity paragraph, and if not, the paragraph is not an impurity paragraph.

By way of example and not limitation, the following impurity paragraph conditions may be specifically set:

condition 1: (word_match ≤ 2 and all matched) or (word_match > 2 and word_weight > 0.1) and tag_weight > 0.5;

condition 2: position_i > 0.75 and tag_weight > 0.5 and word_weight > 0.6;

condition 3: position_i > 0.85 and tag_weight > 0.85 and word_weight > 0.4;

The ith paragraph is considered to be an impurity paragraph when condition 1 is satisfied and (condition 2 is satisfied or condition 3 is satisfied); otherwise, the paragraph is not considered to be an impurity paragraph.

As shown in fig. 2, a schematic diagram of the processing procedure for determining whether a paragraph is an impurity paragraph based on the above conditions in this example is provided.
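The example conditions can be written out as a small Python predicate. This is a sketch under assumptions that the text does not settle: the function and parameter names are hypothetical, position stands for position_i, all_matched stands for "all matched" in condition 1, "and" is taken to bind tighter than "or" inside condition 1, and condition 1 is taken to be combined with the group (condition 2 or condition 3) by "and".

```python
def is_impurity_paragraph(word_match: int, all_matched: bool,
                          word_weight: float, tag_weight: float,
                          position: float) -> bool:
    """Evaluate the example impurity paragraph conditions for one paragraph."""
    # Condition 1, with "and" binding tighter than "or" (assumed precedence).
    cond1 = (word_match <= 2 and all_matched) or (
        word_match > 2 and word_weight > 0.1 and tag_weight > 0.5)
    cond2 = position > 0.75 and tag_weight > 0.5 and word_weight > 0.6
    cond3 = position > 0.85 and tag_weight > 0.85 and word_weight > 0.4
    # Assumed connective: condition 1 AND (condition 2 OR condition 3).
    return cond1 and (cond2 or cond3)
```

For instance, a paragraph near the end of the article (position 0.8) with word_match = 3, word_weight = 0.7 and tag_weight = 0.6 satisfies conditions 1 and 2 and would be flagged, whereas the same values at position 0.5 would not.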

As another possible implementation manner, the method may further calculate the degree of impurity of the paragraph by combining with the index value of the at least one predetermined index of the paragraph, determine whether the degree of impurity of the paragraph reaches a predetermined threshold value of the degree of impurity, and if so, determine that the paragraph is an impurity paragraph; otherwise, if not, the section is judged not to be the impurity section.
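The text does not specify how the index values are combined into an impurity degree. Purely as an illustration, the following Python sketch assumes a weighted average; the function names, the combination rule, the coefficients and the threshold are all hypothetical and are not taken from the embodiment.

```python
from typing import Mapping


def impurity_degree(index_values: Mapping[str, float],
                    coefficients: Mapping[str, float]) -> float:
    """Weighted average of the index values (hypothetical combination rule)."""
    total = sum(coefficients.values())
    if total == 0:
        raise ValueError("coefficients must not sum to zero")
    return sum(coefficients[name] * index_values[name]
               for name in coefficients) / total


def reaches_threshold(index_values: Mapping[str, float],
                      coefficients: Mapping[str, float],
                      threshold: float) -> bool:
    """A paragraph is treated as impurity when its degree reaches the threshold."""
    return impurity_degree(index_values, coefficients) >= threshold


# Illustrative values only.
values = {"word_weight": 0.8, "tag_weight": 0.6}
coeffs = {"word_weight": 1.0, "tag_weight": 1.0}
```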

Step 105: based on the impurity paragraph determination result, performing impurity paragraph removal processing on the data information to obtain target data information that does not include impurity paragraphs.

Once it has been determined whether each paragraph in the data information, such as a web article, is an impurity paragraph, the impurity paragraphs may be removed from the data information based on the impurity paragraph determination result.

Specifically, but without limitation, the impurity paragraphs may be removed by recombining the non-impurity paragraphs (i.e., the valid information paragraphs) in the data information, so that target data information without impurity paragraphs is finally obtained.

When each non-impurity paragraph in the data information is recombined, taking the data information as a network article as an example, the relative position or relative sequence of each non-impurity paragraph in the network article can be referred to, and each non-impurity paragraph in the network article is recombined in sequence, so that the finally obtained article without the impurity information can maintain the original position relationship of each non-impurity paragraph in the original network article.
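The order-preserving recombination described above can be sketched in a few lines of Python. The function name recombine and the newline separator are assumptions made for illustration.

```python
from typing import Sequence


def recombine(paragraphs: Sequence[str], is_impurity: Sequence[bool],
              separator: str = "\n") -> str:
    """Join the non-impurity paragraphs in their original relative order."""
    if len(paragraphs) != len(is_impurity):
        raise ValueError("one flag is required per paragraph")
    kept = [p for p, bad in zip(paragraphs, is_impurity) if not bad]
    return separator.join(kept)
```

Because the paragraphs are filtered in a single pass over the original sequence, the retained paragraphs keep the positional relationship they had in the original article.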

Based on the above statements, the present embodiment provides a technical solution for determining whether a paragraph is an impurity paragraph based on the paragraph features of the data information itself and further clearing the impurity paragraph, without analyzing the structure of the data information (such as the article structure of a website), without failure or inaccurate grabbing of effective information content due to upgrading or changing the structure of the data information, and without high human cost, thereby achieving efficient, high-accuracy and low-cost clearing of impurities in the data information.

In other embodiments of the present application, referring to another flowchart of the impurity removing method shown in fig. 3, before the step 103, the method may further include:

Step 102': filtering out any one or more of the numbers, letters, and dynamic scripts in the paragraphs.

A paragraph of a network article may include numeric or alphabetic information, such as a numeric string containing an identification number, or a letter string made up of one or more English words inside a Chinese paragraph. When paragraph word counts are computed (for example, when counting the words of a paragraph in order to calculate its impurity feature participle weight or impurity feature keyword weight), each single digit (e.g., "1", "2") or single letter (e.g., "a", "b") is generally counted as one word. This does not reflect the fact that a numeric string such as an identification number, or a letter string such as an English word, actually represents a single word; it therefore distorts the statistics of the paragraph length and interferes with the determination result of the impurity paragraph.

In addition, a paragraph of the web article may further include a dynamic script (JavaScript). Similarly, the dynamic script may affect the statistics of the paragraph length (a dynamic script generally includes a plurality of English words), and, because the dynamic script may include HTML tags, it may also affect the splitting of paragraphs.

In view of this, in this embodiment, before the index value of the at least one predetermined index of the paragraph is determined, any one or more of the numbers, letters, and dynamic scripts in the paragraph are filtered out, so as to eliminate interference in the determination process of the impurity paragraph.

A dynamic script may specifically be identified based on the <script> tag, and the dynamic script content between the <script> tags is removed.

Referring to fig. 4, fig. 4 provides a schematic diagram of the processing logic for clearing the impurity information in a web article based on the scheme of this embodiment. Based on this processing procedure, a web article without impurity information such as promotion information, advertisements, and external links can finally be obtained.

Based on the scheme of the embodiment, the interference existing in the determination process of the impurity paragraphs can be effectively eliminated by filtering any one or more items of information in numbers, letters and dynamic scripts in the paragraphs before the index value of at least one predetermined index of the paragraphs is determined, so that the identification accuracy of the impurity paragraphs can be further improved.

Corresponding to the foregoing impurity removing method, an embodiment of the present application further provides an impurity removing apparatus, and with reference to a schematic structural diagram of the impurity removing apparatus shown in fig. 5, the apparatus includes:

an obtaining unit 501, configured to obtain data information to be processed;

a splitting unit 502, configured to split the data information into at least one paragraph;

a first determining unit 503, configured to determine an index value of at least one predetermined index of a paragraph based on a paragraph feature of the paragraph, where the index value of each index of the paragraph can be used to characterize a degree of impurity of the paragraph;

a second determining unit 504, configured to determine whether a paragraph is an impurity paragraph according to an index value of the at least one predetermined index of the paragraph, so as to obtain an impurity paragraph determination result;

a clearing processing unit 505, configured to perform impurity paragraph clearing processing on the data information based on the impurity paragraph determination result, to obtain target data information that does not include impurity paragraphs.
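How the units cooperate can be sketched as a single Python function in which each unit is passed in as a callable. This is only an illustrative arrangement under assumptions: the function name remove_impurities, the callable signatures, and the toy split and scoring rules in the usage lines are hypothetical and are not the device's actual interfaces.

```python
from typing import Callable, Dict, List


def remove_impurities(data: str,
                      split: Callable[[str], List[str]],
                      index_values: Callable[[str, int, int], Dict[str, float]],
                      is_impurity: Callable[[Dict[str, float]], bool],
                      separator: str = "\n") -> str:
    """Chain the roles of units 502 to 505 over already obtained data."""
    paragraphs = split(data)                                  # splitting unit 502
    kept = []
    for i, paragraph in enumerate(paragraphs):
        values = index_values(paragraph, i, len(paragraphs))  # first determining unit 503
        if not is_impurity(values):                           # second determining unit 504
            kept.append(paragraph)
    return separator.join(kept)                               # clearing processing unit 505


# Toy callables for illustration only.
result = remove_impurities(
    "news\nclick to view full text\nmore news",
    split=str.splitlines,
    index_values=lambda p, i, n: {"tag_weight": 1.0 if "view full text" in p else 0.0},
    is_impurity=lambda v: v["tag_weight"] > 0.5,
)
```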

In an implementation manner of the embodiment of the present application, the splitting unit 502 is specifically configured to: and based on the impurity paragraph determination result, carrying out impurity paragraph clearing processing on the data information to obtain target data information which does not include impurity paragraphs.

In an implementation manner of the embodiment of the present application, the first determining unit 503 is specifically configured to: and determining at least one of length weight corresponding to the length of the paragraph, position weight corresponding to the position of the paragraph in the data information, and matching degree information of words in the paragraph and impurity feature words included in each predetermined impurity feature phrase template.

In an implementation manner of the embodiment of the present application, the determining, by the first determining unit 503, matching degree information between a word in a paragraph and impurity feature words included in predetermined impurity feature phrase templates specifically includes:

and/or,

In an implementation manner of the embodiment of the present application, the participles or keywords in the paragraph are matched with the feature participles or feature keywords included in the impurity feature phrase template, and are any one of the following: the words of the participles or the keywords in the paragraphs are matched with the words of the corresponding characteristic participles or the characteristic keywords in the impurity characteristic phrase template; the participles or keywords in the paragraph are matched with the corresponding characteristic participles or characteristic keywords in the impurity characteristic phrase template in sequence.

In an implementation manner of the embodiment of the present application, the second determining unit 504 is specifically configured to:

or,

In an implementation manner of the embodiment of the present application, the clearing processing unit 505 is specifically configured to: and recombining each non-impurity paragraph in the data information to obtain the target data information without the impurity paragraph.

In an implementation manner of the embodiment of the present application, referring to the schematic structural diagram of an impurity removing apparatus shown in fig. 6, the apparatus may further include: a filtering unit 506, configured to filter out any one or more of the numbers, letters, and dynamic scripts in the paragraphs before the index value of the at least one predetermined index of the paragraphs is determined.

Because the impurity removing device disclosed in this embodiment corresponds to the impurity removing method disclosed in the embodiments above, its description is relatively brief; for the relevant similarities, refer to the description of the impurity removing method above, which is not repeated here.

In summary, the impurity removing method and apparatus of the present application have the following advantages:

1) impurity identification is performed on paragraphs across multiple dimensions, such as participles/keywords, the order of the participles/keywords, part of speech, the position of the paragraph, and the length of the paragraph, so that impurity paragraphs in data information such as network articles can be identified more comprehensively and effectively, which ensures higher accuracy in identifying and removing impurities from the data information;

2) development is simple and convenient: the structure of the data information (such as the structure of a website article) does not need to be analyzed, and no correspondingly complex implementation code needs to be written. The whole core logic can be developed with a small amount of code (for example, 100 lines), so the scheme does not bring high implementation complexity or implementation cost, does not suffer failed or inaccurate capture of the effective information content when the structure of the data information is upgraded or changed, and does not require high labor cost;

3) impurity identification and removal are efficient: the inventors have verified that the scheme can identify impurities at the millisecond level and can effectively meet the requirement of removing impurity information in scenarios where massive data is captured and published.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.

For convenience of description, the above system or apparatus is described as being divided into various modules or units by function. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.

Finally, it is further noted that, herein, relational terms such as first, second, third, fourth, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. An impurity removing method, comprising:

acquiring data information to be processed;

splitting the data information into at least one paragraph;

determining an index value of at least one predetermined index of the paragraph based on a paragraph feature of the paragraph, wherein the paragraph feature comprises: at least one of the length of the paragraph, the position of the paragraph in the data information, and the words in the paragraph; the index value of each index of the paragraph can be used for representing the impurity degree of the paragraph; the determining of the index value of at least one predetermined index of the paragraph comprises: determining at least one of length weight corresponding to the length of the paragraph, position weight corresponding to the position of the paragraph appearing in the data information, and matching degree information of words in the paragraph and impurity feature words included in each preset impurity feature phrase template;

the determining information of the matching degree of the words in the paragraphs and the impurity feature words included in the preset impurity feature phrase templates comprises: extracting participles and/or keywords in the paragraphs; and performs the following processing: matching the participles in the paragraphs with the characteristic participles included in each impurity characteristic phrase template; calculating the feature word segmentation matching degree of the paragraph and each impurity feature phrase template based on the matching number of the feature words included in the paragraph and each impurity feature phrase template and the total number of the feature words in the impurity feature phrase template; selecting the characteristic word segmentation matching degree with the largest value as the impurity characteristic word segmentation weight of the paragraph; and/or matching the keywords in the paragraphs with the feature keywords included in each impurity feature phrase template; calculating the matching degree of the characteristic keywords of the paragraph and each impurity characteristic phrase template based on the matching number of the paragraph and the characteristic keywords included in each impurity characteristic phrase template and the total number of the characteristic keywords in the impurity characteristic phrase template; selecting the characteristic keyword matching degree with the largest value as the impurity characteristic keyword weight of the paragraph to obtain the matching degree information comprising the impurity characteristic word segmentation weight and/or the impurity characteristic keyword weight;

2. The method of claim 1, wherein the splitting the data information into at least one paragraph comprises:

3. The method of claim 1, wherein the participle or keyword in the paragraph is matched with the feature participle or feature keyword included in the impurity feature phrase template in any one of the following manners:

4. The method of claim 1, wherein determining whether a paragraph is an impurity paragraph according to the index value of the at least one predetermined index of the paragraph comprises:

or,

5. The method according to claim 1, wherein performing impurity paragraph removal processing on the data information based on the impurity paragraph determination result to obtain target data information not including an impurity paragraph comprises:

6. The method according to any one of claims 1 to 5, wherein before determining an index value of at least one predetermined index of a paragraph based on the paragraph feature of the paragraph, further comprising:

7. An impurity removing device, characterized by comprising:

a first determining unit, configured to determine an index value of at least one predetermined index of a paragraph based on a paragraph feature of the paragraph, where the paragraph feature includes: at least one of the length of the paragraph, the position of the paragraph in the data information, and the words in the paragraph; the index value of each index of the paragraph can be used for representing the impurity degree of the paragraph; the first determining unit is specifically configured to: determining at least one of length weight corresponding to the length of the paragraph, position weight corresponding to the position of the paragraph appearing in the data information, and matching degree information of words in the paragraph and impurity feature words included in each preset impurity feature phrase template;

the first determining unit determines matching degree information of words in the paragraphs and impurity feature words included in predetermined impurity feature phrase templates, and specifically includes: extracting participles and/or keywords in the paragraphs; and performs the following processing: matching the participles in the paragraphs with the characteristic participles included in each impurity characteristic phrase template; calculating the feature word segmentation matching degree of the paragraph and each impurity feature phrase template based on the matching number of the feature words included in the paragraph and each impurity feature phrase template and the total number of the feature words in the impurity feature phrase template; selecting the characteristic word segmentation matching degree with the largest value as the impurity characteristic word segmentation weight of the paragraph; and/or matching the keywords in the paragraphs with the feature keywords included in each impurity feature phrase template; calculating the matching degree of the characteristic key words of the paragraph and each impurity characteristic phrase template based on the matching number of the paragraph and the characteristic key words included in each impurity characteristic phrase template and the total number of the characteristic key words in the impurity characteristic phrase template; selecting the characteristic keyword matching degree with the largest value as the impurity characteristic keyword weight of the paragraph to obtain the matching degree information comprising the impurity characteristic word segmentation weight and/or the impurity characteristic keyword weight;

8. The apparatus according to claim 7, wherein the second determining unit is specifically configured to:

or,

9. The apparatus of any of claims 7-8, further comprising: