CN110909118A - Method, apparatus, device and medium for screening information - Google Patents

Publication number
CN110909118A
Authority
CN
China
Prior art keywords
text
information
keywords
screening
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810986442.6A
Other languages
Chinese (zh)
Inventor
马安君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Chongqing Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Chongqing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Chongqing Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201810986442.6A priority Critical patent/CN110909118A/en
Publication of CN110909118A publication Critical patent/CN110909118A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method, an apparatus, a device and a medium for screening information. The method comprises the following steps: preprocessing the titles of text information to obtain content information; performing word segmentation and stop-word filtering on the content information to obtain text entries, and calculating the overall weight of each text entry; sorting the text entries in descending order of overall weight and extracting the first a text entries as text keywords, wherein a is a positive integer smaller than the number of text entries; calculating, according to the relative weights of the text keywords, a matching degree value between the text information containing the text keywords and an initial keyword lexicon; and determining that text information whose matching degree value is greater than a preset threshold belongs to the screening information.

Description

Method, apparatus, device and medium for screening information
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for screening information.
Background
In the face of a large amount of information, it is increasingly important to be able to quickly and accurately screen out the required information.
Currently, collecting, filtering and screening large amounts of information relies mostly on manual processing. In the manual approach, a crawler model is formulated, the required information is crawled from web pages, simple filtering is performed according to specific keywords, and the crawled information is then classified and screened by hand. As a result, the keyword lexicon is often not updated in time and operating errors occur easily, so the accuracy of information screening is low.
Therefore, the technical problem of low accuracy in information screening exists.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a medium for screening information, which can screen information more accurately.
According to an aspect of the embodiments of the present invention, there is provided a method of screening information, the method including:
preprocessing the title of the text information to obtain content information;
performing word segmentation and stop-word filtering on the content information to obtain text entries, and calculating the overall weight of each text entry;
sorting the text entries in descending order of overall weight, and extracting the first a text entries as text keywords, wherein a is a positive integer smaller than the number of text entries;
calculating, according to the relative weights of the text keywords, a matching degree value between the text information containing the text keywords and the initial keyword lexicon;
and determining that text information whose matching degree value is greater than a preset threshold belongs to the screening information.
According to another aspect of the embodiments of the present invention, there is provided an apparatus for screening information, the apparatus including:
the preprocessing module is used for preprocessing the title of the text information to obtain content information;
the weight calculation module is used for performing word segmentation and stop-word filtering on the content information to obtain text entries, and calculating the overall weight of each text entry;
the text entry processing module is used for sorting the text entries in descending order of overall weight and extracting the first a text entries as text keywords, wherein a is a positive integer smaller than the number of text entries;
the matching degree value calculation module is used for calculating, according to the relative weights of the text keywords, a matching degree value between the text information containing the text keywords and the initial keyword lexicon;
and the information classification module is used for determining that text information whose matching degree value is greater than a preset threshold belongs to the screening information.
According to another aspect of the embodiments of the present invention, there is provided an apparatus for screening information, the apparatus including:
a processor and a memory storing computer program instructions;
the processor, when executing the computer program instructions, implements the method of screening information as provided in any aspect of the embodiments of the present invention described above.
According to another aspect of embodiments of the present invention, there is provided a computer storage medium having computer program instructions stored thereon, the computer program instructions when executed by a processor implementing the method of screening information as provided in any one of the aspects of embodiments of the present invention described above.
The embodiment of the invention provides a method, a device, equipment and a medium for screening information. The method has the advantages that the screening range of the text information is narrowed by preprocessing the title of the text information; by calculating the total weight of the keywords in the text information, calculating the relative weight of the text keywords and calculating the matching degree value of the text information where the text keywords are located and the initial keyword lexicon according to the relative weight, the screening result can be more accurate.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 illustrates a flow diagram of a method of screening information in accordance with an embodiment of the present invention;
FIG. 2 shows a flow diagram of a method of screening information according to another embodiment of the invention;
FIG. 3 is a schematic structural diagram of an apparatus for screening information according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus for screening information according to another embodiment of the present invention;
FIG. 5 sets forth a block diagram of an exemplary hardware architecture of computing devices capable of implementing the method and apparatus for screening information according to embodiments of the present invention.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
A method, an apparatus, a device, and a medium for screening information according to embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that these examples are not intended to limit the scope of the present disclosure.
The method of screening information according to an embodiment of the present invention is described in detail below with reference to fig. 1 and 2.
For better understanding of the present invention, the method for screening information according to an embodiment of the present invention is described in detail below with reference to fig. 1, and fig. 1 shows a flowchart of the method for screening information according to an embodiment of the present invention.
As shown in fig. 1, a method 100 for screening information in the embodiment of the present invention includes the following steps:
S110, preprocessing the title of the text information to obtain the content information.
Specifically, the text information may be agricultural news.
As an example, when information is screened, text information is first searched for according to a region keyword, and the titles of the text information matching the region keyword are collected. The region keyword may be the name of a third-level or fourth-level national administrative division. Secondly, among the collected titles, the titles that include keywords in the initial keyword lexicon are determined. Key information of the text information to which the determined titles belong is then collected and stored in a database. The key information may be the title, the release time, the information source and the text information link.
Then, the text information is obtained through the text information links in the database, the content information of the text information is crawled by establishing a crawler model, and the content information is stored in the database. The content information may be the title, abstract, lead, body text and conclusion.
The detailed process of S110 is illustrated below by taking the example that the screening information is the information related to agriculture.
When the screening information is determined to be agriculture-related information, "ten thousand state towns" is first selected as the region keyword, and the text information is screened to obtain 200 pieces of text information that match this region keyword. The titles of these 200 pieces of text information are collected.
Then, the titles that include keywords in the initial keyword lexicon are screened out from the 200 titles, yielding the titles of 50 pieces of text information. Key information of these 50 pieces of text information, for example the title, release time, information source and text information link, is collected and stored in a database.
Next, the corresponding 50 pieces of text information are located through the text information links in the database, and the title, abstract, lead, body text and conclusion of each of them are crawled through a crawler model.
In the embodiment of the invention, the title of the text information is preprocessed through the regional keywords, so that the screening range of the text information is reduced, and invalid information is eliminated; by determining the title including the keywords in the initial keyword lexicon, text information which does not meet the screening requirement is excluded, so that the screening result is more accurate.
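The title preprocessing of S110 can be sketched as a two-stage filter. This is a minimal sketch, not the patented implementation: the `Article` record, the `filter_titles` helper, the sample data and the choice to match the region keyword against the title by substring are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Key information collected for one piece of text information."""
    title: str
    release_time: str
    source: str
    link: str


def filter_titles(articles, region_keyword, lexicon):
    """Keep articles whose title contains the region keyword and at least
    one keyword from the initial keyword lexicon (substring matching is an
    assumption of this sketch)."""
    kept = []
    for article in articles:
        if region_keyword not in article.title:
            continue  # outside the region of interest
        if any(keyword in article.title for keyword in lexicon):
            kept.append(article)
    return kept


# Hypothetical sample data, not taken from the source.
articles = [
    Article("Exampletown farmers begin spring seeding", "2018-03-01",
            "site-a", "https://example.com/1"),
    Article("Exampletown opens a new road", "2018-03-02",
            "site-b", "https://example.com/2"),
    Article("Othertown pesticide prices rise", "2018-03-03",
            "site-c", "https://example.com/3"),
]
lexicon = {"seeding", "pesticide", "weeding"}
selected = filter_titles(articles, "Exampletown", lexicon)
print([article.link for article in selected])
```

Only the links of the surviving articles would then be passed to the crawler, which is what keeps the amount of crawled content small.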
S120, performing word segmentation and stop-word filtering on the content information to obtain text entries, and calculating the overall weight of each text entry.
In an embodiment of the present invention, first, the content information obtained in S110 is subjected to word segmentation and stop-word filtering to obtain text entries.
Secondly, a position weight is set for each preset position, and the word frequency of each text entry at each preset position is counted. The preset positions may be: title, abstract, lead, body text and conclusion.
Finally, the overall weight of each text entry is calculated. The overall weight of a text entry is equal to the sum of its overall position weights over all preset positions, and the overall position weight at a preset position is equal to the product of the position weight of that position and the word frequency of the text entry at that position.
As a specific example, first, the content information obtained in S110 is subjected to word segmentation and stop-word filtering to obtain text entries.
Next, the position weight of the title is set to 5, the position weights of the abstract, the lead and the conclusion are set to 3, and the position weight of the body text is set to 1. Taking "pesticide" as an example of a text entry, its word frequencies in one piece of text information are counted as: title 1, abstract 1, lead 0, body text 5 and conclusion 1.
Finally, since the overall position weight is the product of the position weight of a preset position and the word frequency of the text entry at that position, the text entry "pesticide" has a title overall position weight of 5 × 1 = 5, an abstract overall position weight of 3 × 1 = 3, a lead overall position weight of 3 × 0 = 0, a body text overall position weight of 1 × 5 = 5 and a conclusion overall position weight of 3 × 1 = 3. Summing the overall position weights over all positions gives an overall weight of 5 + 3 + 0 + 5 + 3 = 16 for the text entry "pesticide".
The overall weight of other text entries can be obtained by the above method, and is not described herein again.
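The position-weighted calculation of S120 can be sketched as follows, using the position weights and the "pesticide" word frequencies listed in the example. The function names and the dictionary layout are assumptions of this sketch, and the input is assumed to be already segmented into tokens, since the source does not name a word segmentation tool.

```python
from collections import Counter

# Position weights from the example: title 5; abstract, lead, conclusion 3; body 1.
POSITION_WEIGHTS = {"title": 5, "abstract": 3, "lead": 3, "body": 1, "conclusion": 3}


def term_frequencies_by_position(tokens_by_position, stop_words):
    """Return {term: {position: frequency}} after dropping stop words.

    `tokens_by_position` maps a position name to its already segmented tokens.
    """
    table = {}
    for position, tokens in tokens_by_position.items():
        counts = Counter(token for token in tokens if token not in stop_words)
        for term, frequency in counts.items():
            table.setdefault(term, {})[position] = frequency
    return table


def overall_weight(frequencies, position_weights=POSITION_WEIGHTS):
    """Sum of (position weight x word frequency) over the positions where
    the term occurs."""
    return sum(position_weights[position] * frequency
               for position, frequency in frequencies.items())


# Word frequencies of "pesticide" as listed in the example.
pesticide_frequencies = {"title": 1, "abstract": 1, "lead": 0, "body": 5, "conclusion": 1}
pesticide_weight = overall_weight(pesticide_frequencies)
print(pesticide_weight)
```

Keeping the position weights in a single mapping makes it easy to tune how strongly a title hit counts relative to a body-text hit.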
S130, sorting the text entries in descending order of overall weight, and extracting the first a text entries as text keywords, wherein a is a positive integer smaller than the number of text entries.
As a specific example, the number of text entries obtained in S120 is 200; the 200 text entries are sorted in descending order of overall weight, and the top 10 text entries are extracted as the text keywords.
In the embodiment of the invention, the text entries with higher word frequency in the text information are obtained by extracting the text entries with the top overall weight rank, and the text entries with lower word frequency in the text information are excluded, so that the screening efficiency can be improved.
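A minimal sketch of the ranking in S130 is given below. The `top_keywords` helper and the sample weights are hypothetical; the only constraints taken from the text are the descending order and the requirement that a be a positive integer smaller than the number of text entries.

```python
def top_keywords(overall_weights, a):
    """Return the a entries with the largest overall weight, in descending order.

    `overall_weights` maps each text entry to its overall weight. Python's
    sort is stable, so entries with equal weight keep their input order.
    """
    if not 0 < a < len(overall_weights):
        raise ValueError("a must be a positive integer smaller than the number of entries")
    ranked = sorted(overall_weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:a]


# Hypothetical overall weights for five text entries.
weights = {"tractor": 2, "spring rain": 14, "weather": 1, "transportation": 9, "fertilizing": 8}
keywords = top_keywords(weights, 3)
print(keywords)
```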
S140, calculating, according to the relative weights of the text keywords, the matching degree value between the text information containing the text keywords and the initial keyword lexicon.
Specifically, the relative weight of a text keyword is equal to the ratio of its overall weight to the sum of the overall weights of all the text keywords, and can be calculated by expression (1).
$$p_i = \frac{W_i}{\sum_{j=1}^{a} W_j} \tag{1}$$
where $p_i$ is the relative weight of the i-th text keyword, $W_i$ is the overall weight of the i-th text keyword, and a is the number of extracted text keywords.
The matching degree value of the text information where the text keywords are located and the initial keyword lexicon is equal to the sum of products of the relative weights and the matching degrees of all the text keywords, namely, the matching degree value can be calculated through an expression (2).
$$P = \sum_{i=1}^{a} p_i c_i \tag{2}$$
where $P$ is the matching degree value, $p_i$ is the relative weight of the i-th text keyword, $c_i$ is the matching degree of the i-th text keyword with the initial keyword lexicon, and a is the number of extracted text keywords.
When the initial keyword lexicon comprises text keywords, the matching degree is equal to 1; when the initial keyword thesaurus does not include a text keyword, the degree of matching is equal to 0.
It should be noted that the degree of matching refers to the degree of matching of the text keyword with the initial keyword lexicon.
In one embodiment of the present invention, the number of extracted text keywords is 10, and the extracted text keywords are: "rain", "seeding", "weeding", "fertilizing", "pesticide", "spring rain", "air", "transportation", "economy" and "road repair". The overall weights obtained in S120 for these 10 text keywords are, in the same order: 3, 5, 4, 8, 5, 14, 3, 9, 7 and 6, which sum to 64.
Then, the relative weights of the 10 text keywords calculated by expression (1) are: "rain" 0.046875, "seeding" 0.078125, "weeding" 0.062500, "fertilizing" 0.125000, "pesticide" 0.078125, "spring rain" 0.218750, "air" 0.046875, "transportation" 0.140625, "economy" 0.109375 and "road repair" 0.093750.
It should be noted that the relative weights above are given to six decimal places, which is merely exemplary; the number of retained digits can be set according to actual requirements.
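The relative weights in this example can be reproduced with a short sketch of expression (1). The `relative_weights` function name is an assumption; the keywords and overall weights are those listed in the example.

```python
def relative_weights(overall_weights):
    """Expression (1): each overall weight divided by the sum of all overall weights."""
    total = sum(overall_weights.values())
    return {keyword: weight / total for keyword, weight in overall_weights.items()}


# Overall weights of the 10 text keywords from the example.
overall = {
    "rain": 3, "seeding": 5, "weeding": 4, "fertilizing": 8, "pesticide": 5,
    "spring rain": 14, "air": 3, "transportation": 9, "economy": 7, "road repair": 6,
}
relative = relative_weights(overall)
for keyword, value in relative.items():
    print(f"{keyword}: {value:.6f}")
```

Because the relative weights are normalized by their sum, they always add up to 1, which bounds the matching degree value of expression (2) between 0 and 1.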
Finally, it is judged whether the initial keyword lexicon includes each of the 10 text keywords. "Seeding", "weeding", "fertilizing", "pesticide", "spring rain" and "air" are found in the initial keyword lexicon, so their matching degrees are all 1; "rain", "transportation", "economy" and "road repair" are not found in the initial keyword lexicon, so their matching degrees are all 0.
According to the matching degrees and relative weights of the text keywords, expression (2) gives a matching degree value of 0.078125 + 0.062500 + 0.125000 + 0.078125 + 0.218750 + 0.046875 = 0.609375.
S150, determining that text information whose matching degree value is greater than the preset threshold belongs to the screening information.
In an embodiment of the present invention, the matching degree value obtained in S140 is 0.609375, which is greater than the preset threshold of 0.5, so it may be determined that the text information containing the text keywords belongs to the screening information. It should be noted that the preset threshold may be set in advance according to actual requirements.
In the embodiment of the invention, during information screening, the titles of the text information are preprocessed to screen out the titles that include keywords in the initial keyword lexicon, which narrows the screening range of the text information and excludes invalid information. By calculating the overall weight, the relative weight and the matching degree value, and by determining that text information whose matching degree value is greater than the preset threshold belongs to the screening information, the screening process becomes more rigorous and the screening result more accurate.
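Steps S140 and S150 for this example can be sketched together as below. The function name is an assumption, and `lexicon` lists only the six lexicon entries that the example reveals; the real initial keyword lexicon would contain more keywords.

```python
def matching_degree_value(overall_weights, lexicon):
    """Expression (2): sum of relative weight x matching degree, where the
    matching degree is 1 if the keyword is in the lexicon and 0 otherwise."""
    total = sum(overall_weights.values())
    value = 0.0
    for keyword, weight in overall_weights.items():
        relative_weight = weight / total
        matching_degree = 1 if keyword in lexicon else 0
        value += relative_weight * matching_degree
    return value


# Overall weights of the 10 text keywords from the example.
overall = {
    "rain": 3, "seeding": 5, "weeding": 4, "fertilizing": 8, "pesticide": 5,
    "spring rain": 14, "air": 3, "transportation": 9, "economy": 7, "road repair": 6,
}
# Only the lexicon entries that the example mentions (partial, for illustration).
lexicon = {"seeding", "weeding", "fertilizing", "pesticide", "spring rain", "air"}

PRESET_THRESHOLD = 0.5  # threshold used in the example
value = matching_degree_value(overall, lexicon)
belongs_to_screening_information = value > PRESET_THRESHOLD
print(value, belongs_to_screening_information)
```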
It should be noted that, before S110, the method 100 for screening information in the embodiment of the present invention further includes:
S160, determining an initial keyword lexicon.
In an embodiment of the present invention, the screening information is agriculture-related information. The screening information may also be other information.
When the screening information is determined to be agriculture-related information, first, the titles, information sources and text information links of agriculture-related information are crawled from specific websites according to a crawler model. Next, agricultural keywords are extracted according to semantics through a latent Dirichlet allocation (LDA) model, and the initial keyword lexicon of the agricultural information is determined.
It should be noted that the specific websites may be the China Agricultural Information Network, the China Agriculture Network, the official website of the Ministry of Agriculture of the People's Republic of China, the China Rural Network, the China Agricultural Technology Information Network and the China Rural News Network.
In the embodiment of the invention, the LDA model extracts keywords according to semantics, so that fuzzy matching and semantic matching can be performed, which improves the accuracy of the initial keyword lexicon.
For ease of understanding, fig. 2 shows a flow chart of a method of screening information according to another embodiment of the present invention. The steps in fig. 2 that are the same as in fig. 1 are given the same reference numerals.
As shown in fig. 2, the method 200 for screening information has the same steps as the method 100 for screening information shown in fig. 1, and is not described herein again. The method 200 for screening information in the embodiment of the present invention further includes the following steps:
S210, sorting the text keywords of the screening information in descending order of relative weight.
In an embodiment of the present invention, the screening information is obtained by the method described above, which is not repeated here. After the screening information is obtained, its text keywords are sorted in descending order of relative weight.
As a specific example, sorting the text keywords of the screening information in descending order of relative weight gives: "spring rain", "transportation", "fertilizing", "economy", "road repair", "seeding", "pesticide", "weeding", "rain" and "air".
S220, extracting the top m text keywords from the text keywords of the screening information, wherein m is a positive integer smaller than the number of the text keywords.
As a specific example, the top 5 text keywords are extracted from the text keywords ranked in S210, giving: "spring rain", "transportation", "fertilizing", "economy" and "road repair".
S230, adding the extracted text keywords that are not included in the initial keyword lexicon to the initial keyword lexicon.
As a specific example, first, the top 5 text keywords "spring rain", "transportation", "fertilizing", "economy" and "road repair" are extracted through S220. Next, it is determined that "transportation", "economy" and "road repair" among these 5 text keywords are not included in the initial keyword lexicon. Finally, "transportation", "economy" and "road repair" are added to the initial keyword lexicon.
In the embodiment of the invention, the text keywords of the screening information that rank highest by relative weight are extracted, and those extracted text keywords that are not yet in the initial keyword lexicon are added to it, so that the keyword lexicon is updated automatically and in time, and the screening result is more accurate.
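The lexicon update of S210 to S230 can be sketched as follows, using the keywords and overall weights of the earlier example. The `update_lexicon` helper is hypothetical, and `lexicon` again contains only the six entries that the example reveals.

```python
def update_lexicon(overall_weights, lexicon, m):
    """Rank keywords by relative weight, take the top m, and add those that
    are missing from the lexicon. Returns (updated lexicon, added keywords)."""
    if not 0 < m < len(overall_weights):
        raise ValueError("m must be a positive integer smaller than the number of keywords")
    total = sum(overall_weights.values())
    relative = {keyword: weight / total for keyword, weight in overall_weights.items()}
    # Stable sort: keywords with equal relative weight keep their input order.
    ranked = sorted(relative, key=relative.get, reverse=True)
    added = [keyword for keyword in ranked[:m] if keyword not in lexicon]
    return lexicon | set(added), added


# Overall weights of the text keywords of one piece of screening information.
overall = {
    "rain": 3, "seeding": 5, "weeding": 4, "fertilizing": 8, "pesticide": 5,
    "spring rain": 14, "air": 3, "transportation": 9, "economy": 7, "road repair": 6,
}
# Partial lexicon, limited to the entries mentioned in the example.
lexicon = {"seeding", "weeding", "fertilizing", "pesticide", "spring rain", "air"}

updated_lexicon, added = update_lexicon(overall, lexicon, m=5)
print(added)
```

Returning a new set instead of mutating `lexicon` in place is a design choice of this sketch; it lets the caller compare the lexicon before and after an update.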
The apparatus for screening information according to an embodiment of the present invention, which corresponds to the method for screening information, is described in detail below with reference to fig. 3 and 4.
Fig. 3 is a schematic structural diagram of an apparatus of a method of screening information according to an embodiment of the present invention.
As shown in fig. 3, the apparatus 300 for filtering information includes:
the preprocessing module 310 is configured to preprocess the title of the text information to obtain content information;
the weight calculation module 320 is configured to perform word segmentation and stop-word filtering on the content information to obtain text entries, and to calculate the overall weight of each text entry;
the text entry processing module 330 is configured to sort the text entries in descending order of overall weight and extract the first a text entries as text keywords, where a is a positive integer smaller than the number of text entries;
the matching degree value calculation module 340 is configured to calculate, according to the relative weights of the text keywords, a matching degree value between the text information containing the text keywords and the initial keyword lexicon;
and the information classification module 350 is configured to determine that text information whose matching degree value is greater than the preset threshold belongs to the screening information.
In the apparatus for screening information of the above embodiment, the preprocessing module 310 preprocesses the titles of the text information to screen out the titles that include keywords in the initial keyword lexicon, so that the screening range of the text information is narrowed and invalid information is excluded;
the weight calculation module 320, the text entry processing module 330, the matching degree value calculation module 340 and the information classification module 350 together make the screening process more rigorous and the screening result more accurate.
In one embodiment of the present invention, the apparatus 300 for screening information further comprises:
an initial keyword lexicon determining module 360, configured to determine an initial keyword lexicon.
The initial keyword lexicon determining module 360 is specifically configured to crawl the titles, information sources and text information links of the text information through a crawler model, and to extract keywords according to semantics through an LDA model to determine the initial keyword lexicon.
In the initial keyword lexicon determining module 360, keywords are extracted according to semantics through the LDA model, so that fuzzy matching and semantic matching can be performed, which improves the accuracy of the initial keyword lexicon.
In an embodiment of the present invention, the preprocessing module 310 is specifically configured to search for text information according to the region keyword and collect the titles of the text information. The region keyword may be the name of a third-level or fourth-level national administrative division.
Among the collected titles, the titles that include keywords in the initial keyword lexicon are determined.
Key information of the text information to which the determined titles belong is collected. The key information may be the title, release time, information source and text information link.
The content information of the text information to which the key information belongs is obtained through the text information link. The content information may be the title, abstract, lead, body text and conclusion.
In the preprocessing module 310, the text information is screened through the regional keywords, so that the screening range of the text information is reduced, and invalid information is eliminated; by determining the title including the keywords in the initial keyword lexicon, text information which does not meet the screening requirement is excluded, so that the screening result is more accurate.
In an embodiment of the present invention, the weight calculation module 320 is specifically configured to, first, perform word segmentation and stop-word filtering on the content information to obtain text entries. Secondly, a position weight is set for each preset position, and the word frequency of each text entry at each preset position is counted, where the preset positions may be the title, abstract, lead, body text and conclusion.
Finally, the overall weight of each text entry is calculated. The overall weight of a text entry is equal to the sum of its overall position weights over all preset positions, and the overall position weight at a preset position is equal to the product of the position weight of that position and the word frequency of the text entry at that position.
In an embodiment of the present invention, the matching degree value calculating module 340 is specifically configured to calculate a matching degree value between the text information where the text keyword is located and the initial keyword lexicon according to the relative weight.
The matching degree value of the text information where the text keywords are located and the initial keyword lexicon is equal to the sum of products of the relative weight of all the text keywords and the matching degree. And the relative weight of the text keywords is equal to the ratio of the total weight of the text keywords to the sum of the total weights of all the text keywords. It should be noted that the degree of matching refers to the degree of matching of the text keyword with the initial keyword lexicon.
When the initial keyword lexicon comprises text keywords, the matching degree of the text keywords and the initial keyword lexicon is equal to 1; when the initial keyword thesaurus does not include the text keyword, the matching degree of the text keyword and the initial keyword thesaurus is equal to 0.
Fig. 4 is a schematic structural diagram of an apparatus 400 for screening information according to another embodiment of the present invention.
Like reference numerals are used for like blocks in FIG. 4 and FIG. 3. As shown in FIG. 4, the apparatus 400 for screening information is substantially the same as the apparatus 300 for screening information shown in FIG. 3, except that the apparatus 400 for screening information further includes:
a relative weight value sorting module 410, configured to sort the text keywords of the screening information in descending order of relative weight;
a secondary text keyword extraction module 420, configured to extract the first m text keywords from the text keywords of the screening information, where m is a positive integer smaller than the number of text keywords; and
a keyword thesaurus updating module 430, configured to add those extracted text keywords that are not in the initial keyword lexicon to the initial keyword lexicon.
With the apparatus for screening information of this embodiment, the text keywords of the screening information that rank highest by relative weight are extracted, and those extracted keywords that are not already in the initial keyword lexicon are added to it, so that the keyword lexicon is updated promptly and automatically and the screening result becomes more accurate.
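The lexicon update performed by modules 410 to 430 can be sketched as below. This is a hedged Python illustration only: the function name `update_lexicon`, the use of a `set` for the lexicon, and the sample data are assumptions and do not come from the patent.

```python
def update_lexicon(keyword_relative_weights, lexicon, m):
    """Add the top-m keywords of a screened text that the lexicon lacks.

    keyword_relative_weights maps each text keyword of the screening
    information to its relative weight; lexicon is a set updated in place.
    Returns the list of keywords that were newly added.
    """
    # Sort the keywords in descending order of relative weight.
    ranked = sorted(keyword_relative_weights,
                    key=keyword_relative_weights.get,
                    reverse=True)
    # Extract the first m keywords.
    top_m = ranked[:m]
    # Keep only the extracted keywords not yet in the lexicon, then add them.
    added = [keyword for keyword in top_m if keyword not in lexicon]
    lexicon.update(added)
    return added


relative = {"broadband": 0.5, "outage": 0.3, "fault": 0.15, "weather": 0.05}
lexicon = {"broadband", "fault"}
added = update_lexicon(relative, lexicon, m=2)
```

In this example the two highest-ranked keywords are "broadband" and "outage"; only "outage" is new, so it alone is added to the lexicon.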
FIG. 5 is a block diagram of an exemplary hardware architecture of a computing device capable of implementing the method and apparatus for screening information according to embodiments of the present invention.
As shown in fig. 5, computing device 500 includes an input device 501, an input interface 502, a central processor 503, a memory 504, an output interface 505, and an output device 506. The input interface 502, the central processing unit 503, the memory 504, and the output interface 505 are connected to each other through a bus 510, and the input device 501 and the output device 506 are connected to the bus 510 through the input interface 502 and the output interface 505, respectively, and further connected to other components of the computing device 500.
Specifically, the input device 501 receives input information from the outside and transmits the input information to the central processor 503 through the input interface 502; the central processor 503 processes input information based on computer-executable instructions stored in the memory 504 to generate output information, temporarily or permanently stores the output information in the memory 504, and then transmits the output information to the output device 506 through the output interface 505; output device 506 outputs the output information outside of computing device 500 for use by a user.
That is, the computing device shown in FIG. 5 may also be implemented as a device for screening information, which may include: a memory storing computer-executable instructions; and a processor which, when executing the computer-executable instructions, may implement the method and apparatus for screening information described in connection with FIGS. 1-4.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium has computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement the method of screening information provided by embodiments of the present invention.
It is to be understood that the invention is not limited to the specific configurations and processes described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions, or change the order of the steps, after comprehending the spirit of the present invention. The functional blocks shown in the above-described structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they may be, for example, electronic circuits, Application Specific Integrated Circuits (ASICs), suitable firmware, plug-ins, function cards, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the Internet or an intranet.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. For example, the algorithms described in the specific embodiments may be modified without departing from the basic spirit of the invention. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (10)

1. A method of screening information, comprising:
preprocessing the title of the text information to obtain content information;
performing word segmentation and stop-word filtering on the content information to obtain text entries, and calculating the overall weight of each text entry;
sorting the text entries in descending order of overall weight, and extracting the first a text entries as text keywords, wherein a is a positive integer smaller than the number of the text entries;
calculating a matching degree value of the text information where the text keywords are located and an initial keyword lexicon according to the relative weight of the text keywords;
and determining that the text information whose text keywords yield a matching degree value larger than a preset threshold belongs to screening information.
2. The method of claim 1, wherein the preprocessing the title of the text information to obtain the content information further comprises:
determining the initial keyword lexicon.
3. The method of claim 1, wherein preprocessing the title of the text information to obtain the content information comprises:
searching for text information according to regional keywords, and acquiring the titles of the text information, wherein the regional keywords comprise: names of national third-level and fourth-level administrative divisions;
determining, among the titles of the text information, the titles that contain keywords in the initial keyword lexicon;
collecting key information of the text information where the determined title is located, wherein the key information comprises: title, release time, information source and text information link;
and obtaining, through the text information link, content information of the text information to which the key information belongs, wherein the content information comprises: title, abstract, reading guide, body text and conclusion.
4. The method of claim 1, wherein said calculating an overall weight of said text entries comprises:
setting a position weight for each preset position, and counting the word frequency of the text entry at each preset position, wherein the preset positions comprise: title, abstract, reading guide, body text and conclusion;
the overall weight of the text entry is equal to the sum of the total position weights of the text entry at all preset positions, and the total position weight is equal to the product of the position weight corresponding to one preset position where the text entry is located and the word frequency of the text entry at that preset position.
5. The method of claim 1, wherein the relative weight of the text keywords is equal to a ratio of the overall weight of the text keywords to a sum of the overall weights of all the text keywords.
6. The method for screening information according to claim 1, wherein said calculating a matching degree value between the text information where the text keyword is located and an initial keyword lexicon according to the relative weight value comprises:
the matching degree value of the text information where the text keywords are located and the initial keyword lexicon is equal to the sum of products of the relative weight of all the text keywords and the matching degree;
wherein the matching degree comprises:
when the initial keyword lexicon comprises the text keywords, the matching degree of the text keywords and the initial keyword lexicon is equal to 1;
and when the initial keyword lexicon does not comprise the text keywords, the matching degree of the text keywords and the initial keyword lexicon is equal to 0.
7. The method for screening information according to claim 1, wherein after determining that the text information where the text keyword with the matching degree value greater than the preset threshold value is located belongs to the screening information, the method further comprises:
sorting the text keywords of the screening information in descending order of relative weight;
extracting the first m text keywords from the text keywords of the screening information, wherein m is a positive integer smaller than the number of the text keywords;
and adding those extracted text keywords that are not in the initial keyword lexicon to the initial keyword lexicon.
8. An apparatus for screening information, comprising:
the preprocessing module is used for preprocessing the title of the text information to obtain content information;
the weight calculation module is used for performing word segmentation and stop-word filtering on the content information to obtain text entries and calculating the overall weight of each text entry;
the text entry processing module is used for sorting the text entries in descending order of overall weight and extracting the first a text entries as text keywords, wherein a is a positive integer smaller than the number of the text entries;
the matching degree value calculation module is used for calculating the matching degree value of the text information where the text keywords are located and the initial keyword lexicon according to the relative weight of the text keywords;
and the information classification module is used for determining that the text information whose text keywords yield a matching degree value larger than a preset threshold belongs to screening information.
9. An apparatus for screening information, the apparatus comprising: a processor and a memory storing computer program instructions;
the processor, when executing the computer program instructions, implements a method of screening information according to any one of claims 1-7.
10. A computer storage medium having computer program instructions stored thereon, which when executed by a processor, implement a method of screening information according to any one of claims 1-7.
CN201810986442.6A 2018-08-28 2018-08-28 Method, apparatus, device and medium for screening information Pending CN110909118A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810986442.6A CN110909118A (en) 2018-08-28 2018-08-28 Method, apparatus, device and medium for screening information

Publications (1)

Publication Number Publication Date
CN110909118A true CN110909118A (en) 2020-03-24

Family

ID=69812130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810986442.6A Pending CN110909118A (en) 2018-08-28 2018-08-28 Method, apparatus, device and medium for screening information

Country Status (1)

Country Link
CN (1) CN110909118A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541064A (en) * 2020-12-09 2021-03-23 联仁健康医疗大数据科技股份有限公司 Health evaluation method and device, computer equipment and storage medium
CN115409035A (en) * 2022-06-02 2022-11-29 北京金堤科技有限公司 Conversation information acquisition method and device, storage medium and electronic equipment
CN116306621A (en) * 2023-05-24 2023-06-23 北京拓普丰联信息科技股份有限公司 Violation detection method and device for bidding text and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007179490A (en) * 2005-12-28 2007-07-12 Research Organization Of Information & Systems Information resource retrieval device, information resource retrieval method and information resource retrieval program
CN101089843A (en) * 2006-06-15 2007-12-19 王刘忠 Search method only for product or service supply information
KR100818742B1 (en) * 2007-08-09 2008-04-02 이종경 Search methode using word position data
CN101216842A (en) * 2008-01-07 2008-07-09 华为技术有限公司 Method for obtaining page key words and page information processing apparatus
CN103294820A (en) * 2013-06-14 2013-09-11 广东电网公司电力科学研究院 WEB page classifying method and system based on semantic extension
CN105138558A (en) * 2015-07-22 2015-12-09 山东大学 User access content-based real-time personalized information collection method
CN105630769A (en) * 2015-12-24 2016-06-01 东软集团股份有限公司 Document subject term extraction method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541064A (en) * 2020-12-09 2021-03-23 联仁健康医疗大数据科技股份有限公司 Health evaluation method and device, computer equipment and storage medium
CN115409035A (en) * 2022-06-02 2022-11-29 北京金堤科技有限公司 Conversation information acquisition method and device, storage medium and electronic equipment
CN116306621A (en) * 2023-05-24 2023-06-23 北京拓普丰联信息科技股份有限公司 Violation detection method and device for bidding text and electronic equipment
CN116306621B (en) * 2023-05-24 2023-08-04 北京拓普丰联信息科技股份有限公司 Violation detection method and device for bidding text and electronic equipment

Similar Documents

Publication Publication Date Title
CN106599278B (en) Application search intention identification method and device
CN102411563B (en) Method, device and system for identifying target words
US7409404B2 (en) Creating taxonomies and training data for document categorization
CN103605665B (en) Keyword based evaluation expert intelligent search and recommendation method
CN106599054B (en) Method and system for classifying and pushing questions
CN110543595B (en) In-station searching system and method
CN107885793A (en) A kind of hot microblog topic analyzing and predicting method and system
CN111797239B (en) Application program classification method and device and terminal equipment
CN106156372B (en) A kind of classification method and device of internet site
CN107291895B (en) Quick hierarchical document query method
CN109558587B (en) Method for classifying public opinion tendency recognition aiming at category distribution imbalance
CN110737821B (en) Similar event query method, device, storage medium and terminal equipment
CN104484380A (en) Personalized search method and personalized search device
CN103577478A (en) Web page pushing method and system
CN105893362A (en) A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points
CN102646100A (en) Domain term obtaining method and system
CN114757302A (en) Clustering method system for text processing
CN110909118A (en) Method, apparatus, device and medium for screening information
CN112149422A (en) Enterprise news dynamic monitoring method based on natural language
CN110795573A (en) Method and device for predicting geographic position of webpage content
CN107169020B (en) directional webpage collecting method based on keywords
CN111475464A (en) Method for automatically discovering and mining fingerprints of Web component
CN111400495A (en) Video bullet screen consumption intention identification method based on template characteristics
CN115640439A (en) Method, system and storage medium for network public opinion monitoring
CN112883191B (en) Agricultural entity automatic identification classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200324
