CN112364625A

CN112364625A - Text screening method, device, equipment and storage medium

Info

Publication number: CN112364625A
Application number: CN202011302193.8A
Authority: CN
Inventors: 董润华
Original assignee: OneConnect Financial Technology Co Ltd Shanghai
Current assignee: OneConnect Smart Technology Co Ltd; OneConnect Financial Technology Co Ltd Shanghai
Priority date: 2020-11-19
Filing date: 2020-11-19
Publication date: 2021-02-12
Also published as: WO2022105497A1

Abstract

The invention relates to big-data processing technology and provides a text screening method, apparatus, device and storage medium. The method includes: performing a word segmentation operation on a first text to be screened to obtain a plurality of segmented words; extracting keywords with preset parts of speech; assigning weights to the segmented words and the keywords; calculating hash values of the segmented words and the keywords; obtaining weight vectors of the segmented words and weight vectors of the keywords from the hash values and the weights; accumulating the weight vectors to obtain a first weight vector and a second weight vector of the text; performing dimensionality reduction on the first weight vector and the second weight vector to obtain a first simhash value and a second simhash value of the text; calculating a first distance value between the first simhash value and a third simhash value of a target text; when the first distance value is larger than a first preset value, calculating a second distance value between the second simhash value and the third simhash value; and screening out the first text when the second distance value is smaller than or equal to a second preset value. The invention can deduplicate abstract or summary texts.

Description

Text screening method, device, equipment and storage medium

Technical Field

The invention relates to the technical field of data processing of big data, in particular to a text screening method, a text screening device, text screening equipment and a storage medium.

Background

When crawling texts, a crawler needs to deduplicate crawled texts that are identical or extremely similar. Most deduplication operations generate a fingerprint from the URL and place the fingerprint in a set for duplicate removal. In practice, however, one text may be republished by multiple websites, so the fingerprints differ while the text content is the same. Moreover, when the crawled text is an abstract or a summary of another text, such abstract or summary text is difficult to deduplicate.

Disclosure of Invention

In view of the above, the present invention provides a text screening method, apparatus, device and storage medium, which aim to solve the technical problem that abstract or summary texts are difficult to deduplicate in the prior art.

In order to achieve the above object, the present invention provides a text screening method, including:

performing word segmentation operation on a first text to be screened to obtain a plurality of segmented words, extracting keywords with preset parts of speech from the segmented words, and distributing associated weights to the segmented words and the keywords;

calculating first hash values of the segmented words and the keywords, performing a weighting operation based on the first hash value and the weight of each segmented word to obtain a weight vector of each segmented word, and performing a weighting operation based on the first hash value and the weight of each keyword to obtain a weight vector of each keyword;

accumulating the weight vectors of all the segmented words to obtain a first weight vector of the first text, accumulating the weight vectors of all the keywords to obtain a second weight vector of the first text, and performing a dimensionality reduction operation on the first weight vector and the second weight vector to respectively obtain a first simhash value and a second simhash value of the first text;

and calculating a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space, calculating a second distance value between the second simhash value and the third simhash value when the first distance value is larger than a first preset value, and screening out the first text when the second distance value is smaller than or equal to a second preset value.

Preferably, the extracting the keywords with the preset part of speech from the plurality of segmented words includes:

calculating the word frequency of each segmented word in the first text, calculating the IDF value and the TF value of each segmented word based on the word frequency, multiplying the IDF value of each segmented word by its corresponding TF value to obtain the TF-IDF value of each segmented word, judging whether the first text contains more than a preset number of keywords with the preset parts of speech, and if so, selecting the keywords with the preset parts of speech based on the TF-IDF value of each segmented word, wherein the keywords with the preset parts of speech comprise noun keywords and verb keywords.

Preferably, the determining whether the first text contains the keywords with the preset parts of speech, where the number of the keywords is greater than a preset number, includes:

and when it is judged that the first text does not contain more than the preset number of keywords with the preset parts of speech, screening out the first text, and randomly acquiring a text from a preset storage space as the first text to be screened so as to perform the word segmentation operation again.

Preferably, the method further comprises:

and when the first distance value is smaller than or equal to the first preset value, screening out the first text.

Preferably, the method further comprises:

and when the second distance value is larger than the second preset value, storing the first text to a text set to which the preset target text belongs.

Preferably, when the first distance value is greater than a first preset value, the method further includes:

and calculating a third distance value between the first simhash value and a fourth simhash value of the target text, and screening out the first text when the third distance value is smaller than or equal to a third preset value.

Preferably, the performing a word segmentation operation on the first text to be filtered to obtain a plurality of words comprises:

matching the read text with the word stock according to a forward maximum matching method to obtain a first matching result, wherein the first matching result comprises a first number of first word groups and a second number of single words;

matching the read text with the word stock according to a reverse maximum matching method to obtain a second matching result, wherein the second matching result comprises a third number of second word groups and a fourth number of single words;

if the first number is equal to the third number and the second number is smaller than or equal to the fourth number, or if the first number is smaller than the third number, taking the first matching result as the word segmentation result of the first text; and if the first number is equal to the third number and the second number is greater than the fourth number, or if the first number is greater than the third number, taking the second matching result as the word segmentation result of the first text.

In order to achieve the above object, the present invention further provides a text screening apparatus, including:

an extraction module, configured to perform a word segmentation operation on a first text to be screened to obtain a plurality of segmented words, extract keywords with preset parts of speech from the plurality of segmented words, and assign associated weights to the segmented words and the keywords;

a weighting module, configured to calculate a first hash value of each segmented word and each keyword, perform a weighting operation based on the first hash value and the weight of each segmented word to obtain a weight vector of each segmented word, and perform a weighting operation based on the first hash value and the weight of each keyword to obtain a weight vector of each keyword;

a dimension reduction module, configured to accumulate the weight vectors of all the segmented words to obtain a first weight vector of the first text, accumulate the weight vectors of all the keywords to obtain a second weight vector of the first text, and perform a dimensionality reduction operation on the first weight vector and the second weight vector to respectively obtain a first simhash value and a second simhash value of the first text;

a screening module, configured to calculate a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space, calculate a second distance value between the second simhash value and the third simhash value when the first distance value is larger than a first preset value, and screen out the first text when the second distance value is smaller than or equal to a second preset value.

In order to achieve the above object, the present invention also provides an electronic device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores a program executable by the at least one processor to enable the at least one processor to perform any of the steps of the text screening method as described above.

To achieve the above object, the present invention further provides a computer-readable storage medium, which includes a storage data area and a storage program area, the storage data area stores data created according to the use of the blockchain node, the storage program area stores a text filtering program, and the text filtering program, when executed by a processor, implements any of the steps of the text filtering method as described above.

According to the text screening method, apparatus, device and storage medium provided by the invention, a word segmentation operation is performed on a first text, keywords with preset parts of speech are extracted from the first text, and associated weights are assigned to the segmented words and the keywords. Weight vectors corresponding to the segmented words and to the keywords are calculated from a hash function and the weights, and the corresponding simhash values are obtained from them. The distances between these simhash values and the simhash value of a target text are then compared with preset values, and the similarity of the texts is judged from the distances, which improves the identification of text similarity. When the first text is an abstract or a summary of another text, combining the simhash value of the keywords with that of the segmented words allows the abstract or summary text to be deduplicated accurately.

Drawings

FIG. 1 is a diagram of an electronic device according to a preferred embodiment of the present invention;

FIG. 2 is a block diagram of a preferred embodiment of the text screening apparatus of FIG. 1;

FIG. 3 is a flow chart of a preferred embodiment of a text screening method of the present invention;

The implementation of the objects, the functional features and the advantages of the present invention will be further explained with reference to the accompanying drawings and the embodiments.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic diagram of an electronic device 1 according to a preferred embodiment of the invention.

The electronic device 1 includes but is not limited to: memory 11, processor 12, display 13, and network interface 14. The electronic device 1 is connected to a network through a network interface 14 to obtain raw data. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a Global System for Mobile communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth (Bluetooth), Wi-Fi, or a communication network.

The memory 11 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk or a memory of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the electronic device 1. Of course, the memory 11 may also comprise both an internal storage unit and an external storage device of the electronic device 1. In this embodiment, the memory 11 is generally used for storing an operating system installed in the electronic device 1 and various application software, such as the program code of the text filtering program 10. Further, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 12 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is typically used for controlling the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or to process data, for example, to run the program code of the text filtering program 10.

The display 13 may be referred to as a display screen or display unit. In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch screen, or the like. The display 13 is used for displaying information processed in the electronic device 1 and for displaying a visual work interface, e.g. displaying the results of data statistics.

The network interface 14 may optionally comprise a standard wired interface, a wireless interface (e.g. WI-FI interface), the network interface 14 typically being used for establishing a communication connection between the electronic device 1 and other electronic devices.

FIG. 1 shows only the electronic device 1 having the components 11-14 and the text filtering program 10, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.

Optionally, the electronic device 1 may further comprise a user interface, the user interface may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further comprise a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch screen, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.

The electronic device 1 may further include a Radio Frequency (RF) circuit, a sensor, an audio circuit, and the like, which are not described in detail herein.

In the above embodiment, the processor 12, when executing the text filtering program 10 stored in the memory 11, may implement the following steps:

The storage device may be the memory 11 of the electronic device 1, or may be another storage device communicatively connected to the electronic device 1.

For a detailed description of the above steps, please refer to the following description of fig. 2 regarding a functional block diagram of an embodiment of the text filtering apparatus 100 and fig. 3 regarding a flowchart of an embodiment of the text filtering method.

Referring to fig. 2, a functional block diagram of the text filtering apparatus 100 according to the present invention is shown.

The text filtering apparatus 100 according to the present invention may be installed in an electronic device. According to the implemented functions, the text filtering apparatus 100 may include an extraction module 110, a weighting module 120, a dimension reduction module 130, and a screening module 140. A module according to the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.

In the present embodiment, the functions regarding the respective modules/units are as follows:

The extraction module 110 is configured to perform a word segmentation operation on the first text to be screened to obtain a plurality of segmented words, extract keywords with preset parts of speech from the plurality of segmented words, and assign an associated weight to each segmented word and each keyword.

In this embodiment, the technical solution is described using the example of a crawler that needs to deduplicate identical or extremely similar texts while crawling; it should be understood that the application scenario of the technical solution is not limited to this. When a text is crawled, it is necessary to judge whether it is similar or identical to a previously crawled text, and if so, the text can be screened out. Specifically, when a first text to be deduplicated is obtained, a word segmentation operation is performed on the first text to obtain a plurality of segmented words, and keywords with preset parts of speech are extracted from the plurality of segmented words, where the keywords with preset parts of speech may be noun keywords and verb keywords. Associated weights are then assigned to the segmented words and the keywords; the weights may be assigned according to the number of occurrences of each segmented word.

For example, the first text contains the sentence "July, author of the CSDN blog 'The Method of Structures, the Way of Algorithms'". After word segmentation, a weight is assigned to each segmented word: CSDN (4), blog (5), structure (3), [function word] (1), method (2), algorithm (3), [function word] (1), way (2), [function word] (1), author (5), July (5). The number in brackets represents the importance of the word in the whole sentence; the larger the number, the more important the word.

In one embodiment, the extracting the keywords of the preset part of speech from the plurality of segmented words includes:

The number of occurrences of each word in the first text is counted, the IDF (inverse document frequency) value of each word is calculated, and the TF (term frequency) value of each word in the first text is calculated. The IDF value is multiplied by the TF value to obtain the TF-IDF value of the word. The TF-IDF value measures the importance of the word to the first text: the larger the TF-IDF value, the more important the word and the higher its priority, so the words with the highest TF-IDF values can be used as the keywords of the first text. It is then judged whether the first text contains more than a preset number (for example, 20) of keywords with the preset parts of speech, and if so, the noun keywords and verb keywords whose TF-IDF values rank in the top 20 are selected as the keywords with the preset parts of speech of the first text.

Further, when it is judged that the first text does not have the keywords with the preset parts of speech with the number larger than the preset number, the first text is screened out, and one text is randomly acquired from a preset storage space and used as the first text to be screened to perform word segmentation again.

When the number of keywords with the preset parts of speech in the first text is less than 20, the first text is deleted, and a text is randomly acquired from the preset storage space as the first text to be screened so that the word segmentation operation is performed again. Deleting texts with insufficient keywords (that is, texts without distinctive features) avoids performing further operations such as hashing and dimensionality reduction on them, which improves the deduplication speed for massive numbers of texts.
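The keyword selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names, the part-of-speech tags `"n"` and `"v"`, the representation of the reference corpus as a list of token sets, and the smoothed IDF formula are all hypothetical choices, since the description does not specify them.

```python
import math
from collections import Counter


def tf_idf_scores(tokens, corpus):
    """Return {word: TF-IDF} for one tokenized text.

    corpus is a list of token sets (assumed reference documents).
    The smoothed IDF log((1 + N) / (1 + df)) + 1 is an assumption;
    the description only says that TF and IDF are multiplied.
    """
    counts = Counter(tokens)
    total = len(tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        doc_freq = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        scores[word] = tf * idf
    return scores


def select_keywords(tagged_tokens, corpus, preset_count=20, allowed_pos=("n", "v")):
    """Pick the top noun/verb keywords by TF-IDF.

    tagged_tokens is a list of (word, pos) pairs. Returns None when the
    text does not contain more than preset_count candidate keywords,
    which corresponds to the case where the text is screened out.
    """
    tokens = [word for word, _ in tagged_tokens]
    scores = tf_idf_scores(tokens, corpus)
    candidates = {word for word, pos in tagged_tokens if pos in allowed_pos}
    if len(candidates) <= preset_count:
        return None
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(candidates, key=lambda w: (-scores[w], w))
    return ranked[:preset_count]
```

For instance, with a toy text tagged as `[("stock", "n"), ("rise", "v"), ("the", "x"), ("stock", "n"), ("market", "n")]` and `preset_count=2`, the sketch keeps the two highest-scoring noun or verb candidates and ignores the function word.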

In one embodiment, performing a word segmentation operation on a first text to be filtered to obtain a plurality of words comprises:

By performing segmentation matching in the forward direction and the reverse direction at the same time, the matching result with fewer single characters and more phrases is found and used as the segmentation result of the segmented sentence, which can improve segmentation accuracy.
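A minimal sketch of bidirectional maximum matching is given below. It is an illustration only: the function names, the `max_len` window and the toy lexicon are hypothetical, and the choice between the two results follows the comparison rule stated in the summary of the invention (the result with fewer word groups is taken; on a tie, the result with fewer single characters is taken, with the forward result preferred when both counts tie). Other orderings of these criteria are possible.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation scanning left to right."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def count_groups_and_singles(words):
    """Return (number of multi-character word groups, number of single characters)."""
    singles = sum(1 for w in words if len(w) == 1)
    return len(words) - singles, singles


def segment(text, lexicon, max_len=4):
    """Run both matchers and keep one result according to the comparison rule."""
    fwd = forward_max_match(text, lexicon, max_len)
    rev = reverse_max_match(text, lexicon, max_len)
    fwd_groups, fwd_singles = count_groups_and_singles(fwd)
    rev_groups, rev_singles = count_groups_and_singles(rev)
    if fwd_groups < rev_groups or (fwd_groups == rev_groups and fwd_singles <= rev_singles):
        return fwd
    return rev
```

With the toy lexicon `{"ab", "bcd"}` and the text `"abcd"`, the forward pass yields `ab / c / d` and the reverse pass yields `a / bcd`; both contain one word group, so the reverse result, which has fewer single characters, is kept.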

The weighting module 120 is configured to calculate a first hash value of each segmented word and each keyword, perform a weighting operation based on the first hash value and the weight of each segmented word to obtain a weight vector of each segmented word, and perform a weighting operation based on the first hash value and the weight of each keyword to obtain a weight vector of each keyword.

In this embodiment, a hash function may be used to calculate the first hash value of each segmented word and the first hash value of each keyword, where the first hash value is an n-bit signature composed of the binary digits "0" and "1". For example, the hash value Hash(CSDN) of "CSDN" is "100101", and the hash value Hash(blog) of "blog" is "101011". A weighting operation is then performed on the hash value of each segmented word and its corresponding weight to obtain the weight vector of the segmented word, and on the hash value of each keyword and its corresponding weight to obtain the weight vector of the keyword.

Specifically, each segmented word and keyword is weighted on the basis of its first hash value, that is, W = Hash × weight: for each bit of the hash value, a 1 contributes the positive weight and a 0 contributes the negative weight. For example, weighting the hash value "100101" of "CSDN" with the weight 4 gives the weight vector W(CSDN) = 100101 × 4 = (4, -4, -4, 4, -4, 4), and weighting the hash value "101011" of "blog" with the weight 5 gives W(blog) = 101011 × 5 = (5, -5, 5, -5, 5, 5). The remaining segmented words and keywords are processed in the same way.
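The weighting step can be sketched as below. The helper `token_hash_bits` is a hypothetical stand-in: the description does not name a hash function, so MD5 truncated to `n_bits` is used here purely as an assumption. The function `weight_vector` reproduces the worked CSDN and blog examples when given the 6-bit hash strings from the text.

```python
import hashlib


def token_hash_bits(token, n_bits=64):
    """Return an n-bit binary signature string for a token.

    MD5 is an assumed choice; any hash that spreads bits well would do.
    n_bits must not exceed 128, the size of an MD5 digest.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") >> (128 - n_bits)
    return format(value, "0{}b".format(n_bits))


def weight_vector(hash_bits, weight):
    """Map each '1' bit to +weight and each '0' bit to -weight."""
    return [weight if bit == "1" else -weight for bit in hash_bits]


# Worked examples from the description, using its 6-bit hash values.
w_csdn = weight_vector("100101", 4)
w_blog = weight_vector("101011", 5)
```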

The dimension reduction module 130 is configured to accumulate the weight vectors of the segmented words to obtain a first weight vector of the first text, accumulate the weight vectors of the keywords to obtain a second weight vector of the first text, and perform dimensionality reduction on the first weight vector and the second weight vector to obtain a first simhash value and a second simhash value of the first text, respectively.

In this embodiment, the weight vectors of the segmented words are accumulated position by position to obtain a sequence that serves as the first weight vector of the first text, and the weight vectors of the keywords are accumulated in the same way to obtain the second weight vector of the first text. For example, accumulating the weight vector (4, -4, -4, 4, -4, 4) of "CSDN" and the weight vector (5, -5, 5, -5, 5, 5) of "blog" gives (4+5, -4-5, -4+5, 4-5, -4+5, 4+5) = (9, -9, 1, -1, 1, 9).

A dimensionality reduction operation is then performed on the first weight vector and the second weight vector, mapping each high-dimensional feature vector to a low-dimensional one so as to improve processing speed, and yielding the first simhash value and the second simhash value of the first text. The first simhash value is the simhash value corresponding to the segmented words of the first text, and the second simhash value is the simhash value corresponding to the keywords of the first text. Specifically, each component of a weight vector of the first text is set to 1 if it is greater than 0 and to 0 otherwise. For example, reducing the accumulated vector (9, -9, 1, -1, 1, 9) in this way gives the simhash value "101011".
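The accumulation and dimensionality reduction steps can be sketched as follows; the function names are hypothetical, and the input vectors are the two example weight vectors from the description.

```python
def accumulate(vectors):
    """Sum equally long weight vectors position by position."""
    return [sum(column) for column in zip(*vectors)]


def reduce_to_simhash(vector):
    """Set each component to '1' if it is greater than 0, otherwise to '0'."""
    return "".join("1" if value > 0 else "0" for value in vector)


# Weight vectors of "CSDN" (weight 4) and "blog" (weight 5) from the example.
text_vector = accumulate([[4, -4, -4, 4, -4, 4], [5, -5, 5, -5, 5, 5]])
simhash = reduce_to_simhash(text_vector)
```

Running the same two functions once on the weight vectors of all segmented words and once on the weight vectors of the keywords would give the first and second simhash values, respectively.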

The screening module 140 is configured to calculate a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space, calculate a second distance value between the second simhash value and the third simhash value when the first distance value is greater than a first preset value, and screen out the first text when the second distance value is less than or equal to a second preset value.

In this embodiment, during actual text deduplication, the crawled text may be highly similar to an existing text, or may be an abstract or a summary of an existing text. For example, a summary announcement for a certain stock may be crawled while a more detailed announcement text already exists. If similarity is judged only from the first simhash value obtained from the segmented words, the two texts may be judged not to be duplicates. It is therefore necessary to further compare the two texts using the second simhash value obtained from the text keywords.

Specifically, a first distance value between the first simhash value and a third simhash value of the target text is calculated, where the third simhash value may be a simhash value obtained by segmenting the target text. The first distance value may be a Hamming distance. When the first distance value is greater than a first preset value (e.g., 3), the two texts are judged to be different or dissimilar according to the first simhash value obtained from the segmented words. In this case, a second distance value between the second simhash value and the third simhash value of the target text is further calculated. When the second distance value is less than or equal to a second preset value, the two texts are judged to be similar according to the simhash value obtained from the text keywords, and the first text may be screened out. The second preset value may be set according to the actual situation. It can be understood that the target text in the preset storage space is the text against which the first text is compared for similarity or identity; it may be a text crawled before the first text or any text in a text set in a database.

In one embodiment, the first text is screened out when the first distance value is less than or equal to the first preset value. When the first distance value between the two texts is less than or equal to the first preset value, the two texts are highly similar, and the first text can be screened out.

Further, when the second distance value is greater than the second preset value, the first text is stored to a text set to which the preset target text belongs. When the second distance value is larger than the second preset value, the two texts are judged not to belong to the similar texts according to the simhash value obtained by the text key words, so that the first text can be reserved.
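The two-stage decision described above can be sketched as follows. The function names are hypothetical; the first threshold of 3 comes from the example in the description, while the default second threshold is an assumption, since the description leaves it to be set according to the actual situation.

```python
def hamming_distance(a, b):
    """Count the positions at which two equally long bit strings differ."""
    if len(a) != len(b):
        raise ValueError("simhash values must have the same length")
    return sum(x != y for x, y in zip(a, b))


def should_screen_out(first_simhash, second_simhash, target_simhash,
                      first_preset=3, second_preset=3):
    """Return True if the first text is treated as a duplicate of the target.

    first_simhash:  simhash of the segmented words of the first text
    second_simhash: simhash of the keywords of the first text
    target_simhash: simhash of the segmented words of the target text
    """
    # Stage 1: compare the segmented-word simhash with the target.
    if hamming_distance(first_simhash, target_simhash) <= first_preset:
        return True
    # Stage 2: fall back to the keyword simhash.
    return hamming_distance(second_simhash, target_simhash) <= second_preset
```

A text for which this sketch returns `False` would be retained, that is, stored to the text set to which the target text belongs.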

In one embodiment, when the first distance value is greater than a first preset value, the screening module is further configured to: and calculating a third distance value between the first simhash value and a fourth simhash value of the target text, and screening out the first text when the third distance value is smaller than or equal to a third preset value.

The fourth simhash value is a simhash value corresponding to the keyword of the target text, and similar texts can be further screened out by comparing the distances between the keyword simhash values of the two texts.

In practical applications, abstract texts and summary texts can be deduplicated by combining segmented words with keywords. Two simhash values are kept for each text, one computed from the segmented words and one from the keywords. The segmented-word simhash value is compared first and the keyword simhash value second, which can significantly improve the practical effect of simhash in text deduplication and screening.

In addition, the invention also provides a text screening method. Fig. 3 is a schematic method flow diagram of an embodiment of the text screening method of the present invention. The processor 12 of the electronic device 1, when executing the text filtering program 10 stored in the memory 11, implements the following steps of the text filtering method:

Step S10: performing a word segmentation operation on a first text to be screened to obtain a plurality of segmented words, extracting keywords with preset parts of speech from the segmented words, and assigning associated weights to the segmented words and the keywords.

Step S20: calculating the hash value of each segmented word and each keyword by using a hash function, performing a weighting operation based on the hash value of each segmented word and its corresponding weight to obtain the weight vector of the segmented word, and performing a weighting operation based on the hash value of each keyword and its corresponding weight to obtain the weight vector of the keyword.

Step S30: accumulating the weight vectors of the segmented words to obtain a first weight vector of the first text, accumulating the weight vectors of the keywords to obtain a second weight vector of the first text, and performing a dimensionality reduction operation on the first weight vector and the second weight vector to obtain a first simhash value and a second simhash value of the first text.

The dimensionality reduction operation maps each high-dimensional feature vector to a low-dimensional one so as to improve processing speed. The first simhash value is the simhash value corresponding to the segmented words of the first text, and the second simhash value is the simhash value corresponding to the keywords of the first text. Specifically, each component of a weight vector of the first text is set to 1 if it is greater than 0 and to 0 otherwise, thus obtaining the simhash value of the first text. For example, an accumulated weight vector of (9, -9, 1, -1, 1, 9) is reduced to the simhash value "101011".

Step S40: calculating a first distance value between the first simhash value and a third simhash value of a preset target text, calculating a second distance value between the second simhash value and the third simhash value when the first distance value is greater than a first preset value, and screening out the first text when the second distance value is less than or equal to a second preset value.
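The two-stage comparison in Step S40 might look like the sketch below. It assumes that the distance value is the Hamming distance between equal-length bit strings, which is the usual choice for simhash but is not stated in this step. The function names and the threshold values in the usage lines are hypothetical, and the cases that Step S40 does not describe simply return `False` here.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("simhash values must have the same length")
    return sum(x != y for x, y in zip(a, b))


def screened_out_by_s40(first_simhash, second_simhash, target_simhash,
                        first_preset, second_preset):
    """Return True only when Step S40 as stated screens out the first text."""
    first_distance = hamming_distance(first_simhash, target_simhash)
    if first_distance <= first_preset:
        # Not covered by Step S40 as quoted; this sketch makes no decision.
        return False
    second_distance = hamming_distance(second_simhash, target_simhash)
    return second_distance <= second_preset


# First distance is 3 (> 1), second distance is 2 (<= 2): screened out.
decision = screened_out_by_s40("101011", "111011", "111000",
                               first_preset=1, second_preset=2)
```

The keyword-based simhash is only consulted when the full-text simhash is already far from the target, so the second comparison acts as a fallback check rather than a replacement for the first.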

Furthermore, an embodiment of the present invention also provides a computer-readable storage medium, which may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, and the like. The computer-readable storage medium includes a storage data area and a storage program area. The storage data area stores data created according to the use of the blockchain node, and the storage program area stores a text screening program 10, which implements the following operations when executed by a processor:

In another embodiment, to further ensure the privacy and security of all the data involved, all of the data may be stored in a node of a blockchain. For example, the hash values of texts and the texts that need to be retained may be stored in blockchain nodes.

It should be noted that the blockchain referred to in the present invention is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods, where each data block contains information on a batch of network transactions that is used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

The embodiment of the computer readable storage medium of the present invention is substantially the same as the embodiment of the text screening method, and will not be described herein again.

It should be noted that the serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, apparatus, article, or method that includes the element.

Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product, which is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions for enabling a terminal device (such as a mobile phone, a computer, an electronic device, or a network device) to execute the methods of the embodiments of the present invention.

The above description covers only preferred embodiments of the present invention and is not intended to limit the scope of the present invention. All equivalent structures or equivalent process modifications made by using the contents of the present specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present invention.

Claims

1. A text screening method, applied to an electronic device, characterized by comprising the following steps:

2. The method of claim 1, wherein the extracting the keywords of the predetermined part of speech from the plurality of segmented words comprises:

3. The method of claim 2, wherein the determining whether the first text has more than a preset number of keywords of a preset part of speech comprises:

4. The method of text screening according to claim 1, further comprising:

5. The text screening method of claim 1 or 4, further comprising:

6. The text screening method of claim 1, wherein when the first distance value is greater than a first preset value, the method further comprises:

7. The method of claim 1, wherein the performing a word segmentation operation on the first text to be screened to obtain a plurality of segmented words comprises:

8. A text screening apparatus, characterized in that the apparatus comprises:

9. An electronic device, characterized in that the electronic device comprises:

at least one processor; and

the memory stores a program executable by the at least one processor to enable the at least one processor to perform the text screening method of any one of claims 1 to 7.

10. A computer-readable storage medium, comprising a storage data area storing data created according to use of blockchain nodes and a storage program area storing a text screening program, wherein the text screening program, when executed by a processor, implements the steps of the text screening method according to any one of claims 1 to 7.