CN112364625A - Text screening method, device, equipment and storage medium - Google Patents

Text screening method, device, equipment and storage medium

Info

Publication number
CN112364625A
Authority
CN
China
Prior art keywords
text
value
preset
keywords
simhash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011302193.8A
Other languages
Chinese (zh)
Inventor
董润华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202011302193.8A
Publication of CN112364625A
Priority to PCT/CN2021/123907 (WO2022105497A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to big-data processing technology and provides a text screening method, apparatus, device and storage medium. The method includes: performing a word segmentation operation on a first text to be screened to obtain a plurality of segmented words; extracting keywords with preset parts of speech; assigning weights to the segmented words and the keywords; calculating hash values of the segmented words and the keywords; obtaining weight vectors of the segmented words and weight vectors of the keywords from the hash values and the weights; accumulating the weight vectors to obtain a first weight vector and a second weight vector of the text; performing dimensionality reduction on the first weight vector and the second weight vector to obtain a first simhash value and a second simhash value of the text; calculating a first distance value between the first simhash value and a third simhash value of a target text; when the first distance value is larger than a first preset value, calculating a second distance value between the second simhash value and the third simhash value; and screening out the first text when the second distance value is smaller than or equal to a second preset value. The invention can deduplicate abstract or summary texts.

Description

Text screening method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of data processing of big data, in particular to a text screening method, a text screening device, text screening equipment and a storage medium.
Background
When a crawler crawls texts, it needs to deduplicate crawled texts that are identical or extremely similar. Most text deduplication operations generate a fingerprint from the URL and place the fingerprint in a set for deduplication. In practice, however, one text may be reposted by multiple websites, so the fingerprints differ although the text content is the same. Moreover, when the crawler crawls an abstract or a summary of a certain text, the abstract or summary text is difficult to deduplicate.
Disclosure of Invention
In view of the above, the present invention provides a text screening method, apparatus, device and storage medium, which aim to solve the technical problem that abstract or summary texts are difficult to deduplicate in the prior art.
In order to achieve the above object, the present invention provides a text screening method, including:
performing a word segmentation operation on a first text to be screened to obtain a plurality of segmented words, extracting keywords with preset parts of speech from the segmented words, and assigning associated weights to the segmented words and the keywords;
calculating first hash values of the segmented words and the keywords, performing a weighting operation based on the first hash value and the weight of each segmented word to obtain a weight vector of each segmented word, and performing a weighting operation based on the first hash value and the weight of each keyword to obtain a weight vector of each keyword;
accumulating the weight vectors of all the segmented words to obtain a first weight vector of the first text, accumulating the weight vectors of all the keywords to obtain a second weight vector of the first text, and performing a dimensionality reduction operation on the first weight vector and the second weight vector to obtain a first simhash value and a second simhash value of the first text, respectively;
and calculating a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space, calculating a second distance value between the second simhash value and the third simhash value when the first distance value is larger than a first preset value, and screening out the first text when the second distance value is smaller than or equal to a second preset value.
Preferably, the extracting the keywords with the preset part of speech from the plurality of segmented words includes:
calculating the word frequency of each segmented word in the first text, calculating the IDF value and the TF value of each segmented word based on the word frequency, multiplying the IDF value of each segmented word by its corresponding TF value to obtain the TF-IDF value of each segmented word, and judging whether the first text contains more than a preset number of keywords with the preset parts of speech; if so, selecting the preset number of keywords with the preset parts of speech based on the TF-IDF value of each segmented word, wherein the keywords with the preset parts of speech comprise noun keywords and verb keywords.
Preferably, the determining whether the first text contains the keywords with the preset parts of speech, where the number of the keywords is greater than a preset number, includes:
and when it is judged that the first text does not contain more than the preset number of keywords with the preset parts of speech, screening out the first text, and randomly acquiring a text from a preset storage space as the first text to be screened, on which the word segmentation operation is performed again.
Preferably, the method further comprises:
and when the first distance value is smaller than or equal to the first preset value, screening out the first text.
Preferably, the method further comprises:
and when the second distance value is larger than the second preset value, storing the first text to a text set to which the preset target text belongs.
Preferably, when the first distance value is greater than a first preset value, the method further includes:
and calculating a third distance value between the first simhash value and a fourth simhash value of the target text, and screening out the first text when the third distance value is smaller than or equal to a third preset value.
Preferably, the performing a word segmentation operation on the first text to be filtered to obtain a plurality of words comprises:
matching the read text with the word stock according to a forward maximum matching method to obtain a first matching result, wherein the first matching result comprises a first number of first word groups and a second number of single words;
matching the read text with the word stock according to a reverse maximum matching method to obtain a second matching result, wherein the second matching result comprises a third number of second word groups and a fourth number of single words;
if the first number is equal to the third number and the second number is smaller than or equal to the fourth number, or if the first number is smaller than the third number, taking the first matching result as the word segmentation result of the first text; and if the first number is equal to the third number and the second number is greater than the fourth number, or if the first number is greater than the third number, taking the second matching result as the word segmentation result of the first text.
In order to achieve the above object, the present invention further provides a text screening apparatus, including:
an extraction module, configured to perform a word segmentation operation on a first text to be screened to obtain a plurality of segmented words, extract keywords with preset parts of speech from the plurality of segmented words, and assign associated weights to the segmented words and the keywords;
a weighting module, configured to calculate a first hash value of each segmented word and each keyword, perform a weighting operation based on the first hash value and the weight of each segmented word to obtain a weight vector of each segmented word, and perform a weighting operation based on the first hash value and the weight of each keyword to obtain a weight vector of each keyword;
a dimension reduction module, configured to accumulate the weight vectors of all the segmented words to obtain a first weight vector of the first text, accumulate the weight vectors of all the keywords to obtain a second weight vector of the first text, and perform a dimensionality reduction operation on the first weight vector and the second weight vector to obtain a first simhash value and a second simhash value of the first text, respectively;
a screening module, configured to calculate a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space, calculate a second distance value between the second simhash value and the third simhash value when the first distance value is larger than a first preset value, and screen out the first text when the second distance value is smaller than or equal to a second preset value.
In order to achieve the above object, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a program executable by the at least one processor to enable the at least one processor to perform any of the steps of the text screening method as described above.
To achieve the above object, the present invention further provides a computer-readable storage medium, which includes a storage data area and a storage program area, the storage data area stores data created according to the use of the blockchain node, the storage program area stores a text filtering program, and the text filtering program, when executed by a processor, implements any of the steps of the text filtering method as described above.
According to the text screening method, apparatus, device and storage medium provided by the invention, a word segmentation operation is performed on a first text, keywords with preset parts of speech are extracted from the first text, and associated weights are assigned to the segmented words and the keywords. The vectors corresponding to the segmented words and the vectors corresponding to the keywords are calculated from a hash function and the weights to obtain the corresponding simhash values. The distances between these simhash values and the simhash value of a target text are then compared with preset values, and the similarity of the texts is judged from the distances, which improves the identification of text similarity. When the first text is an abstract or a summary of a certain text, the simhash values obtained from both the keywords and the segmented words allow the abstract or summary text to be deduplicated accurately.
Drawings
FIG. 1 is a diagram of an electronic device according to a preferred embodiment of the present invention;
FIG. 2 is a block diagram of a preferred embodiment of the text screening apparatus of FIG. 1;
FIG. 3 is a flow chart of a preferred embodiment of a text screening method of the present invention;
The implementation, functional features and advantages of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic diagram of an electronic device 1 according to a preferred embodiment of the invention.
The electronic device 1 includes but is not limited to: a memory 11, a processor 12, a display 13, and a network interface 14. The electronic device 1 is connected to a network through the network interface 14 to obtain raw data. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another communication network.
The memory 11 includes at least one type of readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, or an optical disk. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk or internal memory of the electronic device 1. In other embodiments, the memory 11 may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 1. Of course, the memory 11 may also comprise both an internal storage unit and an external storage device of the electronic device 1. In this embodiment, the memory 11 is generally used for storing the operating system and various application software installed in the electronic device 1, such as the program code of the text filtering program 10. Further, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 12 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is typically used for controlling the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or process data, for example, to run the program code of the text filtering program 10.
The display 13 may be referred to as a display screen or display unit. In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch screen, or the like. The display 13 is used for displaying information processed in the electronic device 1 and for displaying a visual work interface, e.g. displaying the results of data statistics.
The network interface 14 may optionally comprise a standard wired interface, a wireless interface (e.g. WI-FI interface), the network interface 14 typically being used for establishing a communication connection between the electronic device 1 and other electronic devices.
FIG. 1 shows only the electronic device 1 having the components 11-14 and the text filtering program 10, but it is to be understood that not all of the shown components are required, and that more or fewer components may be implemented instead.
Optionally, the electronic device 1 may further comprise a user interface, the user interface may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further comprise a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch screen, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
The electronic device 1 may further include a Radio Frequency (RF) circuit, a sensor, an audio circuit, and the like, which are not described in detail herein.
In the above embodiment, the processor 12, when executing the text filtering program 10 stored in the memory 11, may implement the following steps:
performing a word segmentation operation on a first text to be screened to obtain a plurality of segmented words, extracting keywords with preset parts of speech from the segmented words, and assigning associated weights to the segmented words and the keywords;
calculating first hash values of the segmented words and the keywords, performing a weighting operation based on the first hash value and the weight of each segmented word to obtain a weight vector of each segmented word, and performing a weighting operation based on the first hash value and the weight of each keyword to obtain a weight vector of each keyword;
accumulating the weight vectors of all the segmented words to obtain a first weight vector of the first text, accumulating the weight vectors of all the keywords to obtain a second weight vector of the first text, and performing a dimensionality reduction operation on the first weight vector and the second weight vector to obtain a first simhash value and a second simhash value of the first text, respectively;
and calculating a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space, calculating a second distance value between the second simhash value and the third simhash value when the first distance value is larger than a first preset value, and screening out the first text when the second distance value is smaller than or equal to a second preset value.
The storage device may be the memory 11 of the electronic device 1, or may be another storage device communicatively connected to the electronic device 1.
For a detailed description of the above steps, please refer to the following description of fig. 2 regarding a functional block diagram of an embodiment of the text filtering apparatus 100 and fig. 3 regarding a flowchart of an embodiment of the text filtering method.
Referring to fig. 2, a functional block diagram of the text filtering apparatus 100 according to the present invention is shown.
The text filtering apparatus 100 according to the present invention may be installed in an electronic device. According to the implemented functions, the text filtering apparatus 100 may include an extraction module 110, a weighting module 120, a dimension reduction module 130, and a screening module 140. A module according to the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
The extraction module 110 is configured to perform a word segmentation operation on the first text to be screened to obtain a plurality of segmented words, extract keywords with preset parts of speech from the plurality of segmented words, and assign an associated weight to each segmented word and each keyword.
In this embodiment, the technical solution is described using the example of a crawler that must deduplicate identical or extremely similar texts while crawling; it should be understood that the application scenario of the technical solution is not limited to this. When a text is crawled, it must be judged whether the text is similar or identical to a previously crawled text; if so, the text can be screened out. Specifically, when a first text to be deduplicated is obtained, a word segmentation operation is performed on the first text to obtain a plurality of segmented words, and keywords with preset parts of speech are extracted from the plurality of segmented words, where the keywords with preset parts of speech may be noun keywords and verb keywords. Associated weights are then assigned to the segmented words and the keywords; the weights may be assigned according to the number of the segmented words.
For example, suppose the first text contains the sentence "July, author of the CSDN blog 'The Method of Structure, the Way of Algorithms'". After word segmentation, a weight is assigned to each segmented word: CSDN (4), blog (5), structure (3), of (1), method (2), algorithm (3), of (1), way (2), of (1), author (5), July (5). The number in parentheses represents the importance of the word in the whole sentence; the larger the number, the more important the word.
In one embodiment, the extracting the keywords of the preset part of speech from the plurality of segmented words includes:
calculating the word frequency of each segmented word in the first text, calculating the IDF value and the TF value of each segmented word based on the word frequency, multiplying the IDF value of each segmented word by its corresponding TF value to obtain the TF-IDF value of each segmented word, and judging whether the first text contains more than a preset number of keywords with the preset parts of speech; if so, selecting the preset number of keywords with the preset parts of speech based on the TF-IDF value of each segmented word, wherein the keywords with the preset parts of speech comprise noun keywords and verb keywords.
First, the number of occurrences of each word in the first text is counted and its IDF (inverse document frequency) value is calculated; then the TF (term frequency) value of each word in the first text is calculated. Multiplying the IDF value by the TF value gives the TF-IDF value of the word, which evaluates the importance of the word in the text: the larger the TF-IDF value, the more important the word is to the text and the higher its priority. The words with the highest TF-IDF values can therefore be used as the keywords of the first text. It is then judged whether the first text contains more than a preset number (for example, 20) of keywords with the preset parts of speech; if so, the noun keywords and verb keywords whose TF-IDF values rank in the top 20 are selected as the keywords with the preset parts of speech of the first text.
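The keyword selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the part-of-speech tags "n" and "v", the smoothed IDF formula, and the treatment of a text with exactly the preset number of candidates are all assumptions, and the part-of-speech tags are assumed to come from an external tagger.

```python
import math
from collections import Counter

def tf_idf_scores(tokens, corpus):
    """Return {word: TF-IDF} for one tokenized text against a tokenized corpus."""
    counts = Counter(tokens)
    scores = {}
    for word, count in counts.items():
        tf = count / len(tokens)
        doc_freq = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (an assumed variant; the source does not give a formula).
        idf = math.log((1 + len(corpus)) / (1 + doc_freq)) + 1
        scores[word] = tf * idf
    return scores

def select_keywords(tagged_tokens, corpus, allowed_pos=("n", "v"), top_k=20):
    """Pick the top_k nouns/verbs by TF-IDF, or None if the text has too few.

    tagged_tokens is a list of (word, pos) pairs; the tag names are hypothetical.
    """
    tokens = [word for word, _ in tagged_tokens]
    scores = tf_idf_scores(tokens, corpus)
    candidates = {word for word, pos in tagged_tokens if pos in allowed_pos}
    if len(candidates) < top_k:
        return None  # too few keywords: the caller would screen out this text
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(candidates, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

Returning None for a text with too few candidates lets the caller skip the hashing and dimensionality reduction steps for texts whose features are not distinctive.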
Further, when it is judged that the first text does not contain more than the preset number of keywords with the preset parts of speech, the first text is screened out, and a text is randomly acquired from a preset storage space as the first text to be screened, on which the word segmentation operation is performed again.
That is, when the number of keywords with the preset parts of speech in the first text is less than 20, the first text is deleted, and a text is randomly acquired from the preset storage space as the first text to be screened, on which the word segmentation operation is performed again. Deleting texts with insufficient keywords (that is, texts whose features are not distinctive) avoids performing further operations such as hashing and dimensionality reduction on them, which speeds up the deduplication of massive amounts of text.
In one embodiment, performing a word segmentation operation on a first text to be filtered to obtain a plurality of words comprises:
matching the read text with the word stock according to a forward maximum matching method to obtain a first matching result, wherein the first matching result comprises a first number of first word groups and a second number of single words;
matching the read text with the word stock according to a reverse maximum matching method to obtain a second matching result, wherein the second matching result comprises a third number of second word groups and a fourth number of single words;
if the first number is equal to the third number and the second number is smaller than or equal to the fourth number, or if the first number is smaller than the third number, taking the first matching result as the word segmentation result of the first text; and if the first number is equal to the third number and the second number is greater than the fourth number, or if the first number is greater than the third number, taking the second matching result as the word segmentation result of the first text.
By performing forward and reverse matching at the same time and taking the matching result with fewer single characters and more phrases as the segmentation result of the text, the accuracy of word segmentation can be improved.
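The bidirectional matching above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the `max_len` window, and the fallback of emitting an unmatched character as a single word are not from the source. The selection rule follows the comparison of phrase counts and single-word counts given in the text, with the reverse result taken in all remaining cases.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right matching, trying the longest candidate first."""
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # unmatched characters become single words
                result.append(piece)
                i += size
                break
    return result

def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left matching, trying the longest candidate first."""
    result, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                result.append(piece)
                j -= size
                break
    result.reverse()
    return result

def count_units(segments):
    """Return (number of multi-character phrases, number of single words)."""
    phrases = sum(1 for s in segments if len(s) > 1)
    return phrases, len(segments) - phrases

def bidirectional_segment(text, lexicon, max_len=4):
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)
    first, second = count_units(fwd)   # forward: phrases, single words
    third, fourth = count_units(bwd)   # reverse: phrases, single words
    if (first == third and second <= fourth) or first < third:
        return fwd
    return bwd
```

For instance, with the hypothetical lexicon {"ab", "bcd"} and the text "abcd", forward matching gives "ab", "c", "d" and reverse matching gives "a", "bcd"; both contain one phrase, so the reverse result, which has fewer single words, is chosen.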
The weighting module 120 is configured to calculate a first hash value of each segmented word and each keyword, perform a weighting operation based on the first hash value and the weight of each segmented word to obtain a weight vector of each segmented word, and perform a weighting operation based on the first hash value and the weight of each keyword to obtain a weight vector of each keyword.
In this embodiment, a hash function may be used to calculate the first hash value of each segmented word and the first hash value of each keyword, where the first hash value is an n-bit signature composed of the binary digits "0" and "1". For example, the hash value Hash(CSDN) of "CSDN" is "100101", and the hash value Hash(blog) of "blog" is "101011". A weighting operation is then performed on the hash value of each segmented word and its corresponding weight to obtain the weight vector of the segmented word, and on the hash value of each keyword and its corresponding weight to obtain the weight vector of the keyword.
Specifically, each segmented word and keyword is weighted on the basis of its first hash value, that is, W = Hash × weight: where a bit of the hash value is 1, the corresponding component is the positive weight, and where a bit is 0, the corresponding component is the negative weight. For example, performing the weighting operation on the hash value "100101" of "CSDN" with weight 4 gives the weight vector W(CSDN) = 100101 × 4 = (4, -4, -4, 4, -4, 4), and performing it on the hash value "101011" of "blog" with weight 5 gives W(blog) = 101011 × 5 = (5, -5, 5, -5, 5, 5). The remaining segmented words and keywords are processed in the same way.
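The hashing and weighting steps can be sketched as follows. The source does not name a hash function, so the use of MD5 truncated to `n_bits` is purely an assumption for illustration, as are the function names; the weighting itself follows the rule stated above (bit 1 contributes +weight, bit 0 contributes -weight).

```python
import hashlib

def hash_bits(token, n_bits=64):
    """Return an n-bit binary signature of a token as a string of '0'/'1'.

    MD5 is an assumed choice; any hash yielding n bits would serve (n_bits <= 128).
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") >> (128 - n_bits)  # keep the top n_bits
    return format(value, f"0{n_bits}b")

def weight_vector(bits, weight):
    """Map each bit to +weight (bit '1') or -weight (bit '0')."""
    return [weight if bit == "1" else -weight for bit in bits]

# The 6-bit hash values below are the ones given in the text, not MD5 outputs.
print(weight_vector("100101", 4))  # [4, -4, -4, 4, -4, 4]
print(weight_vector("101011", 5))  # [5, -5, 5, -5, 5, 5]
```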
The dimension reduction module 130 is configured to accumulate the weight vectors of the segmented words to obtain a first weight vector of the first text, accumulate the weight vectors of the keywords to obtain a second weight vector of the first text, and perform dimensionality reduction on the first weight vector and the second weight vector to obtain a first simhash value and a second simhash value of the first text, respectively.
In this embodiment, the weight vectors of the segmented words are accumulated component by component to obtain a sequence that serves as the first weight vector of the first text, and the weight vectors of the keywords are accumulated in the same way to obtain the second weight vector of the first text. For example, accumulating the weight vector (4, -4, -4, 4, -4, 4) of "CSDN" and the weight vector (5, -5, 5, -5, 5, 5) of "blog" gives (4+5, -4-5, -4+5, 4-5, -4+5, 4+5) = (9, -9, 1, -1, 1, 9).
A dimensionality reduction operation is then performed on the first weight vector and the second weight vector, mapping each high-dimensional feature vector to a low-dimensional one to improve processing speed, and yielding the first simhash value and the second simhash value of the first text. The first simhash value is the simhash value corresponding to the segmented words of the first text, and the second simhash value is the simhash value corresponding to the keywords of the first text. Specifically, each component of a weight vector of the first text is set to 1 if it is greater than 0 and to 0 otherwise. For example, performing dimensionality reduction on the vector (9, -9, 1, -1, 1, 9) calculated above (components greater than 0 become 1, components less than 0 become 0) gives the simhash value "101011".
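The accumulation and dimensionality reduction steps can be checked with a short sketch. The function names are illustrative assumptions; the input vectors are the "CSDN" and "blog" weight vectors from the example in the text.

```python
def accumulate(vectors):
    """Sum equal-length weight vectors component by component."""
    return [sum(column) for column in zip(*vectors)]

def reduce_to_simhash(vector):
    """Set each component to '1' if it is greater than 0, otherwise '0'."""
    return "".join("1" if value > 0 else "0" for value in vector)

csdn = [4, -4, -4, 4, -4, 4]   # weight vector of "CSDN" from the text
blog = [5, -5, 5, -5, 5, 5]    # weight vector of "blog" from the text

total = accumulate([csdn, blog])
print(total)                     # [9, -9, 1, -1, 1, 9]
print(reduce_to_simhash(total))  # 101011
```

The same two functions would be applied once to the vectors of all segmented words (giving the first simhash value) and once to the vectors of the keywords only (giving the second simhash value).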
The filtering module 140 is configured to calculate a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space, calculate a second distance value between the second simhash value and the third simhash value when the first distance value is greater than a first preset value, and filter the first text when the second distance value is less than or equal to a second preset value.
In this embodiment, during an actual text deduplication operation, the crawler may obtain a text that is highly similar to an existing text, an abstract of an existing text, or a summary of an existing text. For example, a summary bulletin of a certain stock may be crawled while a more detailed version of the same bulletin also exists. If similarity is judged only from the first simhash value obtained from the participles of the text, the two texts may be judged as not duplicated. It is therefore necessary to further compare the two texts using the second simhash value obtained from the keywords of the text.
Specifically, a first distance value between the first simhash value and a third simhash value of the target text is calculated. It can be understood that the third simhash value may be a simhash value obtained by performing word segmentation on the target text. The first distance value may be a Hamming distance. When the first distance value is greater than a first preset value (e.g., 3), the two texts are judged to be different or dissimilar according to the first simhash value obtained from word segmentation. In this case, a second distance value between the second simhash value and the third simhash value of the target text may be further calculated. When the second distance value is less than or equal to a second preset value, the two texts are judged to be similar according to the simhash value obtained from the text keywords, and the first text can be screened out; the second preset value may be set according to the actual situation. It can be understood that the target text in the preset storage space is the text against which the first text is compared to determine whether the two are similar or identical, and the target text may be a text crawled before the first text or any text in a text set in a database.
In one embodiment, the first text is screened out when the first distance value is less than or equal to the first preset value. When the first distance value of the two texts is less than or equal to the first preset value, the two texts are highly similar, and the first text can be screened out.
Further, when the second distance value is greater than the second preset value, the first text is stored in the text set to which the preset target text belongs. When the second distance value is greater than the second preset value, the two texts are judged not to be similar according to the simhash value obtained from the text keywords, so the first text can be retained.
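The two-stage judgment described above can be sketched as follows. This is a minimal sketch under stated assumptions: simhash values are represented as equal-length bit strings, the function names are hypothetical, the first preset value of 3 comes from the example in the text, and the second preset value of 3 is only a placeholder, since the text leaves it to be set according to the actual situation.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("simhash values must have the same length")
    return sum(x != y for x, y in zip(a, b))


def should_screen_out(first: str, second: str, third: str,
                      first_preset: int = 3, second_preset: int = 3) -> bool:
    """first/second: participle and keyword simhash of the first text;
    third: participle simhash of the target text."""
    if hamming(first, third) <= first_preset:
        return True                    # similar by word segmentation
    # Dissimilar by word segmentation: fall back to the keyword simhash.
    return hamming(second, third) <= second_preset


# Distance 1 on the participle simhash: screened out at the first stage.
print(should_screen_out("101011", "000000", "101010"))        # True
# Distance 6, then keyword distance 1 <= 1: screened out at the second stage.
print(should_screen_out("111111", "000001", "000000", 3, 1))  # True
# Distance 6, then keyword distance 4 > 1: the first text is retained.
print(should_screen_out("111111", "001111", "000000", 3, 1))  # False
```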
In one embodiment, when the first distance value is greater than the first preset value, the screening module is further configured to calculate a third distance value between the first simhash value and a fourth simhash value of the target text, and to screen out the first text when the third distance value is less than or equal to a third preset value.
The fourth simhash value is a simhash value corresponding to the keyword of the target text, and similar texts can be further screened out by comparing the distances between the keyword simhash values of the two texts.
In practical applications, abstract texts and summary texts can be deduplicated by combining word segmentation with keywords. Two simhash values are retained for each text, one for its participles and one for its keywords. The participle simhash value is judged first and the keyword simhash value second, which can markedly improve the practical effect of simhash in text deduplication and screening.
In addition, the invention also provides a text screening method. Fig. 3 is a schematic method flow diagram of an embodiment of the text screening method of the present invention. The processor 12 of the electronic device 1, when executing the text filtering program 10 stored in the memory 11, implements the following steps of the text filtering method:
Step S10: performing word segmentation operation on a first text to be screened to obtain a plurality of segmented words, extracting keywords with preset parts of speech from the segmented words, and distributing associated weights to the segmented words and the keywords.
In this embodiment, the technical solution is described using the example of a crawler that must deduplicate identical or extremely similar texts while crawling; it should be understood that the application scenario of the technical solution is not limited to this. When a text is crawled, it is necessary to judge whether it is similar or identical to a previously crawled text, and if so, the text can be screened out. Specifically, when a first text to be deduplicated is obtained, a word segmentation operation is performed on it to obtain a plurality of participles, and keywords of preset parts of speech in the first text are extracted from the participles, where the keywords of preset parts of speech may be noun keywords and verb keywords. Associated weights are then assigned to the participles and the keywords; the weights may be assigned according to the number of the participles.
For example, the first text contains the sentence "author July of the way of the CSDN blog structure algorithm". After word segmentation, a weight is assigned to each participle: CSDN (4) blog (5) structure (3) (1) method (2) algorithm (3) (1) author (1) track (2) (5) July (5), where the number in brackets represents the importance of the word in the whole sentence; the larger the number, the more important the word.
In one embodiment, the extracting the keywords of the preset part of speech from the plurality of segmented words includes:
calculating the word frequency of each participle in the first text, calculating the IDF value and the TF value of each participle based on the word frequency, multiplying the IDF value of each participle by the corresponding TF value to obtain the TF-IDF value of each participle, and judging whether more than a preset number of keywords of preset parts of speech exist in the first text; if so, selecting the preset number of keywords of the preset parts of speech based on the TF-IDF value of each participle, wherein the keywords of preset parts of speech comprise noun keywords and verb keywords.
The number of occurrences of every word in the first text is counted, the IDF (inverse document frequency) value is calculated, and the TF (term frequency) value of each word in the first text is then calculated. The IDF value is multiplied by the TF value to obtain the TF-IDF value of the word. The TF-IDF value evaluates how important a word is to the text: the larger the TF-IDF value, the more important the word and the higher its priority, so the words ranked highest by TF-IDF value can be used as the keywords of the first text. It is then judged whether more than a preset number (for example, 20) of keywords of preset parts of speech exist in the first text; if so, the noun keywords and verb keywords whose TF-IDF values rank in the top 20 are selected as the keywords of preset parts of speech of the first text.
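The keyword selection can be illustrated with a short Python sketch. Several things here are assumptions rather than details from the text: the function names, the reference corpus used to compute document frequencies, the smoothed IDF formula, and the part-of-speech tags "n" and "v", which are supplied as a plain dictionary instead of coming from a real tagger.

```python
import math
from collections import Counter


def tf_idf_scores(tokens, corpus):
    """TF-IDF per word; corpus is a list of token sets (assumed reference set)."""
    counts = Counter(tokens)
    scores = {}
    for word, count in counts.items():
        tf = count / len(tokens)
        doc_freq = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (an assumed variant) so the value stays positive.
        idf = math.log((1 + len(corpus)) / (1 + doc_freq)) + 1
        scores[word] = tf * idf
    return scores


def select_keywords(tokens, pos_tags, corpus, preset_number=20, allowed=("n", "v")):
    """Return the top keywords, or None when the text has too few candidates."""
    scores = tf_idf_scores(tokens, corpus)
    candidates = [w for w in scores if pos_tags.get(w) in allowed]
    if len(candidates) <= preset_number:
        return None  # not "more than the preset number" of candidates
    ranked = sorted(candidates, key=lambda w: scores[w], reverse=True)
    return ranked[:preset_number]


tokens = ["stock", "stock", "rise", "the", "report"]
pos_tags = {"stock": "n", "rise": "v", "report": "n", "the": "u"}
corpus = [{"the", "stock"}, {"the", "report"}, {"the"}]
print(select_keywords(tokens, pos_tags, corpus, preset_number=2))  # ['stock', 'rise']
print(select_keywords(tokens, pos_tags, corpus, preset_number=3))  # None
```

A `None` result corresponds to the case in which the text lacks enough keywords of the preset parts of speech.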
Further, when it is judged that the first text does not contain more than the preset number of keywords of preset parts of speech, the first text is screened out, and a text is randomly acquired from the preset storage space as the first text to be screened, on which word segmentation is performed again.
When the number of keywords of preset parts of speech in the first text is less than 20, the first text is deleted, and a text is randomly acquired from the preset storage space as the first text to be screened, on which word segmentation is performed again. Deleting texts with insufficient keywords (that is, texts without distinctive features) avoids performing further operations such as hashing and dimension reduction on them, which improves the deduplication speed for massive numbers of texts.
In one embodiment, performing a word segmentation operation on a first text to be filtered to obtain a plurality of words comprises:
matching the read text with the word stock according to a forward maximum matching method to obtain a first matching result, wherein the first matching result comprises a first number of first word groups and a second number of single words;
matching the read text with the word stock according to a reverse maximum matching method to obtain a second matching result, wherein the second matching result comprises a third number of second word groups and a fourth number of single words;
if the first number is equal to the third number and the second number is less than or equal to the fourth number, or if the first number is less than the third number, taking the first matching result as the word segmentation result of the first text; and if the first number is equal to the third number and the second number is greater than the fourth number, or if the first number is greater than the third number, taking the second matching result as the word segmentation result of the first text.
By performing segmentation matching in both the forward and the reverse direction, the matching result with fewer single characters and more phrases is found and used as the segmentation result of the sentence, which can improve segmentation accuracy.
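The bidirectional matching can be sketched as below. This is an illustrative sketch, not the patented implementation: the function names and the toy word stock are hypothetical, and the selection step follows the comparison rule as worded in the text (compare the numbers of word groups first, then the numbers of single characters on a tie), with the tie-break for the second result read as the mirror image of the first, which is an assumption.

```python
def forward_max_match(text, word_stock, max_len):
    """Greedy longest match scanning from the start of the text."""
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in word_stock:
                result.append(piece)
                i += size
                break
    return result


def backward_max_match(text, word_stock, max_len):
    """Greedy longest match scanning from the end of the text."""
    result, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in word_stock:
                result.append(piece)
                j -= size
                break
    result.reverse()
    return result


def count_groups_and_singles(segments):
    singles = sum(1 for s in segments if len(s) == 1)
    return len(segments) - singles, singles


def choose_segmentation(text, word_stock):
    max_len = max((len(w) for w in word_stock), default=1)
    fwd = forward_max_match(text, word_stock, max_len)
    bwd = backward_max_match(text, word_stock, max_len)
    first, second = count_groups_and_singles(fwd)   # first / second number
    third, fourth = count_groups_and_singles(bwd)   # third / fourth number
    if first < third or (first == third and second <= fourth):
        return fwd
    return bwd


stock = {"ab", "abc", "cd"}
print(forward_max_match("abcd", stock, 3))   # ['abc', 'd']
print(backward_max_match("abcd", stock, 3))  # ['ab', 'cd']
print(choose_segmentation("abcd", stock))    # ['abc', 'd']
```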
Step S20: calculating the hash value of each participle and each keyword by using a hash function, executing weighting operation based on the hash value of each participle and the weight corresponding to each participle to obtain the weight vector of the participle, and executing weighting operation based on the hash value of each keyword and the weight corresponding to each keyword to obtain the weight vector of the keyword.
In this embodiment, a Hash function may be used to calculate a first Hash value of each participle and a first Hash value of each keyword, where the first Hash value is an n-bit signature composed of binary numbers "0" and "1", for example, the Hash value Hash (CSDN) of "CSDN" is "100101", and the Hash value Hash (blog) of "blog" is "101011". And then, according to the hash value of each participle and the weight corresponding to each participle, executing weighting operation to obtain the weight vector of the participle, and according to the hash value of each keyword and the weight corresponding to each keyword, executing weighting operation to obtain the weight vector of the keyword.
Specifically, each participle and keyword is weighted on the basis of its first hash value, that is, W = Hash × weight: where a bit of the hash value is 1, the weight is taken as positive, and where a bit is 0, the weight is taken as negative. For example, performing the weighting operation on the hash value "100101" of "CSDN" with weight 4 gives the weight vector W(CSDN) = 100101 × 4 = (4, -4, -4, 4, -4, 4), and performing it on the hash value "101011" of "blog" with weight 5 gives W(blog) = 101011 × 5 = (5, -5, 5, -5, 5, 5). The remaining participles and keywords are processed in the same way.
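As a rough illustration of the hashing and weighting step, the sketch below builds an n-bit signature and turns it into a weight vector. The text does not name a hash function, so the use of MD5 from Python's `hashlib`, the default of 64 bits, and both function names are assumptions made only for this example.

```python
import hashlib


def token_hash_bits(token: str, n_bits: int = 64) -> str:
    """Return an n-bit binary signature for a token (MD5 is an assumed choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") & ((1 << n_bits) - 1)
    return format(value, f"0{n_bits}b")


def weight_vector(hash_bits: str, weight: int) -> list:
    """Map each '1' bit to +weight and each '0' bit to -weight."""
    return [weight if bit == "1" else -weight for bit in hash_bits]


# The 6-bit hash values from the example in the text:
print(weight_vector("100101", 4))  # [4, -4, -4, 4, -4, 4]
print(weight_vector("101011", 5))  # [5, -5, 5, -5, 5, 5]

# With an assumed real hash function, the signature is simply longer:
bits = token_hash_bits("CSDN", n_bits=16)
print(len(weight_vector(bits, 4)))  # 16
```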
Step S30: accumulating the weight vectors of the participles to obtain a first weight vector of the first text, accumulating the weight vectors of the keywords to obtain a second weight vector of the first text, and performing dimensionality reduction operation on the first weight vector and the second weight vector to obtain a first simhash value and a second simhash value of the first text.
In this embodiment, the weight vectors of the participles are accumulated position by position to obtain a sequence that serves as the first weight vector of the first text, and the weight vectors of the keywords are accumulated in the same way to obtain the second weight vector of the first text. For example, accumulating the weight vector (4, -4, -4, 4, -4, 4) of "CSDN" and the weight vector (5, -5, 5, -5, 5, 5) of "blog" gives (4+5, -4+(-5), -4+5, 4+(-5), -4+5, 4+5), that is, (9, -9, 1, -1, 1, 9).
A dimensionality reduction operation is then performed on the first weight vector and the second weight vector, mapping each high-dimensional feature vector to a low-dimensional one to improve processing speed, which yields the first simhash value and the second simhash value of the first text. The first simhash value is the simhash value corresponding to the participles of the first text, and the second simhash value is the simhash value corresponding to the keywords of the first text. Specifically, each position of a weight vector of the first text is set to 1 if its value is greater than 0 and to 0 otherwise, thereby obtaining the simhash value of the first text. For example, reducing the accumulated vector (9, -9, 1, -1, 1, 9) calculated above in this way (a position greater than 0 is set to 1, a position less than 0 is set to 0) gives the simhash value "101011".
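Steps S20 and S30 can be summarized in one formula. The notation is introduced here only for illustration and does not appear in the original text: w_i is the weight of the i-th participle (or keyword), b_{i,j} is bit j of its n-bit hash value, V_j is position j of the accumulated weight vector, and s_j is bit j of the resulting simhash value.

```latex
V_j = \sum_{i} w_i \,(2\,b_{i,j} - 1), \qquad
s_j = \begin{cases} 1, & V_j > 0 \\ 0, & \text{otherwise} \end{cases}
\qquad (j = 1, \dots, n)

% Worked example with n = 6 and two tokens:
% CSDN: b = 100101, w = 4;  blog: b = 101011, w = 5
V = (4+5,\; -4-5,\; -4+5,\; 4-5,\; -4+5,\; 4+5) = (9, -9, 1, -1, 1, 9)
\;\Longrightarrow\; s = 101011
```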
Step S40: calculating a first distance value between the first simhash value and a third simhash value of a preset target text, calculating a second distance value between the second simhash value and the third simhash value when the first distance value is greater than a first preset value, and screening out the first text when the second distance value is less than or equal to a second preset value.
In this embodiment, during an actual text deduplication operation, the crawler may obtain a text that is highly similar to an existing text, an abstract of an existing text, or a summary of an existing text. For example, a summary bulletin of a certain stock may be crawled while a more detailed version of the same bulletin also exists. If similarity is judged only from the first simhash value obtained from the participles of the text, the two texts may be judged as not duplicated. It is therefore necessary to further compare the two texts using the second simhash value obtained from the keywords of the text.
Specifically, a first distance value between the first simhash value and a third simhash value of the target text is calculated. It can be understood that the third simhash value may be a simhash value obtained by performing word segmentation on the target text. The first distance value may be a Hamming distance. When the first distance value is greater than a first preset value (e.g., 3), the two texts are judged to be different or dissimilar according to the first simhash value obtained from word segmentation. In this case, a second distance value between the second simhash value and the third simhash value of the target text may be further calculated. When the second distance value is less than or equal to a second preset value, the two texts are judged to be similar according to the simhash value obtained from the text keywords, and the first text can be screened out; the second preset value may be set according to the actual situation. It can be understood that the target text in the preset storage space is the text against which the first text is compared to determine whether the two are similar or identical, and the target text may be a text crawled before the first text or any text in a text set in a database.
In one embodiment, the first text is screened out when the first distance value is less than or equal to the first preset value. When the first distance value of the two texts is less than or equal to the first preset value, the two texts are highly similar, and the first text can be screened out.
Further, when the second distance value is greater than the second preset value, the first text is stored in the text set to which the preset target text belongs. When the second distance value is greater than the second preset value, the two texts are judged not to be similar according to the simhash value obtained from the text keywords, so the first text can be retained.
In one embodiment, when the first distance value is greater than the first preset value, the method further includes: calculating a third distance value between the first simhash value and a fourth simhash value of the target text, and screening out the first text when the third distance value is less than or equal to a third preset value.
The fourth simhash value is a simhash value corresponding to the keyword of the target text, and similar texts can be further screened out by comparing the distances between the keyword simhash values of the two texts.
In practical applications, abstract texts and summary texts can be deduplicated by combining word segmentation with keywords. Two simhash values are retained for each text, one for its participles and one for its keywords. The participle simhash value is judged first and the keyword simhash value second, which can markedly improve the practical effect of simhash in text deduplication and screening.
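One way to picture "two simhash values per text" is the sketch below, which keeps a pair of integer fingerprints for every retained text and applies the checks in the order described. All names are hypothetical, the preset values of 3 are placeholders, and the third check follows the wording of the embodiment above (first simhash value of the new text against the fourth, keyword-based, simhash value of the target text).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprints:
    segment_simhash: int   # simhash built from all participles
    keyword_simhash: int   # simhash built from the keywords only


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def is_duplicate(new, stored, first_preset=3, second_preset=3, third_preset=3):
    # First check: participle simhash against participle simhash.
    if hamming(new.segment_simhash, stored.segment_simhash) <= first_preset:
        return True
    # Second check: keyword simhash of the new text against the stored
    # participle simhash (the "third simhash value").
    if hamming(new.keyword_simhash, stored.segment_simhash) <= second_preset:
        return True
    # Optional third check against the stored keyword simhash
    # (the "fourth simhash value").
    return hamming(new.segment_simhash, stored.keyword_simhash) <= third_preset


def deduplicate(candidates):
    """Keep only texts that are not duplicates of an already retained text."""
    kept = []
    for fp in candidates:
        if not any(is_duplicate(fp, old) for old in kept):
            kept.append(fp)
    return kept


a = Fingerprints(0b000000101011, 0b000000111000)
b = Fingerprints(0b000000101010, 0b000000000111)  # 1 bit from a: duplicate
c = Fingerprints(0b000000010100, 0b000000101011)  # keywords match a: duplicate
d = Fingerprints(0b111111000000, 0b111111010100)  # far from a on every check
print(deduplicate([a, b, c, d]) == [a, d])  # True
```

The linear scan over `kept` is only for clarity; a large collection would need an index over the fingerprints, which is outside what the text describes.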
Furthermore, the embodiment of the present invention also provides a computer-readable storage medium, which may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, and the like. The computer readable storage medium includes a storage data area and a storage program area, the storage data area stores data created according to the use of the blockchain node, the storage program area stores a text filtering program 10, and the text filtering program 10 realizes the following operations when being executed by a processor:
performing word segmentation operation on a first text to be screened to obtain a plurality of segmented words, extracting keywords with preset parts of speech from the segmented words, and distributing associated weights to the segmented words and the keywords;
calculating first hash values of the participles and the keywords, performing weighting operation based on the first hash values and the weights of the participles to obtain weight vectors of the participles, and performing weighting operation based on the first hash values and the weights of the keywords to obtain the weight vectors of the keywords;
accumulating the weight vectors of all the participles to obtain a first weight vector of the first text, accumulating the weight vectors of all the keywords to obtain a second weight vector of the first text, and performing dimensionality reduction operation on the first weight vector and the second weight vector to respectively obtain a first simhash value and a second simhash value of the first text;
and calculating a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space, calculating a second distance value between the second simhash value and the third simhash value when the first distance value is larger than a first preset value, and screening out the first text when the second distance value is smaller than or equal to a second preset value.
In another embodiment, to further ensure the privacy and security of all the data involved, all of the data may be stored in a node of a blockchain. For example, the hash values of texts and the texts that need to be retained may be stored in blockchain nodes.
It should be noted that the blockchain in the present invention is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiment of the computer readable storage medium of the present invention is substantially the same as the embodiment of the text screening method, and will not be described herein again.
It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and can certainly also be implemented by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) as described above and includes several instructions for enabling a terminal device (such as a mobile phone, a computer, an electronic device, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A text screening method applied to an electronic device, characterized in that the method comprises the following steps:
performing word segmentation operation on a first text to be screened to obtain a plurality of segmented words, extracting keywords with preset parts of speech from the segmented words, and distributing associated weights to the segmented words and the keywords;
calculating first hash values of the participles and the keywords, performing weighting operation based on the first hash values and the weights of the participles to obtain weight vectors of the participles, and performing weighting operation based on the first hash values and the weights of the keywords to obtain the weight vectors of the keywords;
accumulating the weight vectors of all the participles to obtain a first weight vector of the first text, accumulating the weight vectors of all the keywords to obtain a second weight vector of the first text, and performing dimensionality reduction operation on the first weight vector and the second weight vector to respectively obtain a first simhash value and a second simhash value of the first text;
and calculating a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space, calculating a second distance value between the second simhash value and the third simhash value when the first distance value is larger than a first preset value, and screening out the first text when the second distance value is smaller than or equal to a second preset value.
2. The method of claim 1, wherein the extracting the keywords of the predetermined part of speech from the plurality of segmented words comprises:
calculating the word frequency of each participle in the first text, calculating the IDF value and the TF value of each participle based on the word frequency, multiplying the IDF value of each participle with the TF value corresponding to each participle to obtain the TF-IDF value of each participle, judging whether preset part-of-speech keywords with the number larger than the preset number exist in the first text, if so, selecting the preset part-of-speech keywords with the number larger than the preset number based on the TF-IDF value of each participle, wherein the preset part-of-speech keywords comprise noun keywords and verb keywords.
3. The method of claim 2, wherein the determining whether the first text has more than a preset number of keywords of a preset part of speech comprises:
and when judging that the first text does not have keywords with preset parts of speech of which the number is larger than the preset number, screening out the first text, and randomly acquiring a text from a preset storage space as the first text to be screened to perform the word segmentation operation again.
4. The method of text screening according to claim 1, further comprising:
and when the first distance value is smaller than or equal to the first preset value, screening out the first text.
5. The text screening method of claim 1 or 4, further comprising:
and when the second distance value is larger than the second preset value, storing the first text to a text set to which the preset target text belongs.
6. The text screening method of claim 1, wherein when the first distance value is greater than a first preset value, the method further comprises:
and calculating a third distance value between the first simhash value and a fourth simhash value of the target text, and screening out the first text when the third distance value is smaller than or equal to a third preset value.
7. The method of claim 1, wherein the performing a word segmentation operation on the first text to be filtered to obtain a plurality of word segmentations comprises:
matching the read text with the word stock according to a forward maximum matching method to obtain a first matching result, wherein the first matching result comprises a first number of first word groups and a second number of single words;
matching the read text with the word stock according to a reverse maximum matching method to obtain a second matching result, wherein the second matching result comprises a third number of second word groups and a fourth number of single words;
if the first number is equal to the third number and the second number is smaller than or equal to the fourth number, or if the first number is smaller than the third number, taking the first matching result as a word segmentation result of the first text; and if the first number is equal to the third number and the second number is greater than the fourth number, or if the first number is greater than the third number, taking the second matching result as a word segmentation result of the first text.
8. A text screening apparatus, characterized in that the apparatus comprises:
an extraction module, configured to perform a word segmentation operation on a first text to be screened to obtain a plurality of participles, extract keywords of preset parts of speech from the plurality of participles, and assign associated weights to the participles and the keywords;
a weighting module, configured to calculate a first hash value of each participle and each keyword, perform a weighting operation based on the first hash value and the weight of each participle to obtain a weight vector of each participle, and perform a weighting operation based on the first hash value and the weight of each keyword to obtain a weight vector of each keyword;
a dimension reduction module, configured to accumulate the weight vectors of the participles to obtain a first weight vector of the first text, accumulate the weight vectors of the keywords to obtain a second weight vector of the first text, and perform a dimension reduction operation on the first weight vector and the second weight vector to obtain a first simhash value and a second simhash value of the first text, respectively;
a screening module, configured to calculate a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space, calculate a second distance value between the second simhash value and the third simhash value when the first distance value is greater than a first preset value, and screen out the first text when the second distance value is less than or equal to a second preset value.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a program executable by the at least one processor to enable the at least one processor to perform the text screening method of any one of claims 1 to 7.
10. A computer-readable storage medium, comprising a stored data area storing data created according to use of blockchain nodes and a stored program area storing a text filtering program, wherein the text filtering program, when executed by a processor, implements the steps of the text screening method according to any one of claims 1 to 7.
CN202011302193.8A 2020-11-19 2020-11-19 Text screening method, device, equipment and storage medium Pending CN112364625A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011302193.8A CN112364625A (en) 2020-11-19 2020-11-19 Text screening method, device, equipment and storage medium
PCT/CN2021/123907 WO2022105497A1 (en) 2020-11-19 2021-10-14 Text screening method and apparatus, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011302193.8A CN112364625A (en) 2020-11-19 2020-11-19 Text screening method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112364625A true CN112364625A (en) 2021-02-12

Family

ID=74532724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011302193.8A Pending CN112364625A (en) 2020-11-19 2020-11-19 Text screening method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112364625A (en)
WO (1) WO2022105497A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484266B (en) * 2016-10-18 2020-02-21 北京字节跳动网络技术有限公司 Text processing method and device
CN107066623A (en) * 2017-05-12 2017-08-18 湖南中周至尚信息技术有限公司 A kind of article merging method and device
CN108776654A (en) * 2018-05-30 2018-11-09 昆明理工大学 One kind being based on improved simhash transcription comparison methods
CN110737748B (en) * 2019-09-27 2023-08-08 成都数联铭品科技有限公司 Text deduplication method and system
CN111339166A (en) * 2020-02-29 2020-06-26 深圳壹账通智能科技有限公司 Word stock-based matching recommendation method, electronic device and storage medium
CN112364625A (en) * 2020-11-19 2021-02-12 深圳壹账通智能科技有限公司 Text screening method, device, equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022105497A1 (en) * 2020-11-19 2022-05-27 深圳壹账通智能科技有限公司 Text screening method and apparatus, device, and storage medium
CN113449073A (en) * 2021-06-21 2021-09-28 福州米鱼信息科技有限公司 Keyword selection method and system
CN113254658A (en) * 2021-07-07 2021-08-13 明品云(北京)数据科技有限公司 Text information processing method, system, medium, and apparatus
CN113254658B (en) * 2021-07-07 2021-12-21 明品云(北京)数据科技有限公司 Text information processing method, system, medium, and apparatus
CN114742042A (en) * 2022-03-22 2022-07-12 杭州未名信科科技有限公司 Text duplicate removal method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2022105497A1 (en) 2022-05-27

Similar Documents

Publication Publication Date Title
CN109614816B (en) Data desensitizing method, device and storage medium
CN108629043B (en) Webpage target information extraction method, device and storage medium
CN112364625A (en) Text screening method, device, equipment and storage medium
US10095780B2 (en) Automatically mining patterns for rule based data standardization systems
CN110321553B (en) Short text topic identification method and device and computer readable storage medium
CN109145216A (en) Network public-opinion monitoring method, device and storage medium
CN107704501B (en) Method and system for identifying homologous binary file
CN111814472B (en) Text recognition method, device, equipment and storage medium
CN109189888B (en) Electronic device, infringement analysis method, and storage medium
CN113095076A (en) Sensitive word recognition method and device, electronic equipment and storage medium
CN109299235B (en) Knowledge base searching method, device and computer readable storage medium
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN111414375A (en) Input recommendation method based on database query, electronic device and storage medium
CN113449187A (en) Product recommendation method, device and equipment based on double portraits and storage medium
CN111339166A (en) Word stock-based matching recommendation method, electronic device and storage medium
US20230205755A1 (en) Methods and systems for improved search for data loss prevention
CN110427453B (en) Data similarity calculation method, device, computer equipment and storage medium
CN112883730B (en) Similar text matching method and device, electronic equipment and storage medium
CN113886708A (en) Product recommendation method, device, equipment and storage medium based on user information
CN112783825A (en) Data archiving method, data archiving device, computer device and storage medium
CN113127621A (en) Dialogue module pushing method, device, equipment and storage medium
CN112395881A (en) Material label construction method and device, readable storage medium and electronic equipment
CN111985212A (en) Text keyword recognition method and device, computer equipment and readable storage medium
CN113282763B (en) Text key information extraction device, equipment and storage medium
US20160117522A1 (en) Probabilistic surfacing of potentially sensitive identifiers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination