CN114385777A - Text data processing method and device, computer equipment and storage medium - Google Patents

Text data processing method and device, computer equipment and storage medium Download PDF

Info

Publication number
CN114385777A
CN114385777A (application CN202210041129.1A)
Authority
CN
China
Prior art keywords
text
target text
target
candidate
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210041129.1A
Other languages
Chinese (zh)
Inventor
李鹏宇
李剑锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210041129.1A priority Critical patent/CN114385777A/en
Publication of CN114385777A publication Critical patent/CN114385777A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/319Inverted lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a text data processing method, a text data processing device, a computer device and a storage medium, wherein the method comprises the following steps: acquiring a target text and determining the length of the target text; if the length of the target text is greater than a preset length threshold, performing abstract extraction on the target text to obtain a target text whose length is less than or equal to the length threshold; based on a pre-established inverted index of the candidate texts, retrieving candidate texts related to the keywords of the target text from the candidate texts, where establishing the inverted index of the candidate texts comprises performing statistical calculation on the candidate texts with an inverted-index algorithm to form a word-to-postings table and storing the result in a database; and determining whether the target text and a candidate text coincide based on the Hamming distance between the target text and each candidate text. The method can improve the accuracy of text duplicate checking.

Description

Text data processing method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a text data processing method and apparatus, a computer device, and a storage medium.
Background
The development of information technology, accompanied by the tremendous enrichment of data, inevitably brings problems to information processing systems, such as large amounts of repeated text data.
At present, when checking large-scale text for duplicates, there are two main methods:
(1) Represent the text as a fixed-length vector based on a simhash code, find repeated text using text similarity, and accelerate the duplicate-judging operation with an index. Because simhash codes represent short texts poorly, deduplication of short texts (texts with few words, such as microblog posts, comments, and notes) cannot be supported well.
(2) Represent the text as a fixed-length distributed vector based on a neural language model, and accelerate retrieval with an index. Neural language models capture the semantics and grammar of text well and perform well in similarity calculation and duplicate judgment for both long and short texts. However, such schemes are relatively expensive to build or operate: (a) if the index is constructed from word-segmentation results, considerable human effort must be spent on designing the search strategy and building the knowledge base before a satisfactory level of effectiveness and efficiency is reached; (b) if the index is constructed from the distributed vectors, the texts must be grouped in advance by clustering, which challenges the index's update algorithm and performance; (c) neural language models are computationally expensive and cannot provide real-time computation when processing long texts.
Therefore, existing text duplicate-checking schemes either cannot support deduplication of short texts well, or incur high model construction and operating costs, so that online duplicate-checking computation is slow and real-time computation cannot be provided when long texts are present.
Disclosure of Invention
The application provides a text data processing method, a text data processing device, computer equipment and a storage medium, and aims to realize online duplicate checking of long texts and short texts and improve duplicate checking speed.
A first aspect provides a text data processing method, the method comprising:
acquiring a target text and determining the length of the target text;
if the length of the target text is larger than a preset length threshold, performing abstract extraction on the target text to obtain the target text of which the length is smaller than or equal to the length threshold;
based on a pre-established inverted index of candidate texts, retrieving candidate texts related to keywords from the candidate texts according to the keywords of the target texts; wherein the pre-established inverted index of candidate texts comprises: performing statistical calculation on the candidate text by using an inverted index algorithm to form a word-inverted arrangement table, and storing the result into a database;
determining whether the target text and the candidate text coincide based on a hamming distance between the target text and each of the candidate texts.
In some embodiments, after determining whether the target text and the candidate text coincide based on the hamming distance between the target text and each of the candidate texts, further comprising:
and if the target text is not overlapped with any candidate text, generating an inverted index of the target text.
In some embodiments, the abstracting the target text to obtain the target text with the length less than or equal to the length threshold includes:
performing word segmentation on the target text, and calculating the weight of each sentence based on a textrank algorithm;
and splicing the sentences with the highest weight according to the sequence of the weights from large to small to obtain the target text with the length less than or equal to the length threshold.
In some embodiments, the retrieving, from the candidate texts based on the pre-established inverted index of the candidate texts according to the keywords of the target text, candidate texts related to the keywords includes:
carrying out vector representation on the target text to obtain a target vector;
based on the keywords, performing word segmentation on the target text;
for the segmented target text, starting from the first word of the segmented target text, taking words with a preset window length, moving with a preset step length until the last word of the target text, and obtaining a plurality of continuous subsequences, wherein each subsequence comprises an inverted file of the target text;
and acquiring related candidate texts of the target text based on the inverted indexes of the candidate texts, wherein the candidate texts have the most inverted files which are the same as the target text.
In some embodiments, the vector representing the target text to obtain a target vector includes:
inputting the target text into a pre-trained BERT model to perform vector conversion to obtain a first vector;
converting the first vector into a target vector in a preset format based on a random projection method;
the preset format of the target vector is that the length is a preset length, and the value of an element is 0 or 1.
In some embodiments, the transforming the first vector into a target vector in a preset format based on a stochastic projection method includes:
performing average pooling on the output of the last Transformer layer of the pre-trained BERT model, and randomly generating random vectors p_1, p_2, ..., p_i of a preset length based on a normal distribution;
comparing each value in the first vector, position by position, with the value in each random vector to obtain a one-dimensional 0/1 vector q_i of the preset length (the comparison formula appears in the original only as an image and is not reproduced here);
summing the resulting vectors to obtain v'_1;
calculating the target vector v_2 from v'_1 (the formula appears in the original only as an image and is not reproduced here).
In some embodiments, said determining whether said target text is repeated with said candidate text based on a hamming distance between each of said candidate texts and said target text comprises:
respectively representing each candidate text as a target vector with a preset length;
for each of the candidate texts, calculating a hamming distance between a target vector of the target text and a target vector of the candidate text;
and if the hamming distance is smaller than a set hamming distance threshold value, determining that the candidate text is a repeated text of the target text.
A second aspect provides a text data processing apparatus comprising:
the input unit is used for acquiring a target text and determining the length of the target text;
the cutting unit is used for extracting the abstract of the target text to obtain the target text of which the length is less than or equal to a preset length threshold value if the length of the target text is greater than the preset length threshold value;
the retrieval unit is used for retrieving candidate texts related to the keywords from the candidate texts according to the keywords of the target texts based on the pre-established inverted indexes of the candidate texts; wherein the pre-established inverted index of candidate texts comprises: performing statistical calculation on the candidate text by using an inverted index algorithm to form a word-inverted arrangement table, and storing the result into a database;
a determination unit configured to determine whether the target text and the candidate text coincide with each other based on a hamming distance between the target text and each of the candidate texts.
A third aspect provides a computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the text data processing method described above.
A fourth aspect provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the text data processing method described above.
According to the technical scheme, a target text is first obtained and its length determined. If the length of the target text is greater than a preset length threshold, abstract extraction is performed on the target text to obtain a target text whose length is less than or equal to the threshold. Then, based on the pre-established inverted index of the candidate texts, candidate texts related to the keywords of the target text are retrieved from the candidate texts, and whether the target text and a candidate text coincide is determined based on the hamming distance between the target text and each candidate text. Thus, in the present application, abstract extraction turns long texts (length greater than the preset threshold) into short texts, so the method adapts to scenarios where long and short texts coexist. Retrieving keyword-related candidate texts through the pre-established inverted index improves the search engine's full-text search efficiency, so both initialization and online calculation are fast. Determining coincidence from the hamming distance between the target text and each candidate text is, in essence, comparing the feature vector of the target text with the feature vector of each candidate text, which makes text representation a data-driven task and greatly reduces the workload of manually constructing features.
Drawings
FIG. 1 is a diagram of an implementation environment of a method for processing text data, as provided in one embodiment;
FIG. 2 is a flowchart of a text data processing method in one embodiment;
FIG. 3 is another flow diagram of a method for text data processing in one embodiment;
FIG. 4 is a block diagram showing a configuration of a text data processing apparatus according to an embodiment;
FIG. 5 is a block diagram showing an internal configuration of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that, as used herein, the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another.
Fig. 1 is a diagram of an implementation environment of a text duplication checking processing method provided in an embodiment, as shown in fig. 1, in the implementation environment, a computer device 110 and a terminal 120 may be included.
The computer device 110 is the data provider device and has an interface, which may be, for example, an API (Application Programming Interface). The terminal 120 is the party that submits duplicate-checking requests and has an interface configuration screen; during text duplicate checking, the user can submit a duplicate-checking request through the terminal 120 so that the computer device 110 performs the subsequent duplicate-checking processing.
It should be noted that the terminal 120 and the computer device 110 may be, but are not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The computer device 110 and the terminal 120 may be connected through Bluetooth, USB (Universal Serial Bus), or other communication connection methods, which is not limited herein.
FIG. 5 is a diagram showing an internal configuration of a computer device according to an embodiment. As shown in fig. 5, the computer device may include a processor, a storage medium, a memory, and a network API interface connected by a system bus. The storage medium of the computer device stores an operating system, a database, and computer-readable instructions; the database can store control information sequences, and the computer-readable instructions, when executed by the processor, cause the processor to implement a text duplicate-checking method. The processor of the computer device provides calculation and control capability and supports the operation of the whole computer device. The memory of the computer device may store computer-readable instructions that, when executed by the processor, cause the processor to perform the text duplicate-checking method. The network API interface of the computer device is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
As shown in fig. 3, in an embodiment, a text data processing method is provided, which may be applied to the computer device 110, and specifically includes the following steps:
step 201, obtaining a target text and determining the length of the target text;
the target text is the text D which needs to be determined heavily. The target text is expressed as a target vector with a preset length, and in order to judge whether the target text and the candidate text are repeated texts, the target text needs to be converted into a preset format, that is, the text to be repeated is a short text, and the length is less than or equal to a preset length threshold.
It can be understood that, whether the target text is a long text or a short text is judged, and if the length of the target text is greater than the length threshold, the target text is the long text; and if the length of the target text is less than or equal to the length threshold value, the target text is short text. For example, the text length is long text when it exceeds 510, and short text when it is equal to or less than 510.
Step 202, if the length of the target text is greater than a preset length threshold, performing abstract extraction on the target text to obtain the target text with the length less than or equal to the length threshold.
In general, a text abstract can accurately describe the central content of the original text in concise, semantically coherent language. In this step, a short text is obtained by abstracting the long text, and this short text is used as the target text for duplicate checking.
In some embodiments, in the step 202, performing abstract extraction on the target text to obtain the target text with a length less than or equal to the length threshold may include:
step 202a, performing word segmentation on a target text, and calculating the weight of each sentence based on a textrank algorithm;
and step 202b, splicing the sentences with the highest weight according to the sequence of the weights from large to small to obtain a target text with the length smaller than or equal to a length threshold value.
In this embodiment, the TextRank algorithm is based on PageRank and generates keywords and summaries for a text. It divides the text into composition units (words and sentences), builds a graph model, ranks the important components of the text with a voting mechanism, and extracts keywords and a summary using only the information in the single text.
Rewriting the PageRank formula with edge weights gives the TextRank formula:

WS(Vi) = (1 - d) + d * Σ_{Vj ∈ In(Vi)} [ w_ji / Σ_{Vk ∈ Out(Vj)} w_jk ] * WS(Vj)

where d is a damping coefficient with a value range of 0 to 1, representing the probability of jumping from a given point in the graph to any other point; it is generally set to 0.85. When the TextRank algorithm calculates the score of each point in the graph, arbitrary initial values are assigned to the points and the calculation is iterated recursively until convergence, i.e., until the error rate of every point in the graph is less than a given limit value, generally 0.0001.

In terms of the PageRank idea: if a word appears after many words, the word is more important; and if a word follows a word with a high TextRank value, its own TextRank value increases accordingly.

For summarization, each sentence in the text is regarded as a node; if two sentences are similar, an undirected weighted edge is placed between their corresponding nodes. Sentence similarity is calculated as

Similarity(Si, Sj) = |{ wk : wk ∈ Si and wk ∈ Sj }| / (log|Si| + log|Sj|)

where Si and Sj are two sentences and wk represents a word in a sentence: the numerator is the number of words that appear in both sentences, and the denominator is the sum of the logarithms of the sentence lengths. Using logarithms in the denominator offsets the advantage of long sentences in similarity calculation (long sentences are more likely to contain the same word). According to this similarity formula, the similarity between any two nodes is obtained by cyclic iterative computation, a node connection graph is constructed, the PR values are finally calculated, and the sentences corresponding to the nodes with the highest PR values are selected as the abstract.
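The sentence ranking described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: sentences are passed in pre-segmented as word lists, the helper names (`similarity`, `textrank_summary`) are assumptions, and d = 0.85 with a 0.0001 tolerance follows the values given above.

```python
# Minimal sketch of TextRank-style sentence ranking for summary extraction.
import math

def similarity(s1, s2):
    # |common words| / (log|s1| + log|s2|), as in the similarity formula above
    common = len(set(s1) & set(s2))
    denom = math.log(len(s1)) + math.log(len(s2))
    return common / denom if denom > 0 else 0.0

def textrank_summary(sentences, top_k=1, d=0.85, tol=1e-4, max_iter=100):
    n = len(sentences)
    # undirected weighted graph of pairwise sentence similarities
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n  # arbitrary initial values
    for _ in range(max_iter):
        new = [(1 - d) + d * sum(w[j][i] / out_sum[j] * scores[j]
                                 for j in range(n) if out_sum[j] > 0)
               for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_k]]
```

Sentences that share words with other sentences accumulate higher scores, so an isolated sentence is ranked last and excluded from the summary.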
Step 203, based on the pre-established inverted index of the candidate text, according to the keywords of the target text, retrieving candidate texts related to the keywords from the candidate texts; the pre-established inverted index of the candidate text comprises the following steps: and performing statistical calculation on the candidate texts by using an inverted index algorithm to form a word-inverted list, and storing the result into a database.
The inverted index is a technique for looking up records according to the value of an attribute and is usually used in the field of information retrieval. As the most commonly used data structure for full-text search, it stores a mapping from a word to its storage locations in a text or a group of texts; through the inverted index, the list of texts containing a keyword can be acquired quickly from the keyword, and a search result can be generated and fed back to the user, which improves the full-text search efficiency of the search engine.
As shown in table 1 below, the text list is formed by numbering the original text data (assigning each text a DocID).

[Table 1 appears in the original as an image and is not reproduced here.]

As shown in table 2 below, the data in each text is segmented to obtain entries, and the entries are numbered to create an index over them. The fields are:
word ID: the number recorded for each word;
word: the corresponding word;
text frequency: how many texts in the text set contain the word;
inverted arrangement table (postings list): contains the DocId and other necessary information;
DocId: the id of a text in which the word occurs;
TF: the number of times the word appears in that text;
POS: the position(s) at which the word appears in the text.
Taking the word "franchise" as an example: its word number is 6 and its text frequency is 3, which means that three texts in the whole text set contain the word. Its inverted arrangement table is {(2; 1; <4>), (3; 1; <7>), (5; 1; <5>)}, which means that the word appears in texts 2, 3, and 5, once in each text; in the first of these texts the POS of "franchise" is 4, i.e., the fourth word of that text is "franchise", and the other entries are read similarly.
[Table 2 appears in the original as an image and is not reproduced here; the row for "franchise" described above reads: word ID 6, text frequency 3, inverted arrangement table {(2; 1; <4>), (3; 1; <7>), (5; 1; <5>)}.]
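The word-to-postings structure described by these fields can be sketched as follows. The function name and the whitespace tokenizer are illustrative assumptions (the patent's segmenter for Chinese text is not specified); each posting keeps the (DocId, TF, POS) fields listed above.

```python
# Sketch of building a word-to-postings ("inverted arrangement") table.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {word: [(doc_id, tf, positions), ...]}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        # naive whitespace segmentation; positions are 1-based as in the example
        for pos, word in enumerate(docs[doc_id].split(), start=1):
            positions[word].append(pos)
        for word, pos_list in positions.items():
            index[word].append((doc_id, len(pos_list), pos_list))
    return dict(index)
```

For example, with `docs = {1: "a b", 2: "b c b"}`, the posting list for `"b"` records that it occurs once in text 1 (position 2) and twice in text 2 (positions 1 and 3).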
In some embodiments, in step 203, retrieving candidate texts related to the keywords from the candidate texts according to the keywords of the target text based on the pre-established inverted index of the candidate texts may include:
step 203a, performing vector representation on the target text to obtain a target vector;
in some embodiments, this step 203a may comprise:
step 203a1, inputting a target text into a pre-trained BERT model for vector conversion to obtain a first vector;
the method comprises the steps of coding a text by using a pre-trained BERT, averaging and pooling (mean pooling) output of a last layer of transform to obtain a first vector with a preset length, and using the first vector as a primary representation of a target text.
Step 203a2, converting the first vector into a target vector with a preset format based on a random projection method;
this step 203a2 may include:
the output of the last layer of Transformer of the pre-trained BERT model is subjected to average pooling, and a random vector p with the preset length is randomly generated based on normal distribution1,p2,……pi
Comparing the each vector value in the first vector with the each vector value in the random vector according to the position to obtain a one-dimensional vector q with a preset lengthiWherein, in the step (A),
Figure BDA0003470260030000121
summing the random vectors to obtain v'1
Calculating to obtain a target vector v2Wherein
Figure BDA0003470260030000122
The preset format of the target vector is that the length is a preset length, and the value of an element is 0 or 1.
In a specific application scenario, the format of the target vector is: (1) a length of 768; (2) the value of the element is 0 or 1.
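A hedged sketch of the random projection described above. The source gives the exact formulas only as images, so this follows one plausible reading: each bit of q_i compares the first vector with a normally distributed random vector position by position, the q vectors are summed into v'_1, and v'_1 is binarized by majority vote (the majority-vote threshold is an assumption). The vector length is small here for illustration; the application scenario above uses 768.

```python
# Hedged sketch of random-projection binarization to a 0/1 target vector.
import random

def to_binary_vector(first_vector, num_random=16, seed=0):
    rng = random.Random(seed)
    dim = len(first_vector)
    # random vectors p_1 ... p_i drawn from a normal distribution
    ps = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_random)]
    # q_i[j] = 1 if first_vector[j] > p_i[j] else 0  (assumed comparison)
    qs = [[1 if first_vector[j] > p[j] else 0 for j in range(dim)] for p in ps]
    # v'_1: element-wise sum of the q vectors
    v1_prime = [sum(q[j] for q in qs) for j in range(dim)]
    # v_2: majority-vote binarization (assumed threshold)
    return [1 if s * 2 > num_random else 0 for s in v1_prime]
```

A strongly positive coordinate exceeds almost every normal sample and maps to 1; a strongly negative coordinate maps to 0.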
Step 203b, segmenting the target text based on the keywords;
The target vector is segmented into consecutive subsequences of a target length: subs = (v_sub1, v_sub2, ..., v_subn, ..., v_subN). For example, the text representation is segmented into K contiguous subsequences (the exact subsequence-length expression appears in the original only as an image).
Step 203c, for the segmented target text, starting from the first word of the segmented target text, taking words by a preset window length, and moving by a preset step length until the last word of the target text to obtain a plurality of continuous subsequences, wherein each subsequence comprises an inverted file of the target text;
wherein, the preset window length can be the number of words; the preset step size may also be the number of words.
Step 203d, obtaining the related candidate text of the target text based on the inverted index of the candidate text, wherein the candidate text has the most inverted files same as the target text.
The retrieval procedure is as follows: traverse the subsequences of v_2 and, for the nth subsequence, query the nth inverted index using v_subn as the key; the collected values form a union, which is the search result cand-list.
It can be understood that after the query word of the target text is obtained, the text identifier and/or the paragraph identifier corresponding to the keyword can be quickly found according to the pre-established inverted index, the text and/or the paragraph corresponding to the keyword is ordered according to the relevance with the query word, and the text and/or the paragraph corresponding to the query word is generated into a list according to the ordering result and sent to the terminal of the user for display.
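The subsequence-keyed retrieval just described can be sketched as follows; representing each of the K inverted indexes as a Python dict from a subsequence tuple to a set of document ids is an illustrative assumption.

```python
# Sketch of retrieving candidates via per-subsequence inverted indexes.
def split_subsequences(v2, k):
    """Cut the binary vector v2 into k consecutive equal-length pieces."""
    size = len(v2) // k
    return [tuple(v2[i * size:(i + 1) * size]) for i in range(k)]

def retrieve_candidates(v2, indexes):
    """indexes: list of k dicts mapping subsequence -> set of doc ids.
    Returns the union of candidates whose nth subsequence matches (cand-list)."""
    cand = set()
    for n, sub in enumerate(split_subsequences(v2, len(indexes))):
        cand |= indexes[n].get(sub, set())
    return cand
```

Any candidate that matches the query on at least one subsequence enters the union, which keeps near-duplicates retrievable even when other bits differ.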
And step 204, determining whether the target text and the candidate text are overlapped or not based on the hamming distance between the target text and each candidate text.
In the embodiment, the text to be detected is filtered and screened based on the hamming distance between each candidate text and the target text and by setting the hamming distance threshold, so that the range of the text to be detected is narrowed, that is, objects participating in calculation of the hamming distance are reduced, and the purpose of improving the detection efficiency is achieved.
In some embodiments, step 204 may include:
step 204a, representing each candidate text as a target vector with a preset length;
the target vector with the preset length is in the same format as the vector format of the target text.
Step 204b, calculating the Hamming distance between the target vector of the target text and the target vector of the candidate text aiming at each candidate text;
and determining whether the target text and the candidate text are overlapped (have the same semantic meaning) according to the hamming distance between the simhash value of the target text and the simhash value of the candidate text.
And step 204c, if the hamming distance is smaller than the set hamming distance threshold, determining that the candidate text is a repeated (quasi-similar) text of the target text.
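Steps 204b and 204c can be sketched as follows; the default threshold value of 3 is an illustrative assumption, since the patent leaves the hamming distance threshold as a settable parameter.

```python
# Sketch of the duplicate decision on binary target vectors.
def hamming_distance(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def is_duplicate(target_vec, cand_vec, threshold=3):
    """The candidate is a repeated text when the distance is below threshold."""
    return hamming_distance(target_vec, cand_vec) < threshold
```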
In an application scene, whether the target text and the candidate text are semantically identical is determined according to one or a combination of the hamming distance between the simhash value of the target text and the simhash value of the candidate text, the contact degree between the named entity of the target text and the named entity of the candidate text, the contact degree between the keywords of the target text and the keywords of the candidate text, and the contact degree between the target text classification of the target text and the target text classification of the candidate text.
Specific examples are:
the cosine similarity between the target text and the candidate text must be greater than 0.8; otherwise, the candidate is filtered out;
if the simhash (hamming) distance between the target text and the candidate text is less than 20, the candidate is adopted directly;
if the simhash (hamming) distance between the target text and the candidate text is greater than or equal to 20, filtering proceeds as follows:
keywords are extracted from the title and the abstract using the rank mode of LAC (the Baidu lexical analyzer); keywords with a score of 3 are core keywords, and keywords with a score of 2 are important keywords.
If the newly arrived target text contains core keywords and a compared candidate text contains any one of them, the candidate is adopted directly;
if the newly arrived target text has no core keywords, the important keywords are compared:
if a compared candidate text contains any 2 of the important keywords, it is adopted directly.
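The filtering cascade in this example can be sketched as follows. The LAC keyword extraction is not reproduced; the core and important keyword sets are assumed to be pre-extracted and passed in, and the function name is illustrative.

```python
# Hedged sketch of the cosine / simhash-distance / keyword filtering cascade.
def passes_filter(cos_sim, simhash_dist, target_kw, cand_kw):
    """target_kw / cand_kw: (core_keywords, important_keywords) as sets."""
    if cos_sim <= 0.8:
        return False                      # filtered out
    if simhash_dist < 20:
        return True                       # adopted directly
    t_core, t_imp = target_kw
    c_core, c_imp = cand_kw
    if t_core:
        return len(t_core & c_core) >= 1  # any shared core keyword
    return len(t_imp & c_imp) >= 2        # else any 2 shared important keywords
```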
In this step, the inverted index, a technique for looking up records according to the value of an attribute that is commonly used in the field of information retrieval, accelerates the full-text search of the search engine.
In some embodiments, after determining whether the target text and the candidate text coincide based on the hamming distance between the target text and each candidate text, the method further comprises:
and if the target text is not overlapped with any candidate text, generating an inverted index of the target text.
Specifically, N inverted indexes are constructed, and the n-th inverted index is constructed as follows: the n-th contiguous subsequence vsub_n of the target text's representation serves as the key, and the value of that key is the set of texts whose n-th subsequence of their representation equals vsub_n.
It can be understood that, as texts are continuously brought in, the content of the inverted index is updated accordingly; that is, the process of creating the inverted index may include both creating new indexes and updating existing ones.
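The N-index construction and update just described can be sketched as follows. This is a simplified illustration: the class name, the fixed window and step, and the use of a 0/1 fingerprint string as the text representation are all assumptions:

```python
from collections import defaultdict

# Illustrative sketch: the n-th inverted index maps the n-th contiguous
# subsequence of a text's 0/1 representation (the key) to the set of texts
# whose n-th subsequence equals that key (the value).
def split_subsequences(bits: str, window: int, step: int) -> list:
    return [bits[i:i + window] for i in range(0, len(bits) - window + 1, step)]

class InvertedIndexes:
    def __init__(self, window: int = 16, step: int = 16):
        self.window, self.step = window, step
        self.indexes = []  # one {subsequence: set(doc ids)} dict per position n

    def add(self, doc_id: str, bits: str) -> None:
        """Create or update the indexes with a newly accepted text."""
        subs = split_subsequences(bits, self.window, self.step)
        while len(self.indexes) < len(subs):
            self.indexes.append(defaultdict(set))
        for n, sub in enumerate(subs):
            self.indexes[n][sub].add(doc_id)

    def candidates(self, bits: str) -> set:
        """Texts sharing at least one positional subsequence with the query."""
        hits = set()
        for n, sub in enumerate(split_subsequences(bits, self.window, self.step)):
            if n < len(self.indexes) and sub in self.indexes[n]:
                hits |= self.indexes[n][sub]
        return hits
```

Adding a text both creates any missing positional indexes and updates existing ones, matching the create-and-update behavior described above.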
The text data processing method judges whether a document in the data stream has already been recorded, and writes the judgment result into the corresponding field of the data so that a downstream module can decide whether to use the document. For example, based on the duplication judgment, the index is updated or the behavior of downstream modules is adjusted.
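Putting the whole flow together, the judge-and-annotate behavior might look like the sketch below, where `vectorize` and `retrieve` stand in for the representation and retrieval steps, and the `is_duplicate` field name and the threshold are assumptions:

```python
# Hypothetical end-to-end sketch: represent the document, fetch candidates,
# compare 0/1 fingerprints by Hamming distance, and write the verdict onto
# the record so downstream modules can decide whether to use the document.
def judge_and_annotate(record, vectorize, retrieve, threshold=20):
    bits = vectorize(record["text"])           # 0/1 fingerprint (list of ints)
    duplicates = [
        cand for cand in retrieve(bits)
        if sum(a != b for a, b in zip(bits, cand["bits"])) < threshold
    ]
    record["is_duplicate"] = bool(duplicates)  # field name is illustrative
    return record
```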
As shown in fig. 4, in one embodiment, a text data processing apparatus is provided, which may be integrated in the computer device 110, and may specifically include:
An input unit 411, configured to obtain a target text and determine a length of the target text;
a cutting unit 412, configured to, if the length of the target text is greater than a preset length threshold, perform abstract extraction on the target text to obtain a target text with a length less than or equal to the length threshold;
a retrieving unit 413, configured to retrieve, based on a pre-established inverted index of candidate texts, candidate texts related to the keywords of the target text from the candidate texts; wherein pre-establishing the inverted index of candidate texts comprises: performing statistical calculation on the candidate texts by using an inverted index algorithm to form a word inverted list, and storing the result in a database;
a determining unit 414, configured to determine whether the target text and the candidate text coincide with each other based on a hamming distance between the target text and each of the candidate texts.
FIG. 5 is a diagram of the internal structure of a computer device in one embodiment. As shown in fig. 5, the computer device includes a processor, a storage medium, a memory, and a network API interface connected by a system bus. The storage medium of the computer device stores an operating system, a database, and computer readable instructions; the database may store control information sequences, and the computer readable instructions, when executed by the processor, cause the processor to implement a text data processing method. The processor of the computer device provides computing and control capability and supports the operation of the entire computer device. The memory of the computer device may store computer readable instructions that, when executed by the processor, cause the processor to perform the text data processing method. The network API interface of the computer device is used to connect and communicate with a terminal. Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is proposed, the computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring a target text, and expressing the target text as a target vector with a preset length; retrieving candidate texts of the target text based on the target vector and the inverted index information of the candidate texts in the inverted index table; the reverse index table stores reverse index information of at least one candidate text, and the reverse index information of the candidate text comprises: key and key value of the candidate text; determining whether the target text is repeated with the candidate text based on a hamming distance between each candidate text and the target text.
In one embodiment, a storage medium is provided that stores computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: acquiring a target text, and expressing the target text as a target vector with a preset length; retrieving candidate texts of the target text based on the target vector and the inverted index information of the candidate texts in the inverted index table; the reverse index table stores reverse index information of at least one candidate text, and the reverse index information of the candidate text comprises: key and key value of the candidate text; determining whether the target text is repeated with the candidate text based on a hamming distance between each candidate text and the target text.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of processing text data, the method comprising:
acquiring a target text and determining the length of the target text;
if the length of the target text is larger than a preset length threshold, performing abstract extraction on the target text to obtain the target text of which the length is smaller than or equal to the length threshold;
based on a pre-established inverted index of candidate texts, retrieving, from the candidate texts, candidate texts related to the keywords of the target text; wherein pre-establishing the inverted index of candidate texts comprises: performing statistical calculation on the candidate texts by using an inverted index algorithm to form a word inverted list, and storing the result in a database;
determining whether the target text and the candidate text coincide based on a hamming distance between the target text and each of the candidate texts.
2. The text data processing method according to claim 1, further comprising, after determining whether the target text and the candidate texts coincide based on a hamming distance between the target text and each of the candidate texts:
and if the target text is not overlapped with any candidate text, generating an inverted index of the target text.
3. The method of claim 1, wherein the abstracting the target text to obtain the target text with a length less than or equal to the length threshold comprises:
performing word segmentation on the target text, and calculating the weight of each sentence based on a textrank algorithm;
and splicing the sentences with the highest weight according to the sequence of the weights from large to small to obtain the target text with the length less than or equal to the length threshold.
4. The method according to claim 1, wherein the retrieving candidate texts related to the keywords from the candidate texts according to the keywords of the target text based on the pre-established inverted index of the candidate texts comprises:
carrying out vector representation on the target text to obtain a target vector;
based on the keywords, performing word segmentation on the target text;
for the segmented target text, starting from the first word of the segmented target text, taking words within a preset window length and moving by a preset step length until the last word of the target text, so as to obtain a plurality of contiguous subsequences, each subsequence constituting an inverted file of the target text;
and acquiring the related candidate texts of the target text based on the inverted indexes of the candidate texts, the related candidate texts being those sharing the most identical inverted files with the target text.
5. The method according to claim 4, wherein the vector-representing the target text to obtain a target vector comprises:
inputting the target text into a pre-trained BERT model to perform vector conversion to obtain a first vector;
converting the first vector into a target vector in a preset format based on a random projection method;
the preset format of the target vector is that the length is a preset length, and the value of an element is 0 or 1.
6. The text data processing method according to claim 5, wherein the transforming the first vector into a target vector in a preset format based on a stochastic projection method includes:
performing average pooling on the output of the last layer of Transformer of the pre-trained BERT model, and randomly generating a random vector p with a preset length based on normal distribution1,p2,……pi
Comparing by bit the firstObtaining a one-dimensional vector q with a preset length by the each vector value in the vector and the each vector value in the random vectoriWherein, in the step (A),
Figure FDA0003470260020000031
summing the random vectors to obtain v'1
Calculating to obtain a target vector v2Wherein
Figure FDA0003470260020000032
7. The text data processing method of claim 1, wherein the determining whether the target text is repeated with the candidate text based on the hamming distance between each of the candidate texts and the target text comprises:
respectively representing each candidate text as a target vector with a preset length;
for each of the candidate texts, calculating a hamming distance between a target vector of the target text and a target vector of the candidate text;
and if the hamming distance is smaller than a set hamming distance threshold value, determining that the candidate text is a repeated text of the target text.
8. A text data processing apparatus, characterized by comprising:
the input unit is used for acquiring a target text and determining the length of the target text;
the cutting unit is used for extracting the abstract of the target text to obtain the target text of which the length is less than or equal to a preset length threshold value if the length of the target text is greater than the preset length threshold value;
the retrieval unit is used for retrieving, based on a pre-established inverted index of candidate texts, candidate texts related to the keywords of the target text from the candidate texts; wherein pre-establishing the inverted index of candidate texts comprises: performing statistical calculation on the candidate texts by using an inverted index algorithm to form a word inverted list, and storing the result in a database;
a determination unit configured to determine whether the target text and the candidate text coincide with each other based on a hamming distance between the target text and each of the candidate texts.
9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to carry out the steps of the text data processing method according to any one of claims 1 to 7.
10. A storage medium storing computer readable instructions which, when executed by a processor, cause the processor to carry out the steps of the text data processing method according to any one of claims 1 to 7.
CN202210041129.1A 2022-01-14 2022-01-14 Text data processing method and device, computer equipment and storage medium Pending CN114385777A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210041129.1A CN114385777A (en) 2022-01-14 2022-01-14 Text data processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114385777A true CN114385777A (en) 2022-04-22

Family

ID=81202314


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination