CN113139374A - Method, system, equipment and storage medium for querying marks of document similar paragraphs - Google Patents
- Publication number
- CN113139374A (application number CN202110388914.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- length
- similarity
- document
- paragraphs
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method, a system, a device and a storage medium for marking and querying similar paragraphs of documents. The method comprises the following steps: judging whether the length of the marked text is greater than a first length threshold; if the length of the marked text is smaller than the first length threshold, matching the marked text against the documents in the document library to obtain and output a query result; or, if the length of the marked text is greater than the first length threshold, segmenting the documents in the document library into paragraphs and then obtaining and outputting a query result through similarity comparison. The method classifies marked texts into different types according to their length and adopts a different matching strategy for each type, so that the query result is more accurate.
Description
Technical Field
The invention relates to the technical field of data analysis, in particular to a method, a system, equipment and a storage medium for marking and inquiring similar paragraphs of a document.
Background
Today, many businesses have large volumes of document text data, including product manuals, business contracts, deployment documentation, and so forth, which are highly specialized documents. In order to facilitate unified management, many companies centralize the document data and provide intelligent services such as query, reading, recommendation and the like. The automatic query matching service of the text similar paragraphs can help the user to better utilize the text resources in the document library, and the value of the document resources is improved. The basic functions of the automatic query matching service for text similar paragraphs are as follows: the user manually marks a segment of text during reading, and after marking, the system automatically matches segments similar to the marked segment content from all the documents in the document library in the background by using NLP and other related technologies and returns the segments to the user. The user can find paragraphs or texts with similar contents as references according to the matching result.
Most prior-art solutions resemble text duplicate checking. One example is SimHash, which is computed roughly as follows:
1. feature extraction: extracting features, and the weight corresponding to each feature, from the document;
2. hashing: hashing each feature to generate its hash value;
3. hash value weighting: processing each bit of the feature hash value in turn: if the bit is 1, replacing it with the weight; otherwise, replacing it with the negative weight;
4. summation: summing the weighted results of all feature hashes bit by bit, and then binarizing the sums bit by bit: a value greater than 0 becomes 1, otherwise it becomes 0. The result is the final SimHash value.
And after obtaining the SimHash values of the documents, calculating the Hamming distance of the SimHash values of the two documents as the similarity of the two documents.
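The four SimHash steps and the Hamming-distance comparison can be sketched as follows. This is a minimal illustration of the prior-art technique and not code from the patent: the function names `simhash` and `hamming_distance`, the 64-bit width and the use of MD5 as the feature hash are all assumptions made for the example.

```python
import hashlib


def simhash(weighted_features, bits=64):
    """Return a SimHash fingerprint for a mapping of feature -> weight.

    Assumptions: `bits` is a multiple of 8 and at most 128, and MD5 is
    used as the feature hash (any well-mixed hash would do).
    """
    totals = [0.0] * bits
    for feature, weight in weighted_features.items():
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A 1 bit contributes +weight, a 0 bit contributes -weight.
            totals[i] += weight if (h >> i) & 1 else -weight
    # Binarize: a positive sum becomes 1, anything else becomes 0.
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

A smaller Hamming distance between two fingerprints indicates more similar documents. Because the fingerprint is built from bit-level votes over many features, it becomes unstable when a passage contains only a handful of features, which is one way to see the weakness on short text described in the next paragraph.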
However, SimHash is an algorithm that Google uses to deduplicate massive numbers of web pages, and it is suited to similarity calculation over whole documents. For shorter text passages, SimHash often performs poorly. In addition, SimHash does not take the semantic information of the text into account. In a language such as Chinese, where expression is very flexible, two passages may share only one or a few concepts rather than large stretches of similar content, and in that case SimHash cannot produce an accurate similarity result.
Disclosure of Invention
The invention provides a method, a system, equipment and a storage medium for marking and querying document similar paragraphs, aiming at the technical problems that the prior art cannot perform similarity calculation on shorter text paragraphs and does not consider semantic information of texts.
In a first aspect, an embodiment of the present application provides a method for querying a document similarity paragraph through a tag, including:
length determination step S1: judging whether the length of the marked text is greater than a first length threshold value or not;
query result obtaining step S2: if the length of the marked text is smaller than the first length threshold, matching the marked text against the documents in the document library to obtain and output a query result; or
query result obtaining step S2': if the length of the marked text is larger than the first length threshold, the document in the document library is subjected to paragraph segmentation, and then a query result is obtained through similarity comparison and output.
In the above method for marking and querying similar paragraphs of documents, the query result obtaining step S2 includes: if the length of the marked text is smaller than the first length threshold, searching for the marked text in all the documents in the document library, and outputting, as the query result, the sentence containing the marked text, the position of the sentence in the document and the corresponding document name.
The method for querying for tags of similar paragraphs in documents as described above, wherein the step of obtaining query results S2' includes:
dividing step S21': performing paragraph segmentation on the document according to the length of the mark text to obtain a plurality of segmented text paragraphs;
similarity calculation step S22': calculating the similarity between the mark text and the segmentation text paragraphs according to the length of the mark text to obtain a plurality of similarities;
similarity comparison step S23': and comparing the similarity with a similarity threshold, and then taking the segmented text paragraphs with the similarity higher than the similarity threshold, the positions of the segmented text paragraphs in the document and the corresponding document names as query results and outputting the query results.
In the above method for querying a document similarity paragraph, the similarity calculating step S22' includes:
step S221': if the length of the marked text is greater than the first length threshold and smaller than a second length threshold, obtaining the similarity between the marked text and the segmented text paragraphs by calculating embedding word vectors of the marked text and the segmented text paragraphs; or
long text similarity calculation step S222': if the length of the marked text is greater than the second length threshold, obtaining the similarity between the marked text and the segmented text paragraphs through an LDA topic model.
In a second aspect, an embodiment of the present application provides a system for querying a document similar to a paragraph, including:
a length judgment unit: judging whether the length of the marked text is greater than a first length threshold value or not;
a query result obtaining unit: if the length of the marked text is smaller than the first length threshold, the query result obtaining unit matches the documents in the document library according to the marked text to obtain a query result and outputs the query result; if the length of the marked text is larger than the first length threshold, the query result obtaining unit performs paragraph segmentation on the documents in the document library, then obtains a query result through similarity comparison, and outputs the query result.
In the system for querying the marks of the document similar paragraphs, if the length of the mark text is smaller than the first length threshold, the query result obtaining unit searches the mark text in all the documents in the document library, and takes the sentence where the mark text is located, the position of the sentence in the document and the corresponding document name as query results and outputs the query results.
The system for querying the marks of the document similar paragraphs, wherein the query result obtaining unit comprises:
a segmentation module: performing paragraph segmentation on the document according to the length of the mark text to obtain a plurality of segmented text paragraphs;
a similarity calculation module: calculating the similarity between the mark text and the segmentation text paragraphs according to the length of the mark text to obtain a plurality of similarities;
a similarity comparison module: and comparing the similarity with a similarity threshold, and then taking the segmented text paragraphs with the similarity higher than the similarity threshold, the positions of the segmented text paragraphs in the document and the corresponding document names as query results and outputting the query results.
In the above system for marking and querying similar paragraphs of documents, if the length of the marked text is greater than the first length threshold and smaller than the second length threshold, the similarity calculation module obtains the similarity between the marked text and the segmented text paragraphs by calculating embedding word vectors of the marked text and the segmented text paragraphs; if the length of the marked text is greater than the second length threshold, the similarity calculation module obtains the similarity between the marked text and the segmented text paragraphs through an LDA topic model.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the method for querying a document similarity paragraph according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for querying for tags of similar paragraphs of documents as described in the first aspect above.
Compared with the prior art, the invention has the advantages and positive effects that:
1. the method divides the marked texts into different types of marked texts according to different lengths, and adopts different matching strategies aiming at the marked texts with different lengths, so that the query result is more accurate;
2. the invention belongs to the technical field of deep learning, and fully considers semantic information when similarity calculation is carried out on medium and long texts, so that the matching effect is greatly improved, and the user experience is improved.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a method for querying documents for similar paragraphs according to the present invention;
FIG. 2 is a flowchart based on step S2' in FIG. 1 according to the present invention;
FIG. 3 is a flowchart based on step S22' in FIG. 2 according to the present invention;
FIG. 4 is a flowchart illustrating an embodiment of a method for querying similar paragraphs of a document according to the present invention;
FIG. 5 is a block diagram of a document similarity paragraph markup query system provided by the present invention;
FIG. 6 is a block diagram of a computer device according to an embodiment of the present application.
Wherein the reference numerals are:
1. a length judgment unit; 2. a query result obtaining unit; 21. a segmentation module; 22. a similarity calculation module; 23. a similarity comparison module; 81. a processor; 82. a memory; 83. a communication interface; 80. a bus.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
Before describing in detail the various embodiments of the present invention, the core inventive concepts of the present invention are summarized and described in detail by the following several embodiments.
The method distinguishes three cases according to the length of the paragraph marked by the user and adopts a different similarity calculation strategy for each: full word matching for short texts, an embedding word vector method for medium-length texts, and an LDA topic model method for long texts. Finally, paragraphs whose similarity is higher than a threshold are returned as results.
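The three-way split can be pictured as a small dispatch function. The sketch below is illustrative only: `choose_strategy` and the returned labels are hypothetical names, and because the text speaks only of lengths "smaller than" or "greater than" a threshold, the handling of a length exactly equal to a threshold (assigned here to the longer class) is an assumption.

```python
def choose_strategy(marked_text, first_threshold, second_threshold):
    """Pick a matching strategy from the character length of the marked text.

    Assumption: a length equal to a threshold falls into the longer class,
    since the source does not say how equality is handled.
    """
    if first_threshold >= second_threshold:
        raise ValueError("first_threshold must be smaller than second_threshold")
    length = len(marked_text)
    if length < first_threshold:
        return "full_word_match"   # short text
    if length < second_threshold:
        return "embedding"         # medium-length text
    return "lda"                   # long text
```

Keeping the thresholds as parameters rather than constants reflects that they are tunable; a document library with different typical sentence lengths would likely need different values.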
The first embodiment is as follows:
fig. 1 is a schematic step diagram of a method for querying a document similarity paragraph according to the present invention, and as shown in fig. 1, this embodiment discloses a specific implementation of a method for querying a document similarity paragraph by a tag (hereinafter referred to as "method").
Since the user may mark only a few words, or a long text, the algorithm should consider the similarity calculation problem of text paragraphs of different lengths at the same time. Therefore, the invention provides an automatic query matching method for text similar paragraphs, which is suitable for an enterprise-level document library.
Specifically, the method disclosed in this embodiment mainly includes the following steps:
step S1: judging whether the length of the marked text is greater than a first length threshold; specifically, if the text length is smaller than the first length threshold, the marked text is regarded as a short text, and if the text length is greater than the first length threshold, the marked text is regarded as a medium-length text or a long text.
Step S2: and if the length of the marked text is smaller than the first length threshold, matching the documents in the document library according to the marked text to obtain a query result and outputting the query result.
Specifically, the short text adopts a strategy of full word matching, namely, the marked text is searched in all documents in the document library, and the sentence where the marked text is located, the position of the sentence in the document and the corresponding document name are used as query results and output.
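A minimal sketch of the full word matching strategy follows. It is not the patent's implementation: `full_word_match`, the result field names and the regular expression used to split sentences are assumptions, and a production system would use a proper sentence splitter.

```python
import re

# Hypothetical sentence pattern: a run of non-terminator characters
# followed by an optional Chinese or Western sentence terminator.
SENTENCE_RE = re.compile(r"[^。！？.!?\n]+[。！？.!?]?")


def full_word_match(marked_text, documents):
    """Find every sentence that contains `marked_text`.

    `documents` maps a document name to its full text. Each hit records
    the document name, the character offset of the sentence and the sentence.
    """
    results = []
    if not marked_text:
        return results
    for name, text in documents.items():
        for match in SENTENCE_RE.finditer(text):
            raw = match.group()
            sentence = raw.strip()
            if marked_text in sentence:
                # Skip leading whitespace so the offset points at the sentence.
                offset = match.start() + (len(raw) - len(raw.lstrip()))
                results.append(
                    {"document": name, "offset": offset, "sentence": sentence}
                )
    return results
```

The search is a plain substring test, which matches the description of short-text matching as being similar to a global search.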
Step S2': if the length of the marked text is larger than the first length threshold, the document in the document library is subjected to paragraph segmentation, and then a query result is obtained through similarity comparison and output.
Referring to fig. 2, step S2' includes the following steps:
step S21': performing paragraph segmentation on the document according to the length of the mark text to obtain a plurality of segmented text paragraphs;
When similarity is calculated for medium-length and long texts, the document is first segmented into a plurality of text paragraphs whose length is similar to that of the marked text, so that each segmented paragraph can be compared with the marked text in turn. The lengths of the segmented text paragraphs and of the marked text should be as close as possible, which makes the similarity result more accurate. Paragraph segmentation normally takes the sentence as its smallest unit, so segmentation generally does not break up a sentence. However, if a sentence is much longer than the marked text, the sentence is split according to the length of the marked text.
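One possible reading of this segmentation rule is sketched below. The function names are hypothetical, and two points the text leaves open are filled in by assumption: a sentence counts as "much longer" when it is at least twice the marked length, and shorter sentences are packed greedily until adding another would exceed the marked length. The remainder of a split sentence is folded into its last piece, so a 45-character sentence with a 20-character mark yields pieces of 20 and 25 characters.

```python
def split_long_sentence(sentence, target):
    """Cut a sentence into pieces of `target` characters; the last piece
    also takes the remainder."""
    count = len(sentence) // target
    if count < 2:
        return [sentence]
    pieces = [sentence[i * target:(i + 1) * target] for i in range(count - 1)]
    pieces.append(sentence[(count - 1) * target:])
    return pieces


def segment(sentences, target):
    """Group sentences into segments of roughly `target` characters."""
    if target <= 0:
        raise ValueError("target must be positive")
    segments, current = [], ""
    for sentence in sentences:
        if len(sentence) >= 2 * target:
            # Assumed meaning of "much longer": at least twice the target.
            if current:
                segments.append(current)
                current = ""
            segments.extend(split_long_sentence(sentence, target))
            continue
        if current and len(current) + len(sentence) > target:
            segments.append(current)
            current = ""
        current += sentence
    if current:
        segments.append(current)
    return segments
```

Here `target` is the character length of the marked text, so each query produces its own segmentation of the library documents.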
Step S22': calculating the similarity between the mark text and the segmentation text paragraphs according to the length of the mark text to obtain a plurality of similarities;
referring to fig. 3, step S22' specifically includes the following contents:
step S221': if the length of the marked text is larger than the first length threshold and smaller than a second length threshold, obtaining the similarity between the marked text and the segmented text paragraphs by calculating embedding word vectors of the marked text and the segmented text paragraphs;
Specifically, a marked text whose length is greater than the first length threshold and smaller than the second length threshold is regarded as a medium-length text, and its similarity is calculated by an embedding word vector method. First, word segmentation is performed on the marked text and on the segmented text paragraph, and stop words are filtered out; a pre-trained word vector model is then used to obtain the embedding word vector of each remaining word. The embedding word vectors of the marked text and of the segmented text paragraph are each averaged dimension by dimension, and the resulting average word vectors serve as the vector representations of the two texts. The cosine distance between the two vectors is then calculated as the similarity between the marked text and the segmented text paragraph. If the documents in the document library contain a large amount of specialized vocabulary, a corpus can be constructed from the document library and a word vector model retrained with word2vec or GloVe.
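The averaging and comparison can be sketched in plain Python as follows. The names `average_vector` and `cosine_similarity` are hypothetical, the input is assumed to be already word-segmented, and `word_vectors` stands in for a pre-trained model as a simple dictionary. The sketch returns cosine similarity, so that a larger value means more similar; treating the "cosine distance" of the text this way is an assumption about the intent.

```python
import math


def average_vector(tokens, word_vectors, stop_words=frozenset()):
    """Average the vectors of all known, non-stop-word tokens by dimension.

    Returns None when no token has a vector.
    """
    vectors = [
        word_vectors[t] for t in tokens
        if t not in stop_words and t in word_vectors
    ]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Words missing from the vocabulary are skipped silently in this sketch, which is one reason retraining on the document library helps when specialized vocabulary is common.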
Step S222': if the length of the marked text is greater than the second length threshold, obtaining the similarity between the marked text and the segmented text paragraphs through an LDA topic model.
Specifically, if the length of the marked text is greater than the second length threshold, the marked text is regarded as long text. The similarity calculation of long texts by using an embedding word vector method is not good in effect because the features of word vectors are weakened due to the overlong text when the word vector mean value is calculated. Therefore, the LDA topic model method is adopted to calculate the similarity. Firstly, a corpus is constructed through documents in a document library, and an LDA topic model is trained by utilizing the corpus; respectively obtaining the topic distribution of the marked text and the segmentation text paragraphs through the trained LDA topic model; and calculating the Hellinger distance between the marked text and the topic distribution of the segmented text paragraphs as the similarity of the marked text and the segmented text paragraphs.
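The final comparison step can be sketched without training a model, by taking two topic distributions as given. For discrete distributions P and Q the Hellinger distance is H(P, Q) = sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2) / sqrt(2), which lies between 0 (identical) and 1 (no shared topics). The function names below are hypothetical, and converting the distance into a score with 1 - H, so that a larger value means more similar, is an assumption; the text itself uses the distance directly as the similarity.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


def topic_similarity(p, q):
    """Assumed conversion: 1 minus the distance, so 1.0 means identical."""
    return 1.0 - hellinger_distance(p, q)
```

In a full pipeline, `p` and `q` would be the topic distributions that the trained LDA model infers for the marked text and for one segmented text paragraph.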
Step S23': and comparing the similarity with a similarity threshold, and then taking the segmented text paragraphs with the similarity higher than the similarity threshold, the positions of the segmented text paragraphs in the document and the corresponding document names as query results and outputting the query results.
Specifically, when the marked text is a medium-length text or a long text, a similarity threshold needs to be set, and paragraphs lower than the threshold are regarded as dissimilar and are directly filtered out; paragraphs above the threshold value can return results on demand according to the similarity from high to low, and the final similar paragraphs, the positions of the paragraphs in the document and the corresponding document names are returned as the results.
Please refer to fig. 4. Fig. 4 is a schematic flowchart of an embodiment of a method for querying a document similarity paragraph according to the present invention, and with reference to fig. 4, an application flow of the method is specifically described as follows:
the invention is divided into three conditions of short, medium and long according to the paragraph length marked by the user. Different similarity calculation strategies are adopted for marks of different lengths. The whole process is as follows:
1. First, the length of the text marked by the user is judged; if the length is less than 6 characters, the text is regarded as a short text. Short texts use a full word matching strategy: the marked text is searched for in all documents in the document library, and wherever it appears in a document, the whole sentence containing the occurrence is returned as a result. Matching of short text is similar to a global search.
2. If the length of the text marked by the user is larger than 6 characters, similarity calculation is needed.
Paragraph segmentation is performed first. Paragraph segmentation divides the document into a plurality of segments whose length is similar to that of the marked text, so that each segment can be compared with the marked text in turn. The lengths of the segmented text paragraphs and of the marked text are kept as close as possible, which makes the similarity result more accurate. Paragraph segmentation takes the sentence as its minimum unit, i.e., segmentation generally does not break up a sentence. However, if a sentence is much longer than the marked text, the long sentence is split according to the length of the marked text. For example, if the marked text has a length of 20 characters and a certain sentence in the document has a length of 45 characters, the sentence is divided into two paragraphs: the first 20 characters and the remaining 25 characters.
3. After segmentation, a text similarity calculation method is selected according to the length of the marked text.
If the length of the marked text is less than 25 characters, the marked text is regarded as a medium-length text, and its similarity is calculated by an embedding word vector method. The specific process is as follows: word segmentation is performed on the text paragraph and stop words are removed. A pre-trained word vector model is then used to obtain the embedding word vector of each remaining word. The word vectors of all words are averaged dimension by dimension, and the resulting average word vector is taken as the vector representation of the text paragraph. Vector representations are computed for the marked text paragraph and for the segmented text paragraph, and the cosine distance between the two vectors is then calculated as the similarity of the two paragraphs.
4. If the length of the marked text is more than 25 characters, the marked text is regarded as long text. The similarity calculation of long texts by using an embedding word vector method has a poor effect because the features of word vectors are weakened due to overlong texts during averaging. Therefore, the LDA topic model method is adopted to calculate the similarity. And constructing a corpus by using the documents in the document library, and training the LDA topic model. And respectively obtaining the topic distribution of the marked text paragraphs and the segmented text paragraphs through the trained LDA model. The Hellinger distance of the two topic distributions is calculated as the similarity of the two paragraphs.
5. Threshold filtering: a similarity threshold is set, and paragraphs below the threshold are regarded as dissimilar and are filtered out directly. Of all the paragraphs in a document that are above the threshold, the K highest-scoring paragraphs (top K) may be returned as the final result.
6. Returning a result: the final similar sentence or paragraph, the position of the sentence or paragraph in the document (offset), and the corresponding document name (document ID) are returned as the result.
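Steps 5 and 6 can be sketched as a single selection function. The name `select_results` and the record fields are hypothetical, and applying the top-K cut across all candidates passed in, rather than separately per document, is an assumption about an ambiguous point in the text.

```python
def select_results(candidates, threshold, top_k=None):
    """Drop candidates at or below the threshold and sort the rest by score.

    Each candidate is a dict with the keys 'document', 'offset', 'text'
    and 'similarity'. If `top_k` is given, only the best `top_k` are kept.
    """
    kept = [c for c in candidates if c["similarity"] > threshold]
    kept.sort(key=lambda c: c["similarity"], reverse=True)
    return kept if top_k is None else kept[:top_k]
```

To obtain a per-document top K instead, the function could be called once per document with only that document's candidates.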
By fusing multiple matching strategies chosen according to text length, and by using unsupervised models that fully consider semantic information without requiring labeled training data, the invention improves the matching effect across the different situations in which content is similar.
Example two:
In combination with the method for marking and querying similar paragraphs of documents disclosed in the first embodiment, this embodiment discloses a specific implementation of a system for marking and querying similar paragraphs of documents (hereinafter referred to as the "system").
Referring to fig. 5, the system includes:
length determination unit 1: judging whether the length of the marked text is greater than a first length threshold value or not;
query result obtaining unit 2: if the length of the marked text is smaller than the first length threshold, the query result obtaining unit matches the documents in the document library according to the marked text to obtain a query result and outputs the query result; if the length of the marked text is larger than the first length threshold, the query result obtaining unit performs paragraph segmentation on the documents in the document library, then obtains a query result through similarity comparison, and outputs the query result.
Specifically, if the length of the tagged text is smaller than the first length threshold, the query result obtaining unit 2 searches all documents in the document library for the tagged text, and outputs a sentence where the tagged text is located, a position of the sentence in the document, and a corresponding document name as a query result.
Specifically, the query result obtaining unit 2 includes:
the segmentation module 21: performing paragraph segmentation on the document according to the length of the marked text to obtain a plurality of segmented text paragraphs;
the similarity calculation module 22: calculating the similarity between the marked text and each segmented text paragraph according to the length of the marked text to obtain a plurality of similarities;
the similarity comparison module 23: comparing each similarity with a similarity threshold, and outputting, as the query result, the segmented text paragraphs whose similarity is higher than the similarity threshold, their positions in the document, and the corresponding document names.
Specifically, if the length of the marked text is greater than the first length threshold and smaller than the second length threshold, the similarity calculation module 22 obtains the similarity between the marked text and the segmented text paragraphs by calculating word embedding vectors of the marked text and of the segmented text paragraphs; if the length of the marked text is greater than the second length threshold, the similarity calculation module 22 obtains the similarity between the marked text and the segmented text paragraphs through an LDA topic model.
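How the three modules might fit together is sketched below in Python, under several stated assumptions. The window size (equal to the length of the marked text), the half-window stride, the treatment of lengths exactly equal to a threshold, and cosine similarity as the comparison measure are not specified in the text. The `encode` callable is a hypothetical placeholder for whichever representation the selected strategy produces (for example an averaged word vector or a topic distribution).

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def choose_strategy(length: int, first_threshold: int, second_threshold: int) -> str:
    """Select a matching strategy from the length of the marked text."""
    if length < first_threshold:
        return "exact_match"
    if length < second_threshold:  # assumption: equality falls to the later branch
        return "embedding"
    return "topic_model"


def segment(document: str, window: int, stride: int) -> List[Tuple[int, str]]:
    """Cut a document into overlapping windows; returns (offset, paragraph) pairs."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = max(len(document) - window, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail of the document is covered
    return [(s, document[s:s + window]) for s in starts]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_paragraphs(
    marked_text: str,
    document: str,
    encode: Callable[[str], Vector],
) -> List[Tuple[int, str, float]]:
    """Return (offset, paragraph, similarity) for every segmented paragraph."""
    window = len(marked_text)        # assumption: window equals the marked-text length
    stride = max(window // 2, 1)     # assumption: 50% overlap between windows
    target = encode(marked_text)
    return [
        (offset, paragraph, cosine(target, encode(paragraph)))
        for offset, paragraph in segment(document, window, stride)
    ]
```

The scored paragraphs would then be compared against the similarity threshold by the similarity comparison module 23. Passing the encoder in as a parameter keeps segmentation and comparison identical for both longer-text strategies.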
For the technical solutions that the system shares with the method for querying marks of document similar paragraphs, please refer to the description of the first embodiment; they are not repeated here.
Example three:
Referring to fig. 6, this embodiment discloses a specific implementation of a computer device. The computer device may comprise a processor 81 and a memory 82 in which computer program instructions are stored.
Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.
The processor 81 realizes the mark query method of the document similar paragraph in any one of the above embodiments by reading and executing the computer program instructions stored in the memory 82.
In some of these embodiments, the computer device may also include a communication interface 83 and a bus 80. As shown in fig. 6, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.
The communication interface 83 is used for implementing communication between modules, devices, units and/or equipment in the embodiments of the present application. The communication interface 83 may also carry out data communication with other components, such as external equipment, image/data acquisition equipment, a database, external storage, an image/data processing workstation and the like.
In addition, in combination with the method for querying marks of document similar paragraphs in the foregoing embodiments, the embodiments of the present application may provide a computer-readable storage medium to implement the method. The computer-readable storage medium has computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement the method for querying marks of document similar paragraphs in any of the above embodiments.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A method for querying marks of document similar paragraphs, characterized by comprising the following steps:
length determination step S1: judging whether the length of the marked text is greater than a first length threshold;
query result obtaining step S2: if the length of the marked text is smaller than the first length threshold, matching the documents in the document library according to the marked text to obtain a query result, and outputting the query result; or
query result obtaining step S2': if the length of the marked text is greater than the first length threshold, performing paragraph segmentation on the documents in the document library, then obtaining a query result through similarity comparison, and outputting the query result.
2. The method for querying marks of document similar paragraphs according to claim 1, wherein the query result obtaining step S2 includes: if the length of the marked text is smaller than the first length threshold, searching for the marked text in all the documents in the document library, and outputting, as the query result, the sentence containing the marked text, the position of the sentence in the document, and the corresponding document name.
3. The method for querying marks of document similar paragraphs according to claim 1, wherein the query result obtaining step S2' includes:
segmentation step S21': performing paragraph segmentation on the document according to the length of the marked text to obtain a plurality of segmented text paragraphs;
similarity calculation step S22': calculating the similarity between the marked text and each segmented text paragraph according to the length of the marked text to obtain a plurality of similarities;
similarity comparison step S23': comparing each similarity with a similarity threshold, and outputting, as the query result, the segmented text paragraphs whose similarity is higher than the similarity threshold, their positions in the document, and the corresponding document names.
4. The method for querying marks of document similar paragraphs according to claim 3, wherein the similarity calculation step S22' comprises:
step S221': if the length of the marked text is greater than the first length threshold and smaller than a second length threshold, obtaining the similarity between the marked text and the segmented text paragraphs by calculating word embedding vectors of the marked text and of the segmented text paragraphs; or
long text similarity calculation step S222': if the length of the marked text is greater than the second length threshold, obtaining the similarity between the marked text and the segmented text paragraphs through an LDA topic model.
5. A system for querying marks of document similar paragraphs, comprising:
a length determination unit: judging whether the length of the marked text is greater than a first length threshold;
a query result obtaining unit: if the length of the marked text is smaller than the first length threshold, the query result obtaining unit matches the documents in the document library according to the marked text to obtain a query result and outputs the query result; if the length of the marked text is greater than the first length threshold, the query result obtaining unit performs paragraph segmentation on the documents in the document library, then obtains a query result through similarity comparison, and outputs the query result.
6. The system for querying marks of document similar paragraphs according to claim 5, wherein if the length of the marked text is smaller than the first length threshold, the query result obtaining unit searches all documents in the document library for the marked text, and outputs, as the query result, the sentence containing the marked text, the position of the sentence in the document, and the corresponding document name.
7. The system for querying marks of document similar paragraphs according to claim 5, wherein the query result obtaining unit comprises:
a segmentation module: performing paragraph segmentation on the document according to the length of the marked text to obtain a plurality of segmented text paragraphs;
a similarity calculation module: calculating the similarity between the marked text and each segmented text paragraph according to the length of the marked text to obtain a plurality of similarities;
a similarity comparison module: comparing each similarity with a similarity threshold, and outputting, as the query result, the segmented text paragraphs whose similarity is higher than the similarity threshold, their positions in the document, and the corresponding document names.
8. The system for querying marks of document similar paragraphs according to claim 7, wherein if the length of the marked text is greater than the first length threshold and smaller than a second length threshold, the similarity calculation module obtains the similarity between the marked text and the segmented text paragraphs by calculating word embedding vectors of the marked text and of the segmented text paragraphs; if the length of the marked text is greater than the second length threshold, the similarity calculation module obtains the similarity between the marked text and the segmented text paragraphs through an LDA topic model.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for querying marks of document similar paragraphs according to any one of claims 1 to 4.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for querying marks of document similar paragraphs according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110388914.XA CN113139374A (en) | 2021-04-12 | 2021-04-12 | Method, system, equipment and storage medium for querying marks of document similar paragraphs |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113139374A (en) | 2021-07-20 |
Family
ID=76811178
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110388914.XA Pending CN113139374A (en) | 2021-04-12 | 2021-04-12 | Method, system, equipment and storage medium for querying marks of document similar paragraphs |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113139374A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115687840A (en) * | 2023-01-03 | 2023-02-03 | 上海朝阳永续信息技术股份有限公司 | Method, apparatus and storage medium for processing predetermined type information in web page |
Similar Documents
Publication | Title |
---|---|
CN109145153B (en) | Intention category identification method and device |
CN108829893B (en) | Method and device for determining video label, storage medium and terminal equipment |
CN107463605B (en) | Method and device for identifying low-quality news resource, computer equipment and readable medium |
WO2019153551A1 (en) | Article classification method and apparatus, computer device and storage medium |
CN111581355B (en) | Threat information topic detection method, device and computer storage medium |
CN104881458B (en) | A kind of mask method and device of Web page subject |
CN109635157B (en) | Model generation method, video search method, device, terminal and storage medium |
US20240126799A1 (en) | Topic segmentation of image-derived text |
US20160188569A1 (en) | Generating a Table of Contents for Unformatted Text |
CN107861948B (en) | Label extraction method, device, equipment and medium |
CN110083832B (en) | Article reprint relation identification method, device, equipment and readable storage medium |
CN109271624B (en) | Target word determination method, device and storage medium |
CN112464640A (en) | Data element analysis method, device, electronic device and storage medium |
CN112347758A (en) | Text abstract generation method and device, terminal equipment and storage medium |
CN114995903B (en) | Class label identification method and device based on pre-training language model |
CN112784009A (en) | Subject term mining method and device, electronic equipment and storage medium |
CN112232070A (en) | Natural language processing model construction method, system, electronic device and storage medium |
CN113204956B (en) | Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device |
US11966455B2 (en) | Text partitioning method, text classifying method, apparatus, device and storage medium |
CN113139374A (en) | Method, system, equipment and storage medium for querying marks of document similar paragraphs |
CN114090781A (en) | Text data-based repulsion event detection method and device |
CN113139383A (en) | Document sorting method, system, electronic equipment and storage medium |
CN112949299A (en) | Method and device for generating news manuscript, storage medium and electronic device |
CN109815996B (en) | Scene self-adaptation method and device based on recurrent neural network |
CN112446204A (en) | Document tag determination method, system and computer equipment |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |