CN113139374A - Method, system, equipment and storage medium for querying marks of document similar paragraphs - Google Patents

Publication number
CN113139374A
Authority
CN
China
Prior art keywords
text
length
similarity
document
paragraphs
Prior art date
Legal status
Pending
Application number
CN202110388914.XA
Other languages
Chinese (zh)
Inventor
刘俊辰
尤旸
Current Assignee
Beijing Minglue Zhaohui Technology Co Ltd
Original Assignee
Beijing Minglue Zhaohui Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Minglue Zhaohui Technology Co Ltd
Priority to CN202110388914.XA
Publication of CN113139374A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, a system, equipment and a storage medium for mark-based querying of similar document paragraphs. The method comprises the following steps: judging whether the length of the marked text is greater than a first length threshold; if the length of the marked text is smaller than the first length threshold, matching the marked text against the documents in the document library to obtain a query result and outputting the query result; or, if the length of the marked text is larger than the first length threshold, segmenting the documents in the document library into paragraphs and then obtaining and outputting a query result through similarity comparison. The method classifies marked texts by length and applies a different matching strategy to each class, so that the query result is more accurate.

Description

Method, system, equipment and storage medium for querying marks of document similar paragraphs
Technical Field
The invention relates to the technical field of data analysis, in particular to a method, a system, equipment and a storage medium for marking and inquiring similar paragraphs of a document.
Background
Today, many businesses hold large volumes of document text data, including product manuals, business contracts, deployment documentation and so forth, all of which are highly specialized documents. To facilitate unified management, many companies centralize this document data and provide intelligent services on top of it, such as query, reading and recommendation. An automatic query-and-match service for similar text paragraphs helps users make better use of the text resources in the document library and increases the value of those resources. Its basic function is as follows: while reading, the user manually marks a piece of text; the system then uses NLP and related technologies in the background to automatically match, from all the documents in the document library, the passages whose content is similar to the marked passage, and returns them to the user. The user can then consult the matched paragraphs or texts with similar content as references.
The prior art mostly consists of solutions akin to text duplicate checking. One example is SimHash, which is computed roughly as follows:
1. Feature extraction: extract features and the weights corresponding to the features from the document;
2. Hashing: hash each feature to generate a corresponding hash value;
3. Hash value weighting: process each bit of each feature hash value in turn: if the bit is 1, replace it with +weight; otherwise, replace it with -weight;
4. Summation: sum the weighted results of all feature hashes bit by bit, then binarize the sums bit by bit: a sum greater than 0 becomes 1, otherwise it becomes 0. The result is the final SimHash value.
After the SimHash values of the documents are obtained, the Hamming distance between the SimHash values of two documents is calculated as the measure of their similarity.
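The four steps above can be sketched in a few lines of Python. This is only an illustration under assumptions not stated in the text: features arrive as ready-made (feature, weight) pairs, MD5 truncated to 64 bits stands in for the unspecified hash function, and the names simhash and hamming_distance are hypothetical.

```python
import hashlib


def simhash(weighted_features, bits=64):
    """Compute a SimHash fingerprint from (feature, weight) pairs."""
    totals = [0.0] * bits
    for feature, weight in weighted_features:
        # Step 2: hash the feature (MD5 truncated to `bits` is an arbitrary choice).
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Step 3: a 1 bit contributes +weight, a 0 bit contributes -weight.
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    # Step 4: binarize the per-bit sums into the final fingerprint.
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

A smaller Hamming distance between two fingerprints indicates more similar documents; with only a handful of features, as in a short marked passage, the fingerprint is dominated by individual feature hashes, which is one way to see why the approach suits whole documents better than short texts.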
However, SimHash was originally an algorithm used by Google for deduplicating massive numbers of web pages, and it is suited to similarity calculation over whole documents. For shorter text passages, SimHash often fails to achieve good results. In addition, SimHash does not take the semantic information of the text into account. In a language such as Chinese, where expression is very flexible, two passages may share only one or a few concepts rather than large stretches of similar content, and in such cases SimHash cannot produce an accurate similarity result.
Disclosure of Invention
The invention provides a method, a system, equipment and a storage medium for marking and querying document similar paragraphs, aiming at the technical problems that the prior art cannot perform similarity calculation on shorter text paragraphs and does not consider semantic information of texts.
In a first aspect, an embodiment of the present application provides a method for mark-based querying of similar document paragraphs, including:
length determination step S1: judging whether the length of the marked text is greater than a first length threshold;
query result obtaining step S2: if the length of the marked text is smaller than the first length threshold, matching the marked text against the documents in the document library to obtain a query result and outputting the query result; or
query result obtaining step S2': if the length of the marked text is larger than the first length threshold, segmenting the documents in the document library into paragraphs, and then obtaining and outputting a query result through similarity comparison.
In the above method, the query result obtaining step S2 includes: if the length of the marked text is smaller than the first length threshold, searching for the marked text in all the documents in the document library, and outputting the sentence in which the marked text is located, the position of the sentence in the document and the corresponding document name as the query result.
In the above method, the query result obtaining step S2' includes:
segmentation step S21': performing paragraph segmentation on the document according to the length of the marked text to obtain a plurality of segmented text paragraphs;
similarity calculation step S22': calculating the similarity between the marked text and each segmented text paragraph according to the length of the marked text to obtain a plurality of similarities;
similarity comparison step S23': comparing each similarity with a similarity threshold, and outputting the segmented text paragraphs whose similarity is higher than the similarity threshold, the positions of those paragraphs in the document and the corresponding document names as the query result.
In the above method, the similarity calculation step S22' includes:
step S221': if the length of the marked text is larger than the first length threshold and smaller than a second length threshold, obtaining the similarity between the marked text and the segmented text paragraphs by calculating embedding word vectors of the marked text and the segmented text paragraphs; or
long text similarity calculation step S222': if the length of the marked text is greater than the second length threshold, obtaining the similarity between the marked text and the segmented text paragraphs through an LDA topic model.
In a second aspect, an embodiment of the present application provides a system for mark-based querying of similar document paragraphs, including:
a length judgment unit: judging whether the length of the marked text is greater than a first length threshold value or not;
a query result obtaining unit: if the length of the marked text is smaller than the first length threshold, the query result obtaining unit matches the documents in the document library according to the marked text to obtain a query result and outputs the query result; if the length of the marked text is larger than the first length threshold, the query result obtaining unit performs paragraph segmentation on the documents in the document library, then obtains a query result through similarity comparison, and outputs the query result.
In the above system, if the length of the marked text is smaller than the first length threshold, the query result obtaining unit searches for the marked text in all the documents in the document library, and outputs the sentence in which the marked text is located, the position of the sentence in the document and the corresponding document name as the query result.
In the above system, the query result obtaining unit comprises:
a segmentation module: performing paragraph segmentation on the document according to the length of the marked text to obtain a plurality of segmented text paragraphs;
a similarity calculation module: calculating the similarity between the marked text and each segmented text paragraph according to the length of the marked text to obtain a plurality of similarities;
a similarity comparison module: comparing each similarity with a similarity threshold, and outputting the segmented text paragraphs whose similarity is higher than the similarity threshold, the positions of those paragraphs in the document and the corresponding document names as the query result.
In the above system, if the length of the marked text is greater than the first length threshold and smaller than the second length threshold, the similarity calculation module obtains the similarity between the marked text and the segmented text paragraphs by calculating embedding word vectors of the marked text and the segmented text paragraphs; if the length of the marked text is larger than the second length threshold, the similarity calculation module obtains the similarity between the marked text and the segmented text paragraphs through an LDA topic model.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the method for querying a document similarity paragraph according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for querying for tags of similar paragraphs of documents as described in the first aspect above.
Compared with the prior art, the invention has the advantages and positive effects that:
1. the method divides the marked texts into different types of marked texts according to different lengths, and adopts different matching strategies aiming at the marked texts with different lengths, so that the query result is more accurate;
2. the invention belongs to the technical field of deep learning, and fully considers semantic information when similarity calculation is carried out on medium and long texts, so that the matching effect is greatly improved, and the user experience is improved.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a method for querying documents for similar paragraphs according to the present invention;
FIG. 2 is a flowchart based on step S2' in FIG. 1 according to the present invention;
FIG. 3 is a flowchart based on step S22' in FIG. 2 according to the present invention;
FIG. 4 is a flowchart illustrating an embodiment of a method for querying similar paragraphs of a document according to the present invention;
FIG. 5 is a block diagram of a document similarity paragraph markup query system provided by the present invention;
FIG. 6 is a block diagram of a computer device according to an embodiment of the present application.
Wherein the reference numerals are:
1. a length judgment unit; 2. a query result obtaining unit; 21. a segmentation module; 22. a similarity calculation module; 23. a similarity comparison module; 81. a processor; 82. a memory; 83. a communication interface; 80. a bus.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
Before describing in detail the various embodiments of the present invention, the core inventive concepts of the present invention are summarized and described in detail by the following several embodiments.
The method distinguishes three cases according to the length of the paragraph marked by the user and applies a different similarity calculation strategy to each: full-word matching for short texts, an embedding word vector method for medium-length texts, and an LDA topic model method for long texts. Finally, the paragraphs whose similarity is higher than a threshold are returned as the result.
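As a minimal sketch of this three-way selection, the function below maps the length of a marked text to a strategy label. The default thresholds of 6 and 25 characters are the example values given in the worked embodiment of this description; the function name, the returned labels and the treatment of a length exactly equal to a threshold (assigned here to the longer category) are assumptions, since the text does not specify them.

```python
def choose_strategy(marked_text, first_threshold=6, second_threshold=25):
    """Pick a matching strategy from the length of the marked text.

    Lengths exactly equal to a threshold are not covered by the description;
    this sketch assigns them to the longer category (an assumption).
    """
    length = len(marked_text)
    if length < first_threshold:
        return "full_word_match"      # short text
    if length < second_threshold:
        return "embedding_similarity"  # medium-length text
    return "lda_topic_similarity"      # long text
```

Keeping the thresholds as parameters reflects the claims, which speak only of a first and a second length threshold rather than fixed values.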
The first embodiment is as follows:
FIG. 1 is a schematic diagram of the steps of the method for mark-based querying of similar document paragraphs according to the present invention. As shown in FIG. 1, this embodiment discloses a specific implementation of the method for mark-based querying of similar document paragraphs (hereinafter referred to as the "method").
Since the user may mark only a few words, or a long text, the algorithm should consider the similarity calculation problem of text paragraphs of different lengths at the same time. Therefore, the invention provides an automatic query matching method for text similar paragraphs, which is suitable for an enterprise-level document library.
Specifically, the method disclosed in this embodiment mainly includes the following steps:
Step S1: judging whether the length of the marked text is greater than a first length threshold. Specifically, if the text length is smaller than the first length threshold, the marked text is regarded as a short text; if the text length is larger than the first length threshold, the marked text is regarded as a medium-length text or a long text.
Step S2: if the length of the marked text is smaller than the first length threshold, matching the marked text against the documents in the document library to obtain a query result and outputting the query result.
Specifically, the short text adopts a strategy of full word matching, namely, the marked text is searched in all documents in the document library, and the sentence where the marked text is located, the position of the sentence in the document and the corresponding document name are used as query results and output.
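The full-word matching strategy for short texts might look like the sketch below. It assumes the document library is available as a dictionary from document name to full text and that sentences end at common Chinese or English end punctuation; both are illustrative assumptions, as is the function name find_exact_matches.

```python
import re

# Assumed sentence rule: a run of non-terminator characters plus an optional terminator.
SENTENCE_PATTERN = re.compile(r"[^。！？.!?]+[。！？.!?]?")


def find_exact_matches(marked_text, documents):
    """Return the sentence, its offset and the document name for every hit.

    `documents` maps a document name to its full text (hypothetical layout).
    """
    results = []
    for name, text in documents.items():
        for match in SENTENCE_PATTERN.finditer(text):
            raw = match.group()
            if marked_text in raw:
                # Skip leading whitespace so the offset points at the sentence itself.
                leading = len(raw) - len(raw.lstrip())
                results.append({
                    "document": name,
                    "offset": match.start() + leading,
                    "sentence": raw.strip(),
                })
    return results
```

Because the search works sentence by sentence, a mark that spans a sentence boundary would not be found; for marks of only a few characters this limitation should rarely matter.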
Step S2': if the length of the marked text is larger than the first length threshold, the document in the document library is subjected to paragraph segmentation, and then a query result is obtained through similarity comparison and output.
Referring to fig. 2, step S2' includes the following steps:
Step S21': performing paragraph segmentation on the document according to the length of the marked text to obtain a plurality of segmented text paragraphs.
When similarity is calculated for medium-length and long texts, the document is first segmented into a plurality of text paragraphs whose length is similar to that of the marked text, so that each of them can be compared with the marked text in turn. The lengths of the segmented text paragraphs and of the marked text need to be as close as possible so that the similarity calculation is more accurate. Paragraph segmentation usually takes the sentence as its smallest unit, that is, segmentation generally does not break up a sentence. However, if the length of a sentence is much greater than the length of the marked text, the sentence is split according to the length of the marked text.
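One possible reading of this segmentation rule is sketched below. The text does not define "much greater", so the sketch treats a sentence as too long when it reaches long_factor times the marked-text length (2.0 by default, an assumption); it also merges a short trailing remainder into the preceding chunk, so that a 45-character sentence measured against a 20-character mark yields chunks of 20 and 25 characters. The sentence rule and all names are hypothetical.

```python
import re

SENTENCE_PATTERN = re.compile(r"[^。！？.!?]+[。！？.!?]?")


def split_long_sentence(sentence, target_len):
    """Cut a sentence into target_len chunks, merging a short tail into the last one."""
    chunks = [sentence[i:i + target_len] for i in range(0, len(sentence), target_len)]
    if len(chunks) > 1 and len(chunks[-1]) < target_len:
        tail = chunks.pop()
        chunks[-1] += tail
    return chunks


def segment_document(text, target_len, long_factor=2.0):
    """Group sentences into segments of roughly target_len characters."""
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    sentences = [s for s in SENTENCE_PATTERN.findall(text) if s.strip()]
    segments, current = [], ""
    for sentence in sentences:
        if len(sentence) >= long_factor * target_len:
            # Sentence is much longer than the mark: flush and split it by length.
            if current:
                segments.append(current)
                current = ""
            segments.extend(split_long_sentence(sentence, target_len))
            continue
        if current and len(current) + len(sentence) > target_len:
            # Adding this sentence would overshoot: close the current segment.
            segments.append(current)
            current = ""
        current += sentence
    if current:
        segments.append(current)
    return segments
```

Whole sentences are kept together wherever possible, so individual segments can end up somewhat shorter or longer than the marked text.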
Step S22': calculating the similarity between the mark text and the segmentation text paragraphs according to the length of the mark text to obtain a plurality of similarities;
referring to fig. 3, step S22' specifically includes the following contents:
Step S221': if the length of the marked text is larger than the first length threshold and smaller than a second length threshold, obtaining the similarity between the marked text and the segmented text paragraphs by calculating embedding word vectors of the marked text and the segmented text paragraphs.
Specifically, a marked text whose length is greater than the first length threshold and less than the second length threshold is regarded as a medium-length text, and its similarity is calculated with an embedding word vector method. First, word segmentation is performed on the marked text and on the segmented text paragraph, and stop words are filtered out; a pre-trained word vector model is then used to obtain the embedding word vector of each remaining word. The embedding word vectors of the marked text and of the segmented text paragraph are each averaged dimension by dimension, and the resulting average word vectors serve as the vector representations of the marked text and of the segmented text paragraph. The cosine distance of the two vectors is then calculated as the similarity of the marked text and the segmented text paragraph. If the documents in the document library contain a large amount of specialized vocabulary, a corpus can be built from the document library and a word vector model retrained with word2vec or GloVe.
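The averaging and comparison can be sketched as follows. The sketch assumes the texts are already tokenized and that word vectors are available as a plain dictionary (a toy stand-in for a pre-trained model); it computes cosine similarity, so that a higher value means more similar, which is an interpretation of the "cosine distance" named in the text. All names are hypothetical.

```python
import math


def mean_vector(tokens, embeddings, stop_words=frozenset()):
    """Average the word vectors of the tokens, dimension by dimension.

    Tokens that are stop words or missing from `embeddings` are skipped.
    Returns None when no token has a vector.
    """
    vectors = [embeddings[t] for t in tokens if t not in stop_words and t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For example, with toy two-dimensional vectors in which "contract" and "agreement" point in nearly the same direction, the two single-word texts score close to 1, while "contract" and an unrelated word with an orthogonal vector score 0.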
Step S222': if the length of the marked text is greater than the second length threshold, obtaining the similarity between the marked text and the segmented text paragraphs through an LDA topic model.
Specifically, if the length of the marked text is greater than the second length threshold, the marked text is regarded as a long text. The embedding word vector method performs poorly on long texts, because averaging the word vectors of an overly long text weakens the features carried by the individual word vectors. Therefore, the LDA topic model method is used to calculate the similarity. First, a corpus is constructed from the documents in the document library, and an LDA topic model is trained on the corpus; next, the topic distributions of the marked text and of the segmented text paragraphs are obtained through the trained LDA topic model; finally, the Hellinger distance between the topic distribution of the marked text and that of each segmented text paragraph is calculated as the similarity of the marked text and the segmented text paragraph.
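The final comparison step can be illustrated without training a model. For two topic distributions P and Q over the same topics, the Hellinger distance is sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2) / sqrt(2), which lies between 0 (identical) and 1 (no shared topics). The sketch below takes the distributions as plain lists of probabilities; the conversion to a similarity as 1 minus the distance is an assumption made so that higher values mean more similar, and the function names are hypothetical.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def topic_similarity(p, q):
    """Assumed conversion: similarity = 1 - Hellinger distance."""
    return 1.0 - hellinger_distance(p, q)
```

In a full pipeline the lists p and q would come from the trained LDA topic model applied to the marked text and to a segmented text paragraph, respectively.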
Step S23': comparing each similarity with a similarity threshold, and outputting the segmented text paragraphs whose similarity is higher than the similarity threshold, the positions of those paragraphs in the document and the corresponding document names as the query result.
Specifically, when the marked text is a medium-length text or a long text, a similarity threshold needs to be set. Paragraphs below the threshold are regarded as dissimilar and are filtered out directly; paragraphs above the threshold can be returned on demand in descending order of similarity. The final similar paragraphs, the positions of the paragraphs in the document and the corresponding document names are returned as the result.
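A minimal sketch of this filtering and ranking step follows. It assumes each candidate is a dictionary carrying a "similarity" score alongside the paragraph, its position and its document name; the field names, the function name and the optional top_k cut-off are illustrative assumptions.

```python
def filter_and_rank(candidates, threshold, top_k=None):
    """Keep candidates above the threshold, sorted by similarity, highest first.

    Each candidate is a dict with at least a "similarity" key (assumed layout).
    If top_k is given, only that many results are returned.
    """
    kept = [c for c in candidates if c["similarity"] > threshold]
    kept.sort(key=lambda c: c["similarity"], reverse=True)
    return kept if top_k is None else kept[:top_k]
```

The threshold itself has to be tuned per similarity measure, since embedding-based and topic-based scores are not on a common scale.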
Please refer to FIG. 4. FIG. 4 is a schematic flowchart of an embodiment of the method for mark-based querying of similar document paragraphs according to the present invention. With reference to FIG. 4, the application flow of the method is described as follows:
The invention distinguishes three cases, short, medium and long, according to the length of the paragraph marked by the user, and applies a different similarity calculation strategy to marks of each length. The whole process is as follows:
1. First, the length of the text marked by the user is judged. If the length is less than 6 characters, the text is regarded as a short text. Short texts use a full-word matching strategy: all documents in the document library are searched for the marked text, and if the marked text appears in a document, the whole sentence containing the occurrence is returned as a result. Matching of short texts is thus similar to a global search.
2. If the length of the text marked by the user is greater than 6 characters, similarity calculation is needed, and paragraph segmentation is performed first. Paragraph segmentation divides the document into a plurality of segments whose length is similar to that of the marked text, so that each segment can be compared with the marked text in turn. The lengths of the segmented text paragraphs and of the marked text are kept as close as possible so that the similarity calculation is more accurate. Segmentation takes the sentence as its smallest unit, i.e., segmentation generally does not break up a sentence. However, if a sentence is much longer than the marked text, the long sentence is split according to the length of the marked text. For example, if the marked text has a length of 20 characters and a sentence in the document has a length of 45 characters, the sentence is divided into two paragraphs: the first 20 characters and the remaining 25 characters.
3. After segmentation, a text similarity calculation method is selected according to the length of the marked text. If the length of the marked text is less than 25 characters, the marked text is regarded as a medium-length text, and its similarity is calculated with an embedding word vector method. The specific process is as follows: word segmentation is performed on the text paragraph and stop words are removed. A pre-trained word vector model is then used to obtain the embedding word vector of each remaining word. The word vectors of all words are averaged dimension by dimension, and the resulting average word vector is taken as the vector representation of the text paragraph. Vector representations are computed for the marked text and for each segmented text paragraph, and the cosine distance of the two vectors is calculated as the similarity of the two paragraphs.
4. If the length of the marked text is more than 25 characters, the marked text is regarded as a long text. The embedding word vector method performs poorly on long texts, because averaging over an overly long text weakens the features of the word vectors. Therefore, the LDA topic model method is used to calculate the similarity. A corpus is constructed from the documents in the document library, and an LDA topic model is trained on it. The topic distributions of the marked text and of the segmented text paragraphs are obtained through the trained LDA model, and the Hellinger distance between the two topic distributions is calculated as the similarity of the two paragraphs.
5. Threshold filtering: a similarity threshold is set, and paragraphs below the threshold are regarded as dissimilar and filtered out directly. Among all paragraphs in a document that are above the threshold, the K highest-scoring paragraphs (top K) may be returned as the final result.
6. Returning the result: the final similar sentence or paragraph, the position of the sentence or paragraph in the document (offset), and the corresponding document name (document ID) are returned as the result.
Without requiring labeled training data, the invention uses unsupervised models that fully consider semantic information, and it proposes both the concept of fusing multiple matching strategies for different text lengths and a specific way of fusing them, thereby improving the matching effect across the different cases of similar content.
The second embodiment is as follows:
In combination with the method for mark-based querying of similar document paragraphs disclosed in the first embodiment, this embodiment discloses a specific implementation of a system for mark-based querying of similar document paragraphs (hereinafter referred to as the "system").
Referring to fig. 5, the system includes:
length determination unit 1: judging whether the length of the marked text is greater than a first length threshold value or not;
query result obtaining unit 2: if the length of the marked text is smaller than the first length threshold, the query result obtaining unit matches the documents in the document library according to the marked text to obtain a query result and outputs the query result; if the length of the marked text is larger than the first length threshold, the query result obtaining unit performs paragraph segmentation on the documents in the document library, then obtains a query result through similarity comparison, and outputs the query result.
Specifically, if the length of the marked text is smaller than the first length threshold, the query result obtaining unit 2 searches all documents in the document library for the marked text, and outputs the sentence containing the marked text, the position of that sentence in the document, and the corresponding document name as the query result.
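The short-text branch described above can be sketched as a plain substring search. This is an illustrative sketch only: the `Match` record, the `exact_search` function, the in-memory `library` dictionary, and the punctuation-based sentence splitting are all assumptions, not details given in the text.

```python
import re
from typing import Dict, List, NamedTuple


class Match(NamedTuple):
    sentence: str   # sentence that contains the marked text
    offset: int     # character position of the sentence in the document
    doc_id: str     # document name


# Assumed sentence rule: a run of characters ending at common
# Chinese or English sentence punctuation, or at a line break.
_SENTENCE = re.compile(r"[^。！？.!?\n]+[。！？.!?]?")


def exact_search(marked_text: str, library: Dict[str, str]) -> List[Match]:
    """Return every sentence in every document that contains the marked text."""
    if not marked_text:
        return []
    results: List[Match] = []
    for doc_id, text in library.items():
        for m in _SENTENCE.finditer(text):
            raw = m.group()
            if marked_text in raw:
                # Skip leading whitespace so the offset points at the first
                # visible character of the sentence.
                lead = len(raw) - len(raw.lstrip())
                results.append(Match(raw.strip(), m.start() + lead, doc_id))
    return results
```

A limitation of this sketch is that a marked text spanning a sentence boundary is not found, because each sentence is tested separately.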
Specifically, the query result obtaining unit 2 includes:
the segmentation module 21: performing paragraph segmentation on the document according to the length of the marked text to obtain a plurality of segmented text paragraphs;
the similarity calculation module 22: calculating the similarity between the marked text and each segmented text paragraph according to the length of the marked text to obtain a plurality of similarities;
the similarity comparison module 23: comparing each similarity with a similarity threshold, and then outputting, as the query result, the segmented text paragraphs whose similarity is higher than the similarity threshold, the positions of those paragraphs in the document, and the corresponding document names.
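The comparison performed by module 23 can be sketched as follows, together with the optional per-document topK selection mentioned in step 5 of the first embodiment. The `Candidate` record and the `filter_and_rank` function are hypothetical names, and the threshold and K values are left to the caller because the text does not give them.

```python
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    paragraph: str  # segmented text paragraph
    offset: int     # position of the paragraph in the document
    doc_id: str     # document name
    score: float    # similarity to the marked text


def filter_and_rank(candidates: List[Candidate],
                    threshold: float,
                    top_k: int) -> List[Candidate]:
    """Drop candidates at or below the threshold, then keep the top_k per document."""
    by_doc: Dict[str, List[Candidate]] = {}
    for c in candidates:
        if c.score > threshold:
            by_doc.setdefault(c.doc_id, []).append(c)
    results: List[Candidate] = []
    for doc_candidates in by_doc.values():
        ranked = sorted(doc_candidates, key=lambda c: c.score, reverse=True)
        results.extend(ranked[:top_k])
    return results
```

Each returned record already carries the paragraph, its offset, and the document name, which are the three items the module outputs as the query result.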
Specifically, if the length of the marked text is greater than the first length threshold and smaller than the second length threshold, the similarity calculation module 22 obtains the similarity between the marked text and the segmented text paragraphs by calculating embedding word vectors of the marked text and the segmented text paragraphs; if the length of the marked text is greater than the second length threshold, the similarity calculation module 22 obtains the similarity between the marked text and the segmented text paragraphs through an LDA topic model.
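The length-based choice among the three strategies can be sketched as a small dispatch function. The names `Strategy` and `choose_strategy` are hypothetical. The default second threshold of 25 characters comes from the first embodiment; the first threshold is a caller-supplied parameter here. The text does not say how a length exactly equal to a threshold is handled, so assigning it to the longer category is an assumption of this sketch.

```python
from enum import Enum


class Strategy(Enum):
    EXACT_MATCH = "exact_match"  # short text: search the library directly
    EMBEDDING = "embedding"      # medium-length text: averaged word vectors
    LDA = "lda"                  # long text: LDA topic distributions


def choose_strategy(marked_text: str,
                    first_threshold: int,
                    second_threshold: int = 25) -> Strategy:
    """Pick a matching strategy from the character length of the marked text."""
    if first_threshold >= second_threshold:
        raise ValueError("first_threshold must be smaller than second_threshold")
    length = len(marked_text)
    if length < first_threshold:
        return Strategy.EXACT_MATCH
    if length < second_threshold:
        # Assumption: a length equal to first_threshold falls in this branch.
        return Strategy.EMBEDDING
    # Assumption: a length equal to second_threshold is treated as long text.
    return Strategy.LDA
```

For example, with a hypothetical first threshold of 5, `choose_strategy("abc", 5)` selects the exact-match branch, while a 30-character marked text selects the LDA branch.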
For the technical solutions that the system shares with the method for mark query of document similar paragraphs, please refer to the description of the first embodiment; they are not repeated here.
Example three:
Referring to fig. 6, this embodiment discloses a specific implementation of a computer device. The computer device may comprise a processor 81 and a memory 82 in which computer program instructions are stored.
Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
Memory 82 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 82 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 82 may include removable or non-removable (or fixed) media, where appropriate. The memory 82 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 82 is a non-volatile memory. In particular embodiments, memory 82 includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory (FLASH), or a combination of two or more of these, where appropriate. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random-Access Memory (DRAM), where the DRAM may be a Fast Page Mode Dynamic Random-Access Memory (FPMDRAM), an Extended Data Output Dynamic Random-Access Memory (EDODRAM), a Synchronous Dynamic Random-Access Memory (SDRAM), and the like.
The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.
The processor 81 implements the method for mark query of document similar paragraphs in any one of the above embodiments by reading and executing the computer program instructions stored in the memory 82.
In some of these embodiments, the computer device may also include a communication interface 83 and a bus 80. As shown in fig. 6, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.
The communication interface 83 is used for implementing communication between modules, devices, units and/or equipment in the embodiments of the present application. The communication interface 83 may also carry out data communication with other components, such as external equipment, image/data acquisition equipment, a database, external storage, and an image/data processing workstation.
Bus 80 includes hardware, software, or both to couple the components of the computer device to each other. Bus 80 includes, but is not limited to, at least one of the following: Data Bus, Address Bus, Control Bus, Expansion Bus, and Local Bus. By way of example, and not limitation, bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable bus, or a combination of two or more of these. Bus 80 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
In addition, in combination with the method for mark query of document similar paragraphs in the foregoing embodiments, the embodiments of the present application may provide a computer-readable storage medium to implement the method. The computer-readable storage medium stores computer program instructions which, when executed by a processor, implement the method for mark query of document similar paragraphs in any of the above embodiments.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for querying a document similar paragraph by marking is characterized by comprising the following steps:
length determination step S1: judging whether the length of the marked text is greater than a first length threshold value or not;
query result obtaining step S2: if the length of the marked text is smaller than the first length threshold, matching the documents in the document library according to the marked text to obtain a query result and outputting the query result; or
query result obtaining step S2': if the length of the marked text is larger than the first length threshold, the document in the document library is subjected to paragraph segmentation, and then a query result is obtained through similarity comparison and output.
2. The method for querying for tags of similar paragraphs in documents according to claim 1, wherein said query result obtaining step S2 includes: if the length of the marked text is smaller than the first length threshold, searching for the marked text in all the documents in the document library, and outputting, as the query result, the sentence containing the marked text, the position of the sentence in the document and the corresponding document name.
3. The method for querying for tags of similar paragraphs in documents according to claim 1, wherein said query result obtaining step S2' includes:
dividing step S21': performing paragraph segmentation on the document according to the length of the marked text to obtain a plurality of segmented text paragraphs;
similarity calculation step S22': calculating the similarity between the marked text and each segmented text paragraph according to the length of the marked text to obtain a plurality of similarities;
similarity comparison step S23': comparing each similarity with a similarity threshold, and then outputting, as the query result, the segmented text paragraphs whose similarity is higher than the similarity threshold, the positions of those paragraphs in the document and the corresponding document names.
4. The method for querying for the mark of the document similar section as claimed in claim 3, wherein the similarity calculating step S22' comprises:
step S221': if the length of the marked text is larger than the first length threshold and smaller than a second length threshold, obtaining the similarity between the marked text and the segmented text paragraphs by calculating embedding word vectors of the marked text and the segmented text paragraphs; or
long text similarity calculation step S222': if the length of the marked text is greater than the second length threshold, obtaining the similarity between the marked text and the segmented text paragraphs through an LDA topic model.
5. A system for query markup of similar paragraphs of a document, comprising:
a length judgment unit: judging whether the length of the marked text is greater than a first length threshold value or not;
a query result obtaining unit: if the length of the marked text is smaller than the first length threshold, the query result obtaining unit matches the documents in the document library according to the marked text to obtain a query result and outputs the query result; if the length of the marked text is larger than the first length threshold, the query result obtaining unit performs paragraph segmentation on the documents in the document library, then obtains a query result through similarity comparison, and outputs the query result.
6. The system for querying for tokens of similar paragraphs according to claim 5, wherein if the length of the marked text is smaller than the first length threshold, the query result obtaining unit searches all documents in the document library for the marked text, and outputs, as the query result, the sentence containing the marked text, the position of the sentence in the document, and the corresponding document name.
7. The system for query with markup of document similar paragraphs according to claim 5, wherein said query result obtaining unit comprises:
a segmentation module: performing paragraph segmentation on the document according to the length of the marked text to obtain a plurality of segmented text paragraphs;
a similarity calculation module: calculating the similarity between the marked text and each segmented text paragraph according to the length of the marked text to obtain a plurality of similarities;
a similarity comparison module: comparing each similarity with a similarity threshold, and then outputting, as the query result, the segmented text paragraphs whose similarity is higher than the similarity threshold, the positions of those paragraphs in the document and the corresponding document names.
8. The system of claim 7, wherein if the length of the marked text is greater than the first length threshold and less than a second length threshold, the similarity calculation module obtains the similarity between the marked text and the segmented text paragraphs by calculating embedding word vectors of the marked text and the segmented text paragraphs; if the length of the marked text is larger than the second length threshold, the similarity calculation module obtains the similarity between the marked text and the segmented text paragraphs through an LDA topic model.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the markup query method for document similarity paragraphs according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing a method for tagged query of document similar paragraphs according to any of claims 1 to 4.
CN202110388914.XA 2021-04-12 2021-04-12 Method, system, equipment and storage medium for querying marks of document similar paragraphs Pending CN113139374A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110388914.XA CN113139374A (en) 2021-04-12 2021-04-12 Method, system, equipment and storage medium for querying marks of document similar paragraphs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110388914.XA CN113139374A (en) 2021-04-12 2021-04-12 Method, system, equipment and storage medium for querying marks of document similar paragraphs

Publications (1)

Publication Number Publication Date
CN113139374A true CN113139374A (en) 2021-07-20

Family

ID=76811178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110388914.XA Pending CN113139374A (en) 2021-04-12 2021-04-12 Method, system, equipment and storage medium for querying marks of document similar paragraphs

Country Status (1)

Country Link
CN (1) CN113139374A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115687840A (en) * 2023-01-03 2023-02-03 上海朝阳永续信息技术股份有限公司 Method, apparatus and storage medium for processing predetermined type information in web page

Similar Documents

Publication Publication Date Title
CN109145153B (en) Intention category identification method and device
CN108829893B (en) Method and device for determining video label, storage medium and terminal equipment
CN107463605B (en) Method and device for identifying low-quality news resource, computer equipment and readable medium
WO2019153551A1 (en) Article classification method and apparatus, computer device and storage medium
CN111581355B (en) Threat information topic detection method, device and computer storage medium
CN104881458B (en) A kind of mask method and device of Web page subject
CN109635157B (en) Model generation method, video search method, device, terminal and storage medium
US20240126799A1 (en) Topic segmentation of image-derived text
US20160188569A1 (en) Generating a Table of Contents for Unformatted Text
CN107861948B (en) Label extraction method, device, equipment and medium
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN109271624B (en) Target word determination method, device and storage medium
CN112464640A (en) Data element analysis method, device, electronic device and storage medium
CN112347758A (en) Text abstract generation method and device, terminal equipment and storage medium
CN114995903B (en) Class label identification method and device based on pre-training language model
CN112784009A (en) Subject term mining method and device, electronic equipment and storage medium
CN112232070A (en) Natural language processing model construction method, system, electronic device and storage medium
CN113204956B (en) Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
US11966455B2 (en) Text partitioning method, text classifying method, apparatus, device and storage medium
CN113139374A (en) Method, system, equipment and storage medium for querying marks of document similar paragraphs
CN114090781A (en) Text data-based repulsion event detection method and device
CN113139383A (en) Document sorting method, system, electronic equipment and storage medium
CN112949299A (en) Method and device for generating news manuscript, storage medium and electronic device
CN109815996B (en) Scene self-adaptation method and device based on recurrent neural network
CN112446204A (en) Document tag determination method, system and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination