WO2021037012A1 - Text information navigation and browsing method, apparatus, server and storage medium - Google Patents
- Publication number
- WO2021037012A1 (PCT/CN2020/110994)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- information
- text
- similarity
- key feature
- browsing
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
Definitions
- the present disclosure relates to the field of information processing technology, for example, to a method, device, server, and storage medium for navigating and browsing text information.
- the present disclosure provides a method, device, server and storage medium for navigating and browsing text information, so as to realize automatic searching for similar or identical content in at least two documents to improve comparison efficiency.
- a method for navigating and browsing text information including:
- a navigation and browsing device for text information including:
- a first obtaining module configured to obtain a first text, wherein the first text includes first information
- the second obtaining module is configured to obtain a second text, wherein the second text includes second information
- a matching module configured to match the first information and the second information to determine the similarity between the second information and the first information
- the navigation and browsing module is configured to navigate and browse the second text according to the similarity.
- a server including:
- one or more processors;
- a storage device configured to store one or more programs;
- when the one or more programs are executed by the one or more processors, the one or more processors implement the aforementioned method for navigating and browsing text information.
- a computer-readable storage medium is also provided, on which a computer program is stored, and when the program is executed by a processor, the above-mentioned method for navigating and browsing text information is realized.
- FIG. 1 is a schematic flowchart of a method for navigating and browsing text information according to Embodiment 1 of the present invention
- FIG. 2 is a schematic flowchart of a method for navigating and browsing text information according to Embodiment 2 of the present invention
- FIG. 3 is a schematic flowchart of another method for navigating and browsing text information according to Embodiment 2 of the present invention.
- FIG. 4 is a schematic flowchart of another method for navigating and browsing text information according to Embodiment 2 of the present invention.
- FIG. 5 is a schematic structural diagram of a text information navigation and browsing device provided in the third embodiment of the present invention.
- Fig. 6 is a schematic structural diagram of a server provided in the fourth embodiment of the present invention.
- first, second, etc. may be used herein to describe various directions, actions, steps or elements, etc., but these directions, actions, steps or elements are not limited by these terms. These terms are only used to distinguish a first direction, action, step or element from another direction, action, step or element.
- first information may be referred to as second information
- second information may be referred to as first information. Both the first information and the second information are information, but they are not the same information.
- the terms “first”, “second”, etc. cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Therefore, the features defined with “first” and “second” may explicitly or implicitly include one or more of these features.
- “multiple” and “batch” mean at least two, such as two, three, etc., unless specifically defined otherwise.
- Fig. 1 is a schematic flow chart of a method for navigating and browsing text information according to Embodiment 1 of the present invention, which can be applied to a scenario where text is compared.
- the method can be executed by a text information navigation and browsing device, which can be implemented by software and/or hardware and can be integrated on the server.
- the method for navigating and browsing text information provided in the first embodiment of the present invention includes:
- the first text refers to the text that needs to be analyzed and compared.
- the first text can be a technical document, such as a dissertation, a patent document, a technical submission, or a technical solution for risk analysis, or part of the content of a patent document or a technical submission, such as the text of the technical solution described in the claims or in technical disclosure documents, which is not limited here.
- the first text is the claim.
- the first information refers to part or all of the information in the first text, and there is no restriction here.
- the first information is related information describing the technical solution in the first text. Taking the first text as the claim as an example, the first information can be one or more features in the claim, a sentence in the claim, or the entire claim, which is not limited here.
- the first information includes but is not limited to one or more of words, sentences or paragraphs.
- the user can select the first information in the first text as needed, or the system can select it by default. There is no restriction here.
- there may be one or more pieces of first information. Taking the first information as a claim as an example, when there are multiple pieces of first information, multiple claims in the first text can be matched at the same time to find similar second information in the second text, which greatly improves the efficiency of document comparison.
- the second text is a text that needs to be compared with the first text to determine whether it is similar to the technical solution recorded in the first text.
- the second text can be technical documents, books, patent documents, etc., or part of the content of technical documents, books, and patent documents, which is not limited here.
- the second text is the target comparison document.
- the second information refers to part or all of the information in the second text. There are one or more second information.
- the second information is related information describing the technical solution in the second text. Taking the second text as a similar patent document as an example, the second information can be the entire specification, a paragraph of the entire specification, or a sentence or word in the specification, which is not limited here.
- the second information includes one or more of words, sentences, or paragraphs.
- the second text can be obtained by manually importing the existing text into the navigation and browsing device of the text information. For example, if you find a text that you think is similar to the first text, you can download the text and import it into the navigation and browsing device of the text information to compare with the first information in the first text to determine the similar part and the corresponding position.
- the similarity refers to the degree of similarity between the first information and the second information. Matching refers to comparing the first information with the second information to determine the similarity.
- the similarity degree can be expressed in the form of percentage or color. For example, green represents a low degree of similarity, and red represents a high degree of similarity. There is no restriction on the form of similarity here.
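The percentage and color forms described above can be sketched as follows. This is a minimal illustration only: the function name `similarity_label`, the intermediate "yellow" band, and the 0.4 and 0.7 cut-offs are assumptions made for the example and are not specified in the disclosure.

```python
def similarity_label(score: float) -> tuple[str, str]:
    """Return (percentage text, color name) for a similarity score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    percent = f"{score * 100:.0f}%"
    # Hypothetical cut-offs: below 0.4 is low (green), 0.7 and above is high (red).
    if score < 0.4:
        color = "green"
    elif score < 0.7:
        color = "yellow"
    else:
        color = "red"
    return percent, color
```

Either element of the returned pair can be shown on its own, since the form of the similarity is not restricted.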
- S140 Navigate and browse the second text according to the similarity.
- Navigating browsing refers to locating second information similar to the first information in the second text by matching similarity, so as to facilitate quick browsing without manual searching.
- you can set navigation marks of different colors on the side of the second text.
- the navigation marks correspond to the row positions of the second information similar to the first information.
- the user can quickly switch to the second information with higher similarity to the first information through the navigation marks and browse that second information.
- in an alternative embodiment, a quick browsing window can also be set to summarize the second information similar to the first information and sort the second information according to the similarity, as the browsing index of the second text.
- the user can click on the corresponding summary to quickly browse the second information in the second text.
- the second information in the second text similar to the first information can be quickly obtained, which greatly improves the efficiency of comparison.
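A browsing index of this kind might be built roughly as below. The sketch assumes that matching has already produced `(line, similarity, text)` triples; `NavigationMark`, `build_browsing_index`, the 0.5 threshold, and the 40-character summary length are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class NavigationMark:
    line: int          # row position of the second information in the second text
    similarity: float  # similarity to the first information, in [0, 1]
    summary: str       # short excerpt shown in the quick browsing window


def build_browsing_index(matches, threshold=0.5, summary_length=40):
    """Keep sufficiently similar passages and sort them, most similar first."""
    marks = [
        NavigationMark(line, score, text[:summary_length])
        for line, score, text in matches
        if score >= threshold
    ]
    return sorted(marks, key=lambda m: m.similarity, reverse=True)
```

Each mark carries the row position, so clicking a summary can jump straight to the corresponding line of the second text.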
- step S140 navigating and browsing the second text according to the similarity may include:
- the second information and the first information are displayed on a navigation browsing interface according to the similarity.
- the navigation browsing interface refers to an interface that displays similarity matching results, and is used to find similar locations and content on the navigation browsing interface.
- the similarity matching result is a result of the similarity of the first information corresponding to one or more second information, and the similarity matching result reflects the similarity between the second information in the second text and the first information in the first text.
- the similarity matching result can display, in full text, one or more second texts similar to the first information describing the technical solution; it can also display only the similar parts of one or more second texts similar to the first information, which is not limited here.
- step S130 it may include:
- Chapter refers to part of the content in the second text.
- the chapters can be chapters such as claims, descriptions, etc., and can also be background technology, descriptions of drawings, and specific implementations. There is no restriction on the division of chapters here.
- step S140 it may include:
- the navigation and browsing interface also includes a similarity mark.
- the similar parts of the first information and the second information can be highlighted, which can help the user locate the similar parts as soon as possible.
- the content may be emphasized by highlighting, and there is no limitation here.
- a switch control is included in the search result, and the switch control is used to control switch display of a plurality of second information.
- the switch control can switch to the previous or next item, and can also switch to the more similar or next most similar parts. There is no restriction on how to switch the display here.
- obtaining the second text includes: receiving search information based on the first text; and searching the second text similar to the first text in a database based on the search information.
- the search information can be the text or graphic part of the first text about the first technical feature, or it can be automatically generated based on the first information, which is not limited here.
- the first text includes one or more pieces of first information; obtaining a second text, where the second text includes one or more pieces of second information;
- the first information and the second information are matched to determine the similarity between the second information and the first information; and the second text is navigated and browsed according to the similarity.
- information similar or identical to the first information can be found automatically in the second text, which quickly confirms which parts of the second text are similar to the first information in the first text, without manually looking for content related to the first information in the second text. The details of the matching results can be confirmed purposefully, which improves the efficiency of document retrieval. This solves the problem that manually searching for similar or identical content in comparison documents is very inefficient, and achieves automatic searching for similar or identical content to improve the efficiency of comparing documents.
- Fig. 2 is a schematic flowchart of a method for navigating and browsing text information according to the second embodiment of the present invention. This embodiment is described on the basis of the above technical solution, and is suitable for the scenario of comparing texts.
- the method can be executed by a text information navigation and browsing device, which can be implemented in software and/or hardware, and can be integrated on a server.
- the method for navigating and browsing text information provided by the second embodiment of the present invention includes:
- the first text refers to the text that needs to be analyzed and compared.
- the first text can be a technical document, such as a dissertation, a patent document, or a technical submission, or part of the content in a patent document or a technical submission, such as claims and technical submissions.
- the first text is the claim.
- the first information refers to part or all of the information in the first text, and there is no restriction here.
- the first information is related information describing the technical solution in the first text. Taking the first text as the claim as an example, the first information can be one or more features in the claim, a sentence in the claim, or the entire claim, which is not limited here.
- the first information includes but is not limited to one or more of words, sentences or paragraphs.
- the second text is a text that needs to be compared with the first text to determine whether it is similar to the technical solution recorded in the first text.
- the second text can be technical documents, books, patent documents, etc., or part of the content of technical documents, books, and patent documents, which is not limited here.
- the second text is the target comparison document.
- the second information refers to part or all of the information in the second text. There are one or more second information.
- the second information is related information describing the technical solution in the second text. Taking the second text as a similar patent document as an example, the second information can be the entire specification, a paragraph of the entire specification, or a sentence or word in the specification, which is not limited here.
- the second information includes one or more of words, sentences, or paragraphs.
- the first key feature refers to the feature related to the first technical feature in the first information.
- the first information may be one or more of words, sentences or paragraphs, and the first key feature may also be one or more of words, sentences or paragraphs. If the first information is a word, the first key feature is a word; if the first information is a sentence, the first key feature can be a sentence and/or a word; if the first information is a paragraph, the first key feature can be a paragraph , Sentences and/or words.
- the first key feature is a keyword.
- the first key feature can be extracted through the key feature extraction model.
- the key feature extraction model is a text-rank model.
- the text-rank model is a graph-based ranking model for text. By dividing the text into multiple constituent units (words, sentences) and building a graph model, the voting mechanism is used to rank important components in the text. The information of a single document itself can be used to extract keywords and abstracts.
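The graph-and-voting idea can be sketched in a few lines. This is a simplified illustration of TextRank-style keyword ranking, not the model used in the disclosure: the stop-word list, the co-occurrence window of 3, the damping factor of 0.85, and the function name `textrank_keywords` are all assumptions.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption, not taken from the disclosure).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it", "which", "when"}


def textrank_keywords(text, top_k=5, window=3, damping=0.85, iterations=50):
    """Rank words by iterative voting over a word co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    # Build an undirected graph: words within `window` positions are linked.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    # Each word repeatedly receives a share of the score of every neighbor.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

Words that co-occur with many other words collect the most votes and rise to the top, which is why only the document itself is needed as input.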
- for example, if the first information is "a UAV emergency parachute opening system, which is used to open the parachute when the UAV fails, characterized in that the UAV emergency parachute opening system includes a main control module, a detection module, a power management module, and a parachute opening module", the first key feature can be UAV, parachute opening system, main control module, detection module, power management module, parachute opening module, etc., or it can be "the UAV emergency parachute opening system includes a main control module, a detection module, a power management module, and a parachute opening module". There are no restrictions here.
- the second key feature refers to the feature related to the second technical feature in the second information.
- the second information may be one or more of words, sentences or paragraphs, and the second key feature may also be one or more of words, sentences or paragraphs. If the second information is a word, the second key feature is a word; if the second information is a sentence, the second key feature can be a sentence and/or a word; if the second information is a paragraph, the second key feature can be a paragraph , Sentences and/or words.
- the second key feature is a keyword.
- the second key feature may be a word, sentence, or paragraph.
- the second key feature can be extracted through the key feature extraction model.
- the key feature extraction model is a text-rank model.
- the text-rank model is a graph-based ranking model for text. By dividing the text into multiple constituent units (words, sentences) and building a graph model, the voting mechanism is used to rank important components in the text. The information of a single document itself can be used to extract keywords and abstracts.
- the second key feature can be a word, sentence, and/or paragraph. That is, when the first key feature is a word, the word of the first key feature can be compared with the words, sentences and/or paragraphs of the second key feature; there is no restriction here. Exemplarily, if the first key feature is an unmanned aerial vehicle and the second key feature is an unmanned aerial vehicle, the first key feature and the second key feature can be matched to determine the similarity between the first information and the second information.
- the similarity can be expressed in the form of percentage or color. For example, green represents low similarity, and red represents high similarity. There is no restriction on the form of similarity here.
- the similarity may be determined by a cosine similarity model and/or a word vector similarity summation model.
- the similarity can be determined through the word vector similarity summation model.
- the word vector similarity summation model refers to the model obtained by using the word vector similarity summation training;
- the cosine similarity model refers to a model trained using the cosine similarity algorithm. This embodiment does not limit the algorithm for calculating the similarity.
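Both measures can be illustrated on plain lists of numbers. The cosine formula below is the standard one; the disclosure does not define the word vector similarity summation in detail, so `summed_word_similarity` is only one plausible reading (for each word vector of the first key feature, take its best cosine match among the second key feature's word vectors, then average), and both function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def summed_word_similarity(first_vectors, second_vectors):
    """Assumed reading: sum each first word's best match, normalized by word count."""
    if not first_vectors or not second_vectors:
        return 0.0
    total = sum(
        max(cosine_similarity(u, v) for v in second_vectors) for u in first_vectors
    )
    return total / len(first_vectors)
```

In practice the input vectors would come from a trained embedding model; here they are left as arguments so the arithmetic stays visible.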
- Navigating browsing refers to locating second information similar to the first information in the second text by matching similarity, so as to facilitate quick browsing without manual searching.
- step S250 matching the first key feature and the second key feature to determine the similarity between the second information and the first information can be replaced by:
- S251 Perform vectorization on the first key feature based on the trained first comparison model to obtain a first vector result.
- the first comparison model refers to a model that vectorizes the first key feature.
- vectorization refers to expressing text as a series of vectors that can express the semantics of the text.
- the first comparison model includes a word-to-vector (Word2vec) model and/or a recursive autoencoder model based on a recursive neural network.
- S252 Perform vectorization on the second key feature based on the trained second comparison model to obtain a second vector result.
- the second comparison model refers to a model that vectorizes the second key feature.
- the second comparison model includes a Word2vec model and/or a recursive autoencoder model based on a recursive neural network.
- when the second key feature is a word, the second comparison model includes a Word2vec model; when the second key feature is a sentence or a paragraph, the second comparison model includes a recursive autoencoder model based on a recursive neural network.
- the first comparison model includes the Word2vec model and the recursive autoencoder model based on a recursive neural network.
- the first comparison model and the second comparison model may use the same model or the same type of model.
- the similarity is determined only after the first key feature and the second key feature are vectorized. Rather than mechanically comparing words, the similarity is determined from the semantics of the key features, so the similarity matching result is more accurate.
- step S230 extracting a first key feature from the first information includes:
- the preset rule refers to a rule for processing the first information, and the first processing result is obtained by processing the first information through the preset rule.
- processing the first information based on preset rules to obtain the first processing result may include: acquiring text information, symbol information, and/or text structure information of the first information; and processing the first information based on the text information, symbol information, and/or text structure information to obtain the first processing result.
- the text information includes stop words.
- stop words include “the”, “and”, “or”, etc., which are not limited here.
- processing the first information based on the text information to obtain the first processing result includes: analyzing the first information to obtain the stop words in it; and extracting relevant information before and/or after the stop words.
- the first information is a sentence or paragraph.
- the text information may also include other related words, etc., which are not limited here.
- the symbol information includes semicolon and/or comma. Processing the first information based on the symbol information to obtain the first processing result includes: extracting related information before and/or after the semicolon and/or comma. Exemplarily, if the first information is "the drone includes a main control module and a flight module; the flight module includes a power supply unit", then relevant information such as “the main control module, flight module, and the flight module” is extracted. Optionally, the symbol information may also include other identifying symbols, which is not limited here.
- the text structure information includes a preamble part and a characteristic part
- processing the first information based on the text structure information to obtain the first processing result includes: extracting relevant information of the preamble part and/or the characteristic part.
- relevant information such as "unmanned aerial vehicle, flight module” is extracted.
- the text structure information may also include other text structure information, which is not limited here.
- the first information is processed by preset rules to extract key features
- the extraction method is simple and effective, and the efficiency of retrieving files is improved.
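The stop-word and symbol rules can be combined in a short sketch. It is an illustration under assumptions: the function name `extract_candidates` is hypothetical, and the stop-word set adds "a" and "includes" to the "the", "and", "or" named above so that the drone example splits cleanly.

```python
import re

# "the", "and", "or" come from the description; "a" and "includes" are
# added here only for illustration.
STOP_WORDS = {"the", "and", "or", "a", "includes"}


def extract_candidates(first_information):
    """Split on semicolons/commas, then cut each clause at stop words."""
    candidates = []
    # Symbol rule: semicolons and commas separate clauses.
    for clause in re.split(r"[;,]", first_information):
        fragment = []
        # Stop-word rule: a stop word closes the fragment collected so far.
        for word in clause.split():
            if word.lower() in STOP_WORDS:
                if fragment:
                    candidates.append(" ".join(fragment))
                fragment = []
            else:
                fragment.append(word)
        if fragment:
            candidates.append(" ".join(fragment))
    # Drop repeats while keeping first-seen order.
    return list(dict.fromkeys(candidates))
```

Applied to the drone sentence used as an example above, this yields the fragments "drone", "main control module", "flight module", and "power supply unit", which could then serve as the first processing result.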
- the first text includes one or more pieces of first information
- obtaining a second text where the second text includes one or more pieces of second information
- the first information and the second information are matched to determine the similarity between the second information and the first information
- the second text is navigated and browsed according to the similarity.
- information similar or identical to the first information can be found automatically in the second text, which quickly confirms which parts of the second text are similar to the first information in the first text, without manually looking for content related to the first information in the second text. The details of the matching results can be confirmed purposefully, which improves the efficiency of retrieving documents.
- FIG. 5 is a schematic structural diagram of a text information navigation and browsing device provided in the third embodiment of the present invention. This embodiment can be applied to a scenario where text is compared.
- the device can be implemented by software and/or hardware, and can be integrated on the server.
- the apparatus for navigation and browsing of text information may include a first obtaining module 310, a second obtaining module 320, a matching module 330, and a navigation browsing module 340, wherein:
- the first obtaining module 310 is configured to obtain a first text, where the first text includes one or more pieces of first information; the second obtaining module 320 is configured to obtain a second text, where the second text includes one or more pieces of second information; the matching module 330 is configured to match the first information and the second information to determine the similarity between the second information and the first information; and the navigation browsing module 340 is configured to navigate and browse the second text according to the similarity.
- the navigation browsing module 340 includes: a display unit configured to display the first information and the second information on a navigation browsing interface according to the similarity.
- the matching module 330 includes: a first extraction unit configured to extract a first key feature from the first information; a second extraction unit configured to extract a second key feature from the second information; and a similarity matching unit configured to match the first key feature and the second key feature to determine the similarity between the second information and the first information.
- the device for navigating and browsing text information further includes: a first vectorization module configured to perform vectorization on the first key feature based on the trained first comparison model to obtain a first vector result; and a second vectorization module configured to perform vectorization on the second key feature based on the trained second comparison model to obtain a second vector result.
- the matching module 330 is configured to match the first vector result and the second vector result to determine the similarity between the second information and the first information.
- the first extraction unit includes: a first processing subunit configured to process the first information based on a preset rule to obtain a first processing result; and use the first processing result as the first key feature .
- the first processing subunit is configured to obtain text information, symbol information, and/or text structure information of the first information; and to process the first information based on the text information, symbol information, and/or text structure information to obtain the first processing result.
- the text information includes stop words
- the first processing subunit is configured to analyze the stop words in the first information; and extract relevant information before and/or after the stop words.
- the symbol information includes a semicolon and/or a comma
- the first processing subunit is configured to extract related information before and/or after the semicolon and/or the comma.
- the text structure information includes a preamble part and a characteristic part
- the first processing subunit is configured to extract relevant information of the preamble part and/or the characteristic part.
- the second acquisition module 320 includes: a receiving unit configured to receive retrieval information based on the first text; and a retrieval unit configured to retrieve, in a database and based on the retrieval information, the second text similar to the first text.
- the apparatus for navigating and browsing text information further includes: a chapter selection module configured to receive chapter selection information of the second text; and extract a corresponding chapter based on the chapter selection information as the second information.
- the device for navigating and browsing text information further includes: a sorting module configured to sort the second information according to the similarity.
- the navigation browsing interface further includes: a switching control, the switching control is set to control the switching display of a plurality of second information.
- the navigation browsing interface further includes a similar identifier
- the display unit includes a highlight display unit configured to highlight similar parts of the first information and the second information.
- the key feature is extracted through a text-rank model.
- the similarity is determined by a cosine similarity model and/or a word vector similarity summation model.
- the comparison model includes a Word2vec model and/or a recursive neural network recursive autoencoder model.
- the first information and the second information include one or more of words, sentences or paragraphs.
- the first text is a claim.
- the second text is a target comparison document.
- the navigation and browsing device for text information provided by the embodiment of the present invention can execute the navigation and browsing method for text information provided by any embodiment of the present invention, and has the corresponding functional modules and effects for the execution method.
- Fig. 6 is a schematic structural diagram of a server provided in the fourth embodiment of the present invention.
- Figure 6 shows a block diagram of an exemplary server 612 suitable for implementing embodiments of the present invention.
- the server 612 shown in FIG. 6 is only an example and should not limit the function or scope of application of the embodiments of the present invention.
- the server 612 takes the form of a general-purpose server.
- the components of the server 612 may include, but are not limited to: one or more processors 616, a storage device 628, and a bus 618 connecting different system components (including the storage device 628 and the processor 616).
- the bus 618 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
- these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
- the server 612 includes a variety of computer system readable media. These media may be any available media that can be accessed by the server 612, including volatile and non-volatile media, removable and non-removable media.
- the storage device 628 may include a computer system readable medium in the form of a volatile memory, such as a random access memory (RAM) 630 and/or a cache memory 632.
- the server 612 may include other removable/non-removable, volatile/non-volatile computer system storage media.
- the storage system 634 may be configured to read and write a non-removable, non-volatile magnetic medium (not shown in FIG. 6, usually referred to as a "hard drive").
- a disk drive configured to read and write a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive configured to read and write a removable non-volatile optical disc (such as a compact disc read-only memory (CD-ROM)) can also be provided.
- each drive can be connected to the bus 618 through one or more data media interfaces.
- the storage device 628 may include at least one program product, and the program product has a set of (for example, at least one) program modules, and these program modules are configured to perform the functions of the embodiments of the present invention.
- a program/utility tool 640 having a set of (at least one) program modules 642 may be stored in, for example, the storage device 628.
- Such program modules 642 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
- the program module 642 generally executes the functions and/or methods in the embodiments described in the present disclosure.
- the server 612 can also communicate with one or more external devices 614 (such as a keyboard, a pointing device, a display 624, etc.), with one or more terminals that enable users to interact with the server 612, and/or with any terminal (such as a network card, a modem, etc.) that enables the server 612 to communicate with one or more other computing terminals. Such communication can be performed through an input/output (I/O) interface 622.
- the server 612 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 620. As shown in FIG. 6, the network adapter 620 communicates with other modules of the server 612 through the bus 618. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with the server 612, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, redundant array of independent disks (RAID) systems, tape drives, and data backup storage systems.
- the processor 616 executes a variety of functional applications and data processing by running programs stored in the storage device 628, for example, to implement a method for navigating and browsing text information provided by any embodiment of the present invention.
- the method may include: obtaining a first text, where the first text includes one or more pieces of first information; obtaining a second text, where the second text includes one or more pieces of second information; matching the first information and the second information to determine the similarity between the second information and the first information; and navigating and browsing the second text according to the similarity.
- in this way, information that is similar or identical to the first information can be found automatically in the second text, so a user can quickly confirm which parts of the second text are similar to the first information in the first text without manually searching the second text for related content. The details of the matching results can then be checked purposefully, which improves the efficiency of document retrieval.
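The four steps of the method (obtain both texts, match, rank by similarity) can be sketched end to end as follows. This is only an illustration: it substitutes plain token counts for the key-feature extraction and trained comparison models described in the source, and the function names and the sentence-splitting rule are hypothetical.

```python
import math
import re
from collections import Counter


def split_information(text):
    """Split a text into sentence-level pieces of information.
    Hypothetical rule: break on '.', ';', '!' and '?'."""
    return [piece.strip() for piece in re.split(r"[.;!?]+", text) if piece.strip()]


def token_counts(piece):
    """Bag-of-words counts, standing in for key features or word vectors."""
    return Counter(re.findall(r"[a-z0-9]+", piece.lower()))


def count_cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_matches(first_text, second_text):
    """For each piece of first information, list the pieces of second
    information as (similarity, piece) pairs, most similar first."""
    second_pieces = split_information(second_text)
    ranked = {}
    for first in split_information(first_text):
        first_counts = token_counts(first)
        scored = [(count_cosine(first_counts, token_counts(s)), s) for s in second_pieces]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        ranked[first] = scored
    return ranked


first_text = "A bus connects the processor and the storage device."
second_text = "The weather is mild today. A bus connects the processor and the memory."
ranked = rank_matches(first_text, second_text)
```

A navigation interface could then step through each ranked list in order, which corresponds to the sorting module and switching control described for the device.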
- the fifth embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method for navigating and browsing text information provided in any embodiment of the present invention is implemented.
- the method may include: obtaining a first text, the first text including one or more pieces of first information; obtaining a second text, the second text including one or more pieces of second information; matching the first information and the second information to determine the similarity between the second information and the first information; and navigating and browsing the second text according to the similarity.
- the computer-readable storage medium of the embodiment of the present invention may adopt any combination of one or more computer-readable media.
- the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
- the computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
- Examples of computer-readable storage media include: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
- the computer-readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.
- the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and computer-readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
- the computer-readable medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device.
- the program code contained on the storage medium can be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), etc., or any suitable combination of the foregoing.
- the computer program code for performing the operations of the present disclosure can be written in one or more programming languages or a combination thereof.
- the programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
- the program code can be executed entirely on the user's computer, partly on the user's computer, executed as an independent software package, partly on the user's computer and partly executed on a remote computer, or entirely executed on the remote computer or terminal.
- the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Claims (31)
- A method for navigating and browsing text information, comprising: acquiring a first text, wherein the first text includes first information; acquiring a second text, wherein the second text includes second information; matching the first information and the second information to determine a similarity between the second information and the first information; and navigating and browsing the second text according to the similarity.
- The method of claim 1, wherein navigating and browsing the second text according to the similarity comprises: displaying the second information and the first information on a navigation browsing interface according to the similarity.
- The method of claim 1, wherein matching the first information and the second information to determine the similarity between the second information and the first information comprises: extracting a first key feature from the first information; extracting a second key feature from the second information; and matching the first key feature and the second key feature to determine the similarity between the second information and the first information.
- The method of claim 3, further comprising, before matching the first key feature and the second key feature to determine the similarity between the second information and the first information: vectorizing the first key feature based on a trained first comparison model to obtain a first vector result; and vectorizing the second key feature based on a trained second comparison model to obtain a second vector result; wherein matching the first key feature and the second key feature to determine the similarity between the second information and the first information comprises: matching the first vector result and the second vector result to determine the similarity between the second information and the first information.
- The method of claim 3, wherein extracting the first key feature from the first information comprises: processing the first information based on a preset rule to obtain a first processing result; and using the first processing result as the first key feature.
- The method of claim 5, wherein processing the first information based on the preset rule to obtain the first processing result comprises: acquiring at least one of text information, symbol information, and text structure information of the first information; and processing the first information based on the acquired information to obtain the first processing result.
- The method of claim 6, wherein the text information includes stop words, and processing the first information based on the text information to obtain the first processing result comprises: analyzing the first information to obtain the stop words in the first information; extracting at least one of related information before a stop word and related information after the stop word; and using the extracted related information as the first processing result.
- The method of claim 6, wherein the symbol information includes at least one of a semicolon and an enumeration comma, and processing the first information based on the symbol information to obtain the first processing result comprises: extracting at least one of the following: related information before the semicolon, related information before the enumeration comma, related information after the semicolon, and related information after the enumeration comma; and using the extracted related information as the first processing result.
- The method of claim 6, wherein the text structure information includes a preamble part and a characterizing part, and processing the first information based on the text structure information to obtain the first processing result comprises: extracting at least one of related information of the preamble part and related information of the characterizing part; and using the extracted related information as the first processing result.
- The method of claim 1, further comprising, before matching the first information and the second information to determine the similarity between the second information and the first information: receiving chapter selection information for the second text; and extracting a corresponding chapter as the second information based on the chapter selection information.
- The method of claim 2, further comprising, after displaying the second information and the first information on the navigation browsing interface according to the similarity: sorting a plurality of pieces of second information according to the similarity.
- The method of claim 2, wherein the navigation browsing interface further includes a switching control, and the switching control is used to switch the display among a plurality of pieces of second information.
- The method of claim 2, wherein the navigation browsing interface further includes a similarity identifier, and displaying the first information and the second information on the navigation browsing interface according to the similarity comprises: highlighting similar parts of the first information and the second information on the navigation browsing interface.
- The method of claim 3, wherein the first key feature and the second key feature are both extracted by a text-rank model.
- The method of claim 1, wherein the similarity is determined by at least one of a cosine similarity model and a word vector similarity summation model.
- The method of claim 4, wherein the first comparison model and the second comparison model each include at least one of a word vector model and a recursive neural network model.
- The method of claim 1, wherein the first information and the second information each include at least one of words, sentences, and paragraphs.
- The method of claim 1, wherein the first text is a set of claims.
- The method of claim 1, wherein the second text is a target comparison document.
- A device for navigating and browsing text information, comprising: a first acquisition module, configured to acquire a first text, wherein the first text includes first information; a second acquisition module, configured to acquire a second text, wherein the second text includes second information; a matching module, configured to match the first information and the second information to determine a similarity between the second information and the first information; and a navigation and browsing module, configured to navigate and browse the second text according to the similarity.
- The device of claim 20, wherein the navigation and browsing module comprises: a display unit, configured to display the second information and the first information on a navigation browsing interface according to the similarity.
- The device of claim 20, wherein the matching module comprises: a first extraction unit, configured to extract a first key feature from the first information; a second extraction unit, configured to extract a second key feature from the second information; and a similarity matching unit, configured to match the first key feature and the second key feature to determine the similarity between the second information and the first information.
- The device of claim 22, further comprising: a first vectorization module, configured to vectorize the first key feature based on a trained first comparison model to obtain a first vector result; and a second vectorization module, configured to vectorize the second key feature based on a trained second comparison model to obtain a second vector result; wherein the matching module is configured to match the first vector result and the second vector result to determine the similarity between the second information and the first information.
- The device of claim 22, wherein the first extraction unit comprises: a first processing subunit, configured to process the first information based on a preset rule to obtain a first processing result, and to use the first processing result as the first key feature.
- The device of claim 24, wherein the first processing subunit is configured to process the first information based on the preset rule to obtain the first processing result in the following manner: acquiring at least one of text information, symbol information, and text structure information of the first information; and processing the first information based on the acquired information to obtain the first processing result.
- The device of claim 20, wherein the second acquisition module comprises: a receiving unit, configured to receive retrieval information based on the first text; and a retrieval unit, configured to retrieve, from a database and based on the retrieval information, the second text similar to the first text.
- The device of claim 20, further comprising: a chapter selection module, configured to receive chapter selection information for the second text, and to extract a corresponding chapter as the second information based on the chapter selection information.
- The device of claim 21, further comprising: a sorting module, configured to sort a plurality of pieces of second information according to the similarity.
- The device of claim 21, wherein the navigation browsing interface further includes a similarity identifier, and the display unit comprises: a highlighting unit, configured to highlight similar parts of the first information and the second information on the navigation browsing interface.
- A server, comprising: at least one processor; and a storage device, configured to store at least one program; wherein, when the at least one program is executed by the at least one processor, the at least one processor implements the method for navigating and browsing text information according to any one of claims 1-19.
- A computer-readable storage medium storing a computer program, wherein, when the program is executed by a processor, the method for navigating and browsing text information according to any one of claims 1-19 is implemented.
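The rule-based key feature extraction recited in the claims (splitting around semicolons and enumeration commas, and taking the text before and after stop words) might look like the following minimal sketch. The stop-word list and function names are hypothetical; the source does not specify which stop words are used or how "related information" is delimited.

```python
import re

# Hypothetical stop-word list; the source does not give one.
STOP_WORDS = {"the", "a", "an", "of", "and", "wherein", "is", "to"}


def split_on_symbols(first_information):
    """Return the segments found before and after each semicolon
    (';' or '；') or enumeration comma ('、')."""
    parts = re.split(r"[;；、]", first_information)
    return [part.strip() for part in parts if part.strip()]


def split_on_stop_words(first_information):
    """Return the runs of words that sit between stop words,
    treating each run as a candidate key feature."""
    segments, current = [], []
    for word in re.findall(r"\w+", first_information.lower()):
        if word in STOP_WORDS:
            if current:
                segments.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        segments.append(" ".join(current))
    return segments


symbol_features = split_on_symbols("a processor; a storage device、a bus")
stop_word_features = split_on_stop_words(
    "The bus connects the processor and the storage device"
)
```

Either list of segments could serve as the "first processing result" that is then vectorized and matched against features of the second information.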
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910816838.0 | 2019-08-30 | ||
CN201910816838.0A CN112445891A (en) | 2019-08-30 | 2019-08-30 | Text information navigation browsing method, device, server and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021037012A1 (en) | 2021-03-04 |
Family
ID=74684562
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/110994 WO2021037012A1 (en) | 2019-08-30 | 2020-08-25 | Text information navigation and browsing method, apparatus, server and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112445891A (en) |
WO (1) | WO2021037012A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102789452A (en) * | 2011-05-16 | 2012-11-21 | 株式会社日立制作所 | Similar content extraction method |
US20130054612A1 (en) * | 2006-10-10 | 2013-02-28 | Abbyy Software Ltd. | Universal Document Similarity |
US9852337B1 (en) * | 2015-09-30 | 2017-12-26 | Open Text Corporation | Method and system for assessing similarity of documents |
CN108763486A (en) * | 2018-05-30 | 2018-11-06 | 湖南写邦科技有限公司 | Paper duplicate checking method, terminal and storage medium based on terminal |
CN110162630A (en) * | 2019-05-09 | 2019-08-23 | 深圳市腾讯信息技术有限公司 | A kind of method, device and equipment of text duplicate removal |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108920633B (en) * | 2018-07-01 | 2021-12-03 | 湖北通远格知科技有限公司 | Paper similarity detection method |
CN109408826A (en) * | 2018-11-07 | 2019-03-01 | 北京锐安科技有限公司 | A kind of text information extracting method, device, server and storage medium |
- 2019
  - 2019-08-30 CN CN201910816838.0A patent/CN112445891A/en active Pending
- 2020
  - 2020-08-25 WO PCT/CN2020/110994 patent/WO2021037012A1/en active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130054612A1 (en) * | 2006-10-10 | 2013-02-28 | Abbyy Software Ltd. | Universal Document Similarity |
CN102789452A (en) * | 2011-05-16 | 2012-11-21 | 株式会社日立制作所 | Similar content extraction method |
US9852337B1 (en) * | 2015-09-30 | 2017-12-26 | Open Text Corporation | Method and system for assessing similarity of documents |
CN108763486A (en) * | 2018-05-30 | 2018-11-06 | 湖南写邦科技有限公司 | Paper duplicate checking method, terminal and storage medium based on terminal |
CN110162630A (en) * | 2019-05-09 | 2019-08-23 | 深圳市腾讯信息技术有限公司 | A kind of method, device and equipment of text duplicate removal |
Also Published As
Publication number | Publication date |
---|---|
CN112445891A (en) | 2021-03-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108875067B (en) | Text data classification method, device, equipment and storage medium | |
US10657325B2 (en) | Method for parsing query based on artificial intelligence and computer device | |
WO2021017721A1 (en) | Intelligent question answering method and apparatus, medium and electronic device | |
CN108052577B (en) | Universal text content mining method, device, server and storage medium | |
CN107992596B (en) | Text clustering method, text clustering device, server and storage medium | |
US9569506B2 (en) | Uniform search, navigation and combination of heterogeneous data | |
CN108549656B (en) | Statement analysis method and device, computer equipment and readable medium | |
CN110390054B (en) | Interest point recall method, device, server and storage medium | |
US20180341866A1 (en) | Method of building a sorting model, and application method and apparatus based on the model | |
CN106951503B (en) | Information providing method, device, equipment and storage medium | |
US20210358570A1 (en) | Method and system for claim scope labeling, retrieval and information labeling of gene sequence | |
CN110543592A (en) | Information searching method and device and computer equipment | |
WO2020232898A1 (en) | Text classification method and apparatus, electronic device and computer non-volatile readable storage medium | |
JP2020149686A (en) | Image processing method, device, server, and storage medium | |
WO2023024975A1 (en) | Text processing method and apparatus, and electronic device | |
US9436891B2 (en) | Discriminating synonymous expressions using images | |
CN112989010A (en) | Data query method, data query device and electronic equipment | |
CN107861948B (en) | Label extraction method, device, equipment and medium | |
KR20120047622A (en) | System and method for managing digital contents | |
CN114116997A (en) | Knowledge question answering method, knowledge question answering device, electronic equipment and storage medium | |
CN111563172B (en) | Academic hot spot trend prediction method and device based on dynamic knowledge graph construction | |
CN116383412B (en) | Functional point amplification method and system based on knowledge graph | |
WO2019071907A1 (en) | Method for identifying help information based on operation page, and application server | |
WO2021037012A1 (en) | Text information navigation and browsing method, apparatus, server and storage medium | |
CN117011581A (en) | Image recognition method, medium, device and computing equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20857281; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20857281; Country of ref document: EP; Kind code of ref document: A1 |
| | 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.08.2022) |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20857281; Country of ref document: EP; Kind code of ref document: A1 |