CN114385776A - Information positioning method, storage medium and device - Google Patents


Publication number
CN114385776A
CN114385776A
Authority
CN
China
Prior art keywords
chapter
tree
predicted
node
aggregation
Prior art date
Legal status
Pending
Application number
CN202111642081.1A
Other languages
Chinese (zh)
Inventor
李卫东
陈成帅
张强
鲍捷
Current Assignee
Hefei Wenyin Internet Technology Co ltd
Memect Technology Co ltd
Original Assignee
Hefei Wenyin Internet Technology Co ltd
Memect Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hefei Wenyin Internet Technology Co ltd, Memect Technology Co ltd filed Critical Hefei Wenyin Internet Technology Co ltd
Publication of CN114385776A publication Critical patent/CN114385776A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/322Trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an information positioning method, a storage medium and a device. The method includes: obtaining a document to be predicted; performing chapter analysis on the document to be predicted to obtain a corresponding first single-text chapter tree; acquiring a pre-trained aggregation tree, in which at least one aggregation node has corresponding label information; and fusing the first single-text chapter tree with the aggregation tree by using a preset clustering algorithm, so that the label information of the aggregation tree is assigned to the first single-text chapter tree and each node to be predicted in the first single-text chapter tree obtains label information for locating the chapter position of the chapter to be predicted. Fusing the first single-text chapter tree into the aggregation tree thus achieves accurate chapter positioning. By labeling only a small number of documents and constructing the aggregation tree from the labeled documents together with unlabeled data, the number of documents to be labeled can be reduced, document labeling cost and resources are saved, and information positioning efficiency is improved.

Description

Information positioning method, storage medium and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to an information positioning method, a storage medium, and an apparatus.
Background
Different documents have different chapter structures and expression wordings. As the length of a document increases, the need for users to quickly locate a chapter from the document increases.
The conventional chapter positioning method works as follows: a large number of document samples are labeled to obtain label information. When information positioning is performed on a document, chapter positions matching the label information input by the user are searched for in the document, yielding a chapter positioning result.
However, improving the accuracy of chapter positioning requires labeling a large number of documents: the more patterns observed, the more accurate the positioning. A larger labeling workload, though, tends to mean less efficient chapter positioning, and large-scale labeled data is difficult to obtain in many scenarios.
Disclosure of Invention
The embodiments of the disclosure provide an information positioning method, a storage medium and a device, so as to at least solve the technical problems in the prior art that a large amount of document labeling is required during information positioning and that large-scale labeled data is difficult to obtain.
According to an aspect of the embodiments of the present disclosure, there is provided an information positioning method, including:
acquiring a document to be predicted;
performing chapter analysis on the document to be predicted to obtain a first single-text chapter tree corresponding to the document to be predicted; wherein the first single-text chapter tree comprises at least one node to be predicted, each node to be predicted is used for storing chapter information of a corresponding chapter to be predicted, a parent node of a node to be predicted is used for storing chapter information of the chapter one level above that node's chapter, and a child node of a node to be predicted is used for storing chapter information of a chapter one level below that node's chapter;
acquiring a pre-trained aggregation tree; the aggregation tree is obtained by fusing a second single text chapter tree corresponding to the unmarked document and a third single text chapter tree corresponding to the marked document by using a preset clustering algorithm; the preset clustering algorithm is used for clustering nodes with the same node characteristics to obtain aggregation nodes; at least one aggregation node in the aggregation tree has corresponding label information; and
and fusing the first single-text chapter tree with the aggregation tree by using the preset clustering algorithm, so as to assign the label information of the aggregation tree to the first single-text chapter tree and obtain the label information of each node to be predicted in the first single-text chapter tree, wherein the label information is used for locating the chapter position of the chapter to be predicted.
According to another aspect of the embodiments of the present disclosure, there is also provided a storage medium comprising a stored program, wherein when the program is executed by a processor, any one of the methods described above is performed.
According to another aspect of the embodiments of the present disclosure, there is also provided an information positioning apparatus including:
the document acquisition module is used for acquiring a document to be predicted;
the chapter analysis module is used for performing chapter analysis on the document to be predicted to obtain a first single-text chapter tree corresponding to the document to be predicted; wherein the first single-text chapter tree comprises at least one node to be predicted, each node to be predicted is used for storing chapter information of a corresponding chapter to be predicted, a parent node of a node to be predicted is used for storing chapter information of the chapter one level above that node's chapter, and a child node of a node to be predicted is used for storing chapter information of a chapter one level below that node's chapter;
the aggregation tree acquisition module is used for acquiring a pre-trained aggregation tree; the aggregation tree is obtained by fusing a second single text chapter tree corresponding to the unmarked document and a third single text chapter tree corresponding to the marked document by using a preset clustering algorithm; the preset clustering algorithm is used for clustering nodes with the same node characteristics to obtain aggregation nodes; at least one aggregation node in the aggregation tree has corresponding label information; and
and the tree fusion module is used for fusing the first single-text chapter tree with the aggregation tree by using the preset clustering algorithm, so as to assign the label information of the aggregation tree to the first single-text chapter tree and obtain the label information of each node to be predicted in the first single-text chapter tree, wherein the label information is used for locating the chapter position of the chapter to be predicted.
According to another aspect of the embodiments of the present disclosure, there is also provided an information positioning apparatus including: a processor; and
a memory coupled to the processor for providing instructions to the processor for processing the following processing steps:
acquiring a document to be predicted;
performing chapter analysis on the document to be predicted to obtain a first single-text chapter tree corresponding to the document to be predicted; wherein the first single-text chapter tree comprises at least one node to be predicted, each node to be predicted is used for storing chapter information of a corresponding chapter to be predicted, a parent node of a node to be predicted is used for storing chapter information of the chapter one level above that node's chapter, and a child node of a node to be predicted is used for storing chapter information of a chapter one level below that node's chapter;
acquiring a pre-trained aggregation tree; the aggregation tree is obtained by fusing a second single text chapter tree corresponding to the unmarked document and a third single text chapter tree corresponding to the marked document by using a preset clustering algorithm; the preset clustering algorithm is used for clustering nodes with the same node characteristics to obtain aggregation nodes; at least one aggregation node in the aggregation tree has corresponding label information; and
and fusing the first single-text chapter tree with the aggregation tree by using the preset clustering algorithm, so as to assign the label information of the aggregation tree to the first single-text chapter tree and obtain the label information of each node to be predicted in the first single-text chapter tree, wherein the label information is used for locating the chapter position of the chapter to be predicted.
In the embodiment of the disclosure, chapter analysis is performed on a document to be predicted to obtain a corresponding first single-text chapter tree; a pre-trained aggregation tree is acquired, in which at least one aggregation node has corresponding label information; and the first single-text chapter tree is fused with the aggregation tree by using a preset clustering algorithm, so that the label information of the aggregation tree is assigned to the first single-text chapter tree and each node to be predicted obtains label information for locating the chapter position of the chapter to be predicted. Because the hierarchical relationship among document chapters can be represented in the aggregation tree in the form of parent-child nodes, and the preset clustering algorithm can judge the identity of nodes, fusing the first single-text chapter tree into the aggregation tree achieves accurate chapter positioning. At the same time, only a small number of documents are labeled, and the aggregation tree is constructed from the labeled documents together with the unlabeled data, so the number of documents to be labeled is reduced, document labeling cost and resources are saved, and information positioning efficiency is improved. This solves the technical problems in the prior art that a large amount of document labeling is required during information positioning and that large-scale labeled data is difficult to obtain.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:
fig. 1 is a hardware block diagram of a computing device for implementing the method according to embodiment 1 of the present disclosure;
fig. 2 is a schematic flow chart of an information positioning method according to the first aspect of embodiment 1 of the present disclosure;
fig. 3 is a schematic diagram of a process of constructing and applying an aggregation tree according to a first aspect of embodiment 1 of the present disclosure;
fig. 4 is a schematic diagram of a pre-training process of an aggregation tree according to a first aspect of embodiment 1 of the present disclosure;
fig. 5 is a schematic diagram of a training process of an aggregation tree according to a first aspect of embodiment 1 of the present disclosure;
fig. 6 is a schematic diagram of a prediction process of an aggregation tree according to a first aspect of embodiment 1 of the present disclosure;
fig. 7 is a schematic diagram of an information locating device according to a first aspect of embodiment 2 of the present disclosure; and
fig. 8 is a schematic diagram of an information locating apparatus according to the first aspect of embodiment 3 of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure. It is to be understood that the described embodiments are merely some, and not all, of the embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the nouns or terms appearing in the description of the embodiments of the present disclosure are applicable to the following explanations:
tree: a data structure is a set with a hierarchical relationship, which is formed by n (n is more than or equal to 1) finite nodes.
Node: a node consists of a data element and branches pointing to its subtrees.
Single-text chapter tree: a chapter tree built from a single document. The chapter tree comprises at least one node; each node stores the chapter information of a corresponding chapter, the parent node of a node stores the chapter information of the chapter one level above that node's chapter, and a child node of a node stores the chapter information of a chapter one level below that node's chapter.
Aggregation tree: an aggregated chapter tree fused from multiple single-text chapter trees.
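The tree and node structures described in these terms can be sketched as a minimal Python data structure. This is an illustrative sketch only; the patent does not specify an implementation, and the class and field names here are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ChapterNode:
    # Each node stores the chapter information of one chapter.
    title: str
    text: str = ""
    label: Optional[str] = None  # label information (labeled documents only)
    children: List["ChapterNode"] = field(default_factory=list)

    def add_child(self, child: "ChapterNode") -> "ChapterNode":
        # A child stores a chapter one level below this node's chapter;
        # this node is then the child's parent in the single-text chapter tree.
        self.children.append(child)
        return child

# A single-text chapter tree built from one document:
root = ChapterNode("Document")
sec_a = root.add_child(ChapterNode("First-level chapter A"))
sec_a.add_child(ChapterNode("Second-level chapter A.1"))
```

The same shape serves for single-text chapter trees and, with labels attached to aggregation nodes, for the aggregation tree.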
Example 1
In accordance with the present embodiment, there is provided an embodiment of an information positioning method. It should be noted that the steps illustrated in the flowchart of the figure may be performed in a computer system, such as by a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from the one here.
The method embodiments provided by the present embodiment may be executed in a mobile terminal, a computer terminal, a server or a similar computing device. Fig. 1 shows a hardware block diagram of a computing device for implementing the information positioning method. As shown in fig. 1, the computing device may include one or more processors (which may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory for storing data, and a transmission device for communication functions. In addition, the computing device may also include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computing device may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
It should be noted that the one or more processors and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computing device. As used in the embodiments of the disclosure, the data processing circuitry may act as a kind of processor control (e.g., selection of a variable-resistance termination path connected to an interface).
The memory may be used to store software programs and modules of application software, such as the program instructions/data storage devices corresponding to the information positioning method in the embodiments of the present disclosure. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the information positioning method described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory located remotely from the processor, which may be connected to the computing device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device is used for receiving or transmitting data via a network. Specific examples of such networks may include wireless networks provided by the communication provider of the computing device. In one example, the transmission device includes a Network Interface Controller (NIC) that can be connected to other network devices through a base station to communicate with the internet. In another example, the transmission device may be a Radio Frequency (RF) module, which is used for communicating with the internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computing device.
It should be noted here that in some alternative embodiments, the computing device shown in fig. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that FIG. 1 is only one specific example and is intended to illustrate the types of components that may be present in the computing device described above.
According to a first aspect of the present embodiment, there is provided an information positioning method, and fig. 2 shows a flow chart of the method, and referring to fig. 2, the method includes:
s202: acquiring a document to be predicted;
s204: performing chapter analysis on a document to be predicted to obtain a first single text chapter tree corresponding to the document to be predicted; the first single text chapter tree comprises at least one node to be predicted, each node to be predicted is used for storing chapter information corresponding to a chapter to be predicted, a father node of the node to be predicted is used for storing chapter information of a previous chapter of the chapter to be predicted corresponding to the node to be predicted, and a child node of the node to be predicted is used for storing chapter information of a next chapter of the chapter corresponding to the node to be predicted;
s206: acquiring a pre-trained aggregation tree; and
s208: and fusing the first single-text chapter tree with the aggregation tree by using a preset clustering algorithm so as to endow the tag information of the aggregation tree to the first single-text chapter tree, and obtaining the tag information of each node to be predicted in the first single-text chapter tree, wherein the tag information is used for positioning the chapter position of the chapter to be predicted.
The document to be predicted comprises at least one level of section, and the label information in the aggregation tree is matched with the at least one level of section in the document to be predicted.
Specifically, the aggregation tree is obtained by fusing a second single-text chapter tree corresponding to the unlabeled document and a third single-text chapter tree corresponding to the labeled document by using a preset clustering algorithm. The preset clustering algorithm is used for clustering nodes with the same node characteristics to obtain aggregation nodes; at least one aggregation node in the aggregation tree has corresponding label information.
In this embodiment, a document is converted into its corresponding single-text chapter tree (including the first, second, and third single-text chapter trees) by parsing the document with a document parsing tool. Document parsing tools include, but are not limited to, a PDF chapter-extraction tool, a Word chapter-extraction tool, or the like; this embodiment does not limit the implementation of the document parsing tool.
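Whatever extraction tool is used, the conversion from a flat list of headings into a single-text chapter tree can be sketched with a simple stack fold. This is illustrative only; the (level, title) pairs stand in for whatever the chosen PDF/Word extraction tool actually returns.

```python
def build_chapter_tree(headings):
    """Fold a flat list of (level, title) pairs into nested
    {"title": ..., "children": [...]} dicts forming a chapter tree."""
    root = {"title": "ROOT", "children": []}
    stack = [(0, root)]  # (level, node) path from the root to the last node
    for level, title in headings:
        node = {"title": title, "children": []}
        while stack and stack[-1][0] >= level:  # pop back up to the parent level
            stack.pop()
        stack[-1][1]["children"].append(node)   # attach under the nearest ancestor
        stack.append((level, node))
    return root

tree = build_chapter_tree([(1, "A"), (2, "A.1"), (2, "A.2"), (1, "B")])
```

Each node thus ends up with its parent holding the chapter one level above and its children holding the chapters one level below, matching the single-text chapter tree defined earlier.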
Referring to fig. 3, the second single-text chapter trees corresponding to the unlabeled documents are used for pre-training the aggregation tree, and the third single-text chapter trees corresponding to the labeled documents are used for training the aggregation tree. In other words, aggregation tree generation and application in the present embodiment comprises three stages: pre-training 31, training 32, and prediction 33. After the pre-training 31 and training 32 stages, the trained aggregation tree is obtained.
Optionally, the unlabeled documents do not need to be labeled in advance. In this embodiment, the unlabeled documents are a corpus of documents of the same type that can be collected. In order for the aggregation tree to cover as many different chapter structures and expression wordings as possible, the larger the corpus of unlabeled documents, the better.
In this embodiment, each unlabeled document is parsed into a chapter tree structure, that is, each unlabeled document is converted into a corresponding second single-text chapter tree.
Optionally, the labeled documents need to be annotated in advance. In this embodiment, annotating a document means annotating each chapter of the document with its corresponding label information.
Illustratively, the label information may be a semantic label based on the chapter, or a preset classification label; this embodiment does not limit the manner of labeling.
Optionally, label information is attached to the last-level child nodes in the single-text chapter tree and the aggregation tree; alternatively, nodes at any level in the single-text chapter tree and the aggregation tree carry label information. In the latter case, the label information is set according to the level of the node, and the label information of an upper-level node subsumes the label information of its lower-level nodes. For example: if the label information of a node is "personnel statistics", the label information of its child nodes may include "female statistics" and "male statistics".
In this embodiment, the second number of labeled documents is smaller than the first number of unlabeled documents. In other words, the aggregation tree can be constructed by labeling only a small number of sample documents and fusing the labeled documents with the unlabeled documents; the aggregation tree then achieves accurate chapter positioning through node traversal and chapter-identity judgment. At the same time, because prediction uses tree-based traversal, it runs with high efficiency.
Optionally, obtaining a pre-trained aggregation tree comprises: converting the first number of unlabeled documents into second single-text chapter trees; fusing the second single-text chapter trees by using the preset clustering algorithm to obtain a pre-trained aggregation tree; converting the second number of labeled documents into third single-text chapter trees; and fusing the third single-text chapter trees into the pre-trained aggregation tree by using the preset clustering algorithm, so as to assign the label information of the third single-text chapter trees to the pre-trained aggregation tree and obtain the trained aggregation tree.
Wherein the nodes in the third single-text chapter tree include label information.
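These steps center on one fusion operation. A minimal sketch follows, assuming trees are nested dicts and using a caller-supplied `same(a, b)` predicate as a stand-in for the preset clustering algorithm's node-identity test; none of these names come from the patent.

```python
def fuse(agg, tree, same):
    """Merge a single-text chapter tree into the aggregation tree.

    `same(agg_node, tree_node)` stands in for the clustering algorithm's
    node-identity test (hierarchy + expression features). Labels carried
    by labeled documents attach to the matched aggregation nodes."""
    for child in tree["children"]:
        match = next((a for a in agg["children"] if same(a, child)), None)
        if match is None:  # no identical node yet: start a new aggregation node
            match = {"title": child["title"], "labels": set(), "children": []}
            agg["children"].append(match)
        if child.get("label"):
            match["labels"].add(child["label"])
        fuse(match, child, same)  # recurse level by level
    return agg

same_title = lambda a, b: a["title"] == b["title"]  # toy identity test
agg = {"title": "ROOT", "labels": set(), "children": []}

# Pre-training: fuse an unlabeled tree; training: fuse a labeled tree.
fuse(agg, {"children": [{"title": "First-level chapter A", "children": []}]},
     same_title)
fuse(agg, {"children": [{"title": "First-level chapter A", "children": [],
                         "label": "personnel statistics"}]}, same_title)
```

Identical chapters from different documents collapse into one aggregation node, which is how a few labeled trees can label a structure built mostly from unlabeled data.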
Optionally, the node features include a hierarchical-structure feature and an expression feature. The hierarchical-structure feature indicates the position of the node in the tree it belongs to; during fusion, nodes at the same hierarchical position in different trees can therefore be fused, ensuring that the chapter information in the fused nodes has the same structure. The expression feature indicates how the chapter information is worded; during fusion, nodes with the same wording in different trees can be fused, ensuring that the chapter information in the fused nodes is expressed in the same way.
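The two feature types combine conjunctively in the identity test. The sketch below is an assumption about how that combination could look: depth stands in for the hierarchical-structure feature, and a toy token-overlap score stands in for the WMD-based expression similarity.

```python
def node_identity(a, b, text_sim, threshold=0.8):
    """Two nodes cluster together only when both feature types agree:
    the hierarchy feature (reduced here to node depth) matches, and the
    expression feature is close enough under `text_sim`."""
    same_hierarchy = a["depth"] == b["depth"]
    same_expression = text_sim(a["title"], b["title"]) >= threshold
    return same_hierarchy and same_expression

def overlap(x, y):
    # Toy token-overlap similarity, used only for illustration.
    xs, ys = set(x.split()), set(y.split())
    return len(xs & ys) / len(xs | ys)
```

With this test, chapters worded identically but sitting at different levels are kept apart, matching the requirement that fused nodes share both structure and wording.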
In one example, the expression features include textual features; correspondingly, the preset clustering algorithm uses the Word Mover's Distance (WMD) algorithm to calculate a first distance between the textual features of the chapter information in different nodes, and the textual features of the chapter information in different nodes are considered the same when the first distance is smaller than a first threshold.
WMD algorithm: the method is a measurement algorithm developed on the basis of word vectors and used for measuring the similarity of the documents. The WMD algorithm is a method for calculating the distance between sentences, and the smaller the calculated distance is, the higher the similarity is.
The word vectors may be computed with Word2Vec or another word-vector extraction algorithm. In this embodiment, in order to improve the accuracy of expression extraction, in another example the expression features include semantic features; correspondingly, the preset clustering algorithm combines a word embedding model with the WMD algorithm: the word embedding model extracts the semantic features of the chapter information in different nodes, the WMD algorithm calculates a second distance between the semantic features of different nodes, and the semantic features of the chapter information in different nodes are considered the same when the second distance is smaller than a second threshold.
In this example, the word embedding model is trained using the full amount of data collected in advance as pre-training data. Compared with traditional word-vector generation algorithms, the word embedding model produces dense word-vector representations, which avoids vector sparsity during training; at the same time, word embeddings better capture the semantic information of words and the relations between them, and generalize better semantically.
At the same time, compared with the edit distance, the Longest Common Subsequence (LCS), or the Jaccard distance, the preset clustering algorithm also uses semantic features, and therefore has stronger generalization ability and higher accuracy.
The first threshold and the second threshold may be the same or different; this embodiment does not limit their values.
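The WMD comparison on semantic features can be illustrated with its common relaxed lower bound, in which every word of one chapter title simply moves to its nearest neighbour in the other. The 2-D vectors below are toy stand-ins for a trained word embedding model, and exact WMD would instead solve a full optimal-transport problem.

```python
import math

def relaxed_wmd(words_a, words_b, vec):
    # Relaxed WMD lower bound: each word in words_a moves wholesale to
    # its nearest word in words_b; average the travelled distances.
    def dist(u, v):
        return math.dist(vec[u], vec[v])
    return sum(min(dist(a, b) for b in words_b) for a in words_a) / len(words_a)

# Toy 2-D "embeddings"; a real system would use trained word vectors.
vec = {"staff": (1.0, 0.0), "personnel": (0.9, 0.1),
       "count": (0.0, 1.0), "statistics": (0.1, 0.9)}

d = relaxed_wmd(["staff", "count"], ["personnel", "statistics"], vec)
# Differently worded but semantically close titles yield a small distance,
# so the nodes would be judged identical when d falls below the threshold.
```

This is exactly the behaviour the embodiment relies on: titles with no shared tokens can still cluster together when their embeddings are close.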
For example, refer to the pre-training process shown in fig. 4. First, each unlabeled document is converted into a second single-text chapter tree 41; each second single-text chapter tree 41 represents one unlabeled document. Each node in a second single-text chapter tree stores the chapter information of a corresponding chapter; the parent node stores the chapter information of the chapter one level above, and a child node stores the chapter information of a chapter one level below. The second single-text chapter trees are fused using the preset clustering algorithm to obtain a pre-trained aggregation tree 42. A node in the pre-trained aggregation tree 42 stores chapter information with the same hierarchical-structure features and the same expression features; for example, the child node "first-level chapter A" in the pre-trained aggregation tree 42 includes the chapter information of "first-level chapter A: A1" and "first-level chapter A: A2" from the second single-text chapter trees 41.
In the above process, the nodes of the plurality of second single-text chapter trees are traversed in the pre-training stage; during the traversal, different nodes with the same node features are aggregated into a node cluster according to the preset clustering algorithm. The chapter information within one node cluster is at the same level and can be used to generate the label information of the aggregation nodes in the aggregation tree. The richer the pre-training corpus, the richer the information contained in the generated aggregation tree and the more complete its structure.
Thereafter, refer to the training process shown in fig. 5. Each labeled document is converted into a third single-text chapter tree 51 whose nodes include the labels of the lowest-level nodes. The pre-training aggregation tree 52 generated in the pre-training stage is traversed together with the third single-text chapter tree 51 carrying label information, and the third single-text chapter tree 51 is merged into the pre-training aggregation tree 52 by computing chapter identity between nodes; the merged nodes in the pre-training aggregation tree 52 thus carry the corresponding label information. By merging a plurality of labeled third single-text chapter trees 51 into the pre-training aggregation tree 52, the resulting aggregation tree 53 contains rich label information, so accurate chapter positioning can be achieved.
Finally, refer to the prediction process shown in fig. 6. The document to be predicted is first parsed into a first chapter tree structure 61. Following the construction process of the aggregation tree, the first chapter tree structure 61 is fused into the nodes of the aggregation tree 62; this fusion is exactly the process of labeling the first chapter tree 61. After the fusion, the chapter positions of the document to be predicted can be accurately located according to the label information 63 of the aggregation nodes in the aggregation tree 62, providing sufficient preparation for subsequent information extraction.
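The pre-training, training, and prediction stages described above can be sketched with one fusion routine. This is a minimal, hypothetical illustration: node identity is simplified to exact title equality, whereas the patent compares hierarchical-structure features and WMD distances over text or semantic features; all class and variable names are invented for the example.

```python
class Node:
    """One chapter node of a single-text chapter tree or aggregation tree."""
    def __init__(self, title, label=None):
        self.title = title      # chapter information, simplified to a title
        self.label = label      # label information, if any
        self.children = {}      # child title -> Node

    def add_path(self, titles, label=None):
        """Insert a root-to-leaf chapter path; return the leaf node."""
        node = self
        for t in titles:
            node = node.children.setdefault(t, Node(t))
        if label is not None:
            node.label = label
        return node

def fuse(agg, doc):
    """Merge single-text chapter tree `doc` into aggregation tree `agg`,
    copying labels in whichever direction they exist."""
    for title, child in doc.children.items():
        match = agg.children.get(title)   # identity test (simplified)
        if match is None:
            agg.children[title] = child   # new branch joins the tree
        else:
            if match.label is None:
                match.label = child.label     # training: doc labels the tree
            elif child.label is None:
                child.label = match.label     # prediction: tree labels the doc
            fuse(match, child)

# Pre-training: an unlabeled document seeds the aggregation tree.
agg = Node("root")
agg.add_path(["First-level chapter A", "Second-level chapter a"])

# Training: a labeled document contributes its label information.
labeled = Node("root")
labeled.add_path(["First-level chapter A", "Second-level chapter a"],
                 label="FINANCIALS")
fuse(agg, labeled)

# Prediction: the document to be predicted receives the label back.
to_predict = Node("root")
target = to_predict.add_path(["First-level chapter A", "Second-level chapter a"])
fuse(agg, to_predict)
print(target.label)  # FINANCIALS
```

Because prediction is a single tree traversal against the aggregation tree, the runtime grows with tree depth and fan-out rather than with the size of the training corpus, which matches the efficiency argument made in this embodiment.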
Optionally, after step 208, the method further includes: acquiring target label information; determining, from the nodes to be predicted of the first single-text chapter tree, a target node whose label information is consistent with the target label information; and performing information extraction on the chapter information stored in the target node of the document to be predicted.
Text information positioning is an important link in information extraction technology and a key step of label-learning-based information extraction. On the one hand, information positioning improves extraction efficiency by narrowing the range of information extraction, which is particularly important for long texts; on the other hand, applying the macro context during information positioning improves the accuracy of the final result. For example, in named-entity extraction, a large number of names, times, and the like can be extracted across the whole text; without information positioning, wrong results are easily extracted.
In this embodiment, after a specific chapter is located, information extraction and analysis can be further performed on it; since the search range is limited, the extraction accuracy can be improved to a great extent.
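Once every node to be predicted carries label information, locating the target chapter reduces to a tree search. The sketch below is a hedged illustration with an invented dictionary layout and invented labels; the patent does not prescribe a concrete data structure.

```python
def find_by_label(node, target_label):
    """Depth-first search for the node whose label matches target_label."""
    if node.get("label") == target_label:
        return node
    for child in node.get("children", []):
        hit = find_by_label(child, target_label)
        if hit is not None:
            return hit
    return None

# Hypothetical first single-text chapter tree after fusion with the
# aggregation tree (labels already assigned to each node to be predicted).
doc_tree = {
    "title": "root", "label": None,
    "children": [
        {"title": "First-level chapter A", "label": "OVERVIEW",
         "text": "Company overview ...", "children": []},
        {"title": "First-level chapter B", "label": "RISKS",
         "text": "Risk factors ...", "children": []},
    ],
}

node = find_by_label(doc_tree, "RISKS")
print(node["text"])  # chapter information handed to downstream extraction
```

Downstream information extraction then runs only over `node["text"]`, which is how limiting the search range improves extraction accuracy for long documents.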
Thus, according to the first aspect of the present embodiment, the method obtains a document to be predicted; performs chapter analysis on it to obtain a corresponding first single-text chapter tree; acquires a pre-trained aggregation tree in which at least one aggregation node has corresponding label information; and fuses the first single-text chapter tree with the aggregation tree using the preset clustering algorithm, so that the label information of the aggregation tree is assigned to the first single-text chapter tree and each node to be predicted obtains label information for locating the chapter position of the chapter to be predicted. This solves the prior-art problems that information positioning requires a large amount of document labeling and that large-scale labeled data is difficult to obtain. Because the hierarchical relations among document chapters are represented as parent-child nodes in the aggregation tree and the preset clustering algorithm can judge node identity, accurate chapter positioning is achieved by fusing the first single-text chapter tree into the aggregation tree. Meanwhile, since prediction uses tree-based traversal, the operation efficiency is high.
In addition, by labeling only a small number of documents and constructing the aggregation tree from both labeled documents and unlabeled data, the number of documents to be labeled is reduced, labeling cost and resources are saved, and information-positioning efficiency is improved.
In addition, the preset clustering algorithm comprises a word embedding model and the WMD algorithm. Compared with traditional word-vector generation algorithms, the word embedding model produces dense word-vector representations, avoiding the vector-sparsity problem during training; word embeddings also better capture the semantic information of words and the relations among them, and generalize better semantically. Moreover, compared with edit distance, longest common subsequence (LCS), or Jaccard distance, the preset clustering algorithm uses semantic features and therefore has stronger generalization capability and higher accuracy.
In addition, since information extraction and analysis are performed only after the specified chapter has been located, the search range is limited and the extraction accuracy is improved.
Further, referring to fig. 1, according to a second aspect of the present embodiment, a storage medium is provided. The storage medium comprises a stored program, wherein, when the program is run, a processor performs any of the methods described above.
Thus, according to the present embodiment, the program obtains a document to be predicted; performs chapter analysis on it to obtain a corresponding first single-text chapter tree; acquires a pre-trained aggregation tree in which at least one aggregation node has corresponding label information; and fuses the first single-text chapter tree with the aggregation tree using the preset clustering algorithm, so that the label information of the aggregation tree is assigned to the first single-text chapter tree and each node to be predicted obtains label information for locating the chapter position of the chapter to be predicted. This solves the prior-art problems that information positioning requires a large amount of document labeling and that large-scale labeled data is difficult to obtain. Because the hierarchical relations among document chapters are represented as parent-child nodes in the aggregation tree and the preset clustering algorithm can judge node identity, accurate chapter positioning is achieved by fusing the first single-text chapter tree into the aggregation tree. Meanwhile, since prediction uses tree-based traversal, the operation efficiency is high.

In addition, by labeling only a small number of documents and constructing the aggregation tree from both labeled documents and unlabeled data, the number of documents to be labeled is reduced, labeling cost and resources are saved, and information-positioning efficiency is improved.

In addition, the preset clustering algorithm comprises a word embedding model and the WMD algorithm. Compared with traditional word-vector generation algorithms, the word embedding model produces dense word-vector representations, avoiding the vector-sparsity problem during training; word embeddings also better capture the semantic information of words and the relations among them, and generalize better semantically. Moreover, compared with edit distance, longest common subsequence (LCS), or Jaccard distance, the preset clustering algorithm uses semantic features and therefore has stronger generalization capability and higher accuracy.

In addition, since information extraction and analysis are performed only after the specified chapter has been located, the search range is limited and the extraction accuracy is improved.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
Fig. 7 shows an information locating device 700 according to the first aspect of the present embodiment, which device 700 corresponds to the method according to the first aspect of embodiment 1. Referring to fig. 7, the apparatus 700 includes:
a document obtaining module 710, configured to obtain a document to be predicted;
the chapter analysis module 720 is configured to perform chapter analysis on the document to be predicted to obtain a first single-text chapter tree corresponding to the document to be predicted; the first single-text chapter tree comprises at least one node to be predicted, each node to be predicted stores the chapter information of a corresponding chapter to be predicted, a parent node of the node to be predicted stores the chapter information of the upper-level chapter of the chapter to be predicted, and a child node of the node to be predicted stores the chapter information of a lower-level chapter of the chapter to be predicted;
an aggregation tree obtaining module 730, configured to obtain a pre-trained aggregation tree; the aggregation tree is obtained by fusing a second single text chapter tree corresponding to the unmarked document and a third single text chapter tree corresponding to the marked document by using a preset clustering algorithm; the preset clustering algorithm is used for clustering nodes with the same node characteristics to obtain aggregation nodes; at least one aggregation node in the aggregation tree has corresponding label information; and
a tree fusion module 740, configured to fuse the first single-text chapter tree with the aggregation tree by using the preset clustering algorithm, so as to assign the label information of the aggregation tree to the first single-text chapter tree and obtain the label information of each node to be predicted in the first single-text chapter tree, wherein the label information is used to locate the chapter position of the chapter to be predicted.
Optionally, the aggregation tree obtaining module 730 is configured to:
converting a first number of unlabeled documents into the second single-text chapter tree;
fusing the second single text chapter tree by using a preset clustering algorithm to obtain a pre-training aggregation tree;
converting a second number of the labeled documents into a third single-text chapter tree, wherein nodes in the third single-text chapter tree comprise the label information, and the second number is smaller than the first number; and
fusing the third single-text chapter tree into the pre-training aggregation tree by using the preset clustering algorithm, so as to assign the label information of the third single-text chapter tree to the pre-training aggregation tree and obtain the trained aggregation tree.
Optionally, the node characteristics include: a hierarchy feature and an expression feature.
Optionally, the expression features comprise text features; correspondingly, the preset clustering algorithm comprises the Word Mover's Distance (WMD) algorithm to calculate a first distance between the text features of the chapter information in different nodes, and the text features of the chapter information in different nodes are considered the same when the first distance is smaller than a first threshold.
Optionally, the expression features comprise semantic features; correspondingly, the preset clustering algorithm comprises a word embedding model and the WMD algorithm: the word embedding model extracts the semantic features of the chapter information in different nodes, the WMD algorithm calculates a second distance between the semantic features of different nodes, and the semantic features of the chapter information in different nodes are considered the same when the second distance is smaller than a second threshold.
Optionally, after the fusing the first single-text chapter tree with the aggregation tree by using the preset clustering algorithm, the apparatus further includes:
the tag acquisition module is used for acquiring target tag information;
the node positioning module is used for determining a target node of which the label information is consistent with the target label information from the nodes to be predicted of the first single text chapter tree; and
an information extraction module, configured to perform information extraction on the chapter information stored in the target node of the document to be predicted.
Thus, according to the present embodiment, the apparatus obtains a document to be predicted; performs chapter analysis on it to obtain a corresponding first single-text chapter tree; acquires a pre-trained aggregation tree in which at least one aggregation node has corresponding label information; and fuses the first single-text chapter tree with the aggregation tree using the preset clustering algorithm, so that the label information of the aggregation tree is assigned to the first single-text chapter tree and each node to be predicted obtains label information for locating the chapter position of the chapter to be predicted. This solves the prior-art problems that information positioning requires a large amount of document labeling and that large-scale labeled data is difficult to obtain. Because the hierarchical relations among document chapters are represented as parent-child nodes in the aggregation tree and the preset clustering algorithm can judge node identity, accurate chapter positioning is achieved by fusing the first single-text chapter tree into the aggregation tree. Meanwhile, since prediction uses tree-based traversal, the operation efficiency is high.

In addition, by labeling only a small number of documents and constructing the aggregation tree from both labeled documents and unlabeled data, the number of documents to be labeled is reduced, labeling cost and resources are saved, and information-positioning efficiency is improved.

In addition, the preset clustering algorithm comprises a word embedding model and the WMD algorithm. Compared with traditional word-vector generation algorithms, the word embedding model produces dense word-vector representations, avoiding the vector-sparsity problem during training; word embeddings also better capture the semantic information of words and the relations among them, and generalize better semantically. Moreover, compared with edit distance, longest common subsequence (LCS), or Jaccard distance, the preset clustering algorithm uses semantic features and therefore has stronger generalization capability and higher accuracy.

In addition, since information extraction and analysis are performed only after the specified chapter has been located, the search range is limited and the extraction accuracy is improved.
Example 3
Fig. 8 shows an information locating device 800 according to the first aspect of the present embodiment, which corresponds to the method according to the first aspect of embodiment 1. Referring to fig. 8, the device 800 includes: a processor 810; and a memory 820 connected to the processor 810 and configured to provide the processor 810 with instructions for processing the following steps:
acquiring a document to be predicted;
performing chapter analysis on the document to be predicted to obtain a first single-text chapter tree corresponding to the document to be predicted; wherein the first single-text chapter tree comprises at least one node to be predicted, each node to be predicted stores the chapter information of a corresponding chapter to be predicted, a parent node of the node to be predicted stores the chapter information of the upper-level chapter of the chapter to be predicted, and a child node of the node to be predicted stores the chapter information of a lower-level chapter of the chapter to be predicted;
acquiring a pre-trained aggregation tree; the aggregation tree is obtained by fusing a second single text chapter tree corresponding to the unmarked document and a third single text chapter tree corresponding to the marked document by using a preset clustering algorithm; the preset clustering algorithm is used for clustering nodes with the same node characteristics to obtain aggregation nodes; at least one aggregation node in the aggregation tree has corresponding label information; and
fusing the first single-text chapter tree with the aggregation tree by using the preset clustering algorithm, so as to assign the label information of the aggregation tree to the first single-text chapter tree and obtain the label information of each node to be predicted in the first single-text chapter tree, wherein the label information is used to locate the chapter position of the chapter to be predicted.
Optionally, the obtaining a pre-trained aggregation tree includes:
converting a first number of unlabeled documents into the second single-text chapter tree;
fusing the second single text chapter tree by using a preset clustering algorithm to obtain a pre-training aggregation tree;
converting a second number of the labeled documents into a third single-text chapter tree, wherein nodes in the third single-text chapter tree comprise the label information, and the second number is smaller than the first number; and
fusing the third single-text chapter tree into the pre-training aggregation tree by using the preset clustering algorithm, so as to assign the label information of the third single-text chapter tree to the pre-training aggregation tree and obtain the trained aggregation tree.
Optionally, the node characteristics include: a hierarchy feature and an expression feature.
Optionally, the expression features comprise text features; correspondingly, the preset clustering algorithm comprises the Word Mover's Distance (WMD) algorithm to calculate a first distance between the text features of the chapter information in different nodes, and the text features of the chapter information in different nodes are considered the same when the first distance is smaller than a first threshold.
Optionally, the expression features comprise semantic features; correspondingly, the preset clustering algorithm comprises a word embedding model and the WMD algorithm: the word embedding model extracts the semantic features of the chapter information in different nodes, the WMD algorithm calculates a second distance between the semantic features of different nodes, and the semantic features of the chapter information in different nodes are considered the same when the second distance is smaller than a second threshold.
Optionally, after the fusing the first single-text chapter tree with the aggregation tree by using the preset clustering algorithm, the method further includes:
acquiring target label information;
determining a target node with label information consistent with the target label information from the nodes to be predicted of the first single text chapter tree; and
performing information extraction on the chapter information stored in the target node of the document to be predicted.
Thus, according to the present embodiment, the device obtains a document to be predicted; performs chapter analysis on it to obtain a corresponding first single-text chapter tree; acquires a pre-trained aggregation tree in which at least one aggregation node has corresponding label information; and fuses the first single-text chapter tree with the aggregation tree using the preset clustering algorithm, so that the label information of the aggregation tree is assigned to the first single-text chapter tree and each node to be predicted obtains label information for locating the chapter position of the chapter to be predicted. This solves the prior-art problems that information positioning requires a large amount of document labeling and that large-scale labeled data is difficult to obtain. Because the hierarchical relations among document chapters are represented as parent-child nodes in the aggregation tree and the preset clustering algorithm can judge node identity, accurate chapter positioning is achieved by fusing the first single-text chapter tree into the aggregation tree. Meanwhile, since prediction uses tree-based traversal, the operation efficiency is high.

In addition, by labeling only a small number of documents and constructing the aggregation tree from both labeled documents and unlabeled data, the number of documents to be labeled is reduced, labeling cost and resources are saved, and information-positioning efficiency is improved.

In addition, the preset clustering algorithm comprises a word embedding model and the WMD algorithm. Compared with traditional word-vector generation algorithms, the word embedding model produces dense word-vector representations, avoiding the vector-sparsity problem during training; word embeddings also better capture the semantic information of words and the relations among them, and generalize better semantically. Moreover, compared with edit distance, longest common subsequence (LCS), or Jaccard distance, the preset clustering algorithm uses semantic features and therefore has stronger generalization capability and higher accuracy.

In addition, since information extraction and analysis are performed only after the specified chapter has been located, the search range is limited and the extraction accuracy is improved.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk, which can store program codes.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.

Claims (10)

1. An information positioning method, comprising:
acquiring a document to be predicted;
performing chapter analysis on the document to be predicted to obtain a first single-text chapter tree corresponding to the document to be predicted; wherein the first single-text chapter tree comprises at least one node to be predicted, each node to be predicted stores the chapter information of a corresponding chapter to be predicted, a parent node of the node to be predicted stores the chapter information of the upper-level chapter of the chapter to be predicted, and a child node of the node to be predicted stores the chapter information of a lower-level chapter of the chapter to be predicted;
acquiring a pre-trained aggregation tree; the aggregation tree is obtained by fusing a second single text chapter tree corresponding to the unmarked document and a third single text chapter tree corresponding to the marked document by using a preset clustering algorithm; the preset clustering algorithm is used for clustering nodes with the same node characteristics to obtain aggregation nodes; at least one aggregation node in the aggregation tree has corresponding label information; and
fusing the first single-text chapter tree with the aggregation tree by using the preset clustering algorithm, so as to assign the label information of the aggregation tree to the first single-text chapter tree and obtain the label information of each node to be predicted in the first single-text chapter tree, wherein the label information is used to locate the chapter position of the chapter to be predicted.
2. The method of claim 1, wherein obtaining the pre-trained aggregation tree comprises:
converting a first number of unlabeled documents into the second single-text chapter tree;
fusing the second single text chapter tree by using a preset clustering algorithm to obtain a pre-training aggregation tree;
converting a second number of the labeled documents into a third single-text chapter tree, wherein nodes in the third single-text chapter tree comprise the label information, and the second number is smaller than the first number; and
fusing the third single-text chapter tree into the pre-training aggregation tree by using the preset clustering algorithm, so as to assign the label information of the third single-text chapter tree to the pre-training aggregation tree and obtain the trained aggregation tree.
3. The method of claim 1, wherein the node characteristics comprise: a hierarchy feature and an expression feature.
4. The method of claim 3,
the expression features comprise text features; correspondingly, the preset clustering algorithm comprises the Word Mover's Distance (WMD) algorithm to calculate a first distance between the text features of the chapter information in different nodes, and the text features of the chapter information in different nodes are considered the same when the first distance is smaller than a first threshold.
5. The method of claim 3,
the expression features comprise semantic features; correspondingly, the preset clustering algorithm comprises a word embedding model and the WMD algorithm: the word embedding model extracts the semantic features of the chapter information in different nodes, the WMD algorithm calculates a second distance between the semantic features of different nodes, and the semantic features of the chapter information in different nodes are considered the same when the second distance is smaller than a second threshold.
6. The method of claim 1, wherein after fusing the first single-text chapter tree with the aggregation tree using the pre-set clustering algorithm, further comprising:
acquiring target label information;
determining a target node with label information consistent with the target label information from the nodes to be predicted of the first single text chapter tree; and
performing information extraction on the chapter information stored in the target node of the document to be predicted.
7. A storage medium comprising a stored program, wherein, when the program is run, a processor performs the method of any one of claims 1 to 6.
8. An information locating device, comprising:
the document acquisition module is used for acquiring a document to be predicted;
the chapter analysis module is used for performing chapter analysis on the document to be predicted to obtain a first single-text chapter tree corresponding to the document to be predicted, wherein the first single-text chapter tree comprises at least one node to be predicted, each node to be predicted is used for storing chapter information of a corresponding chapter to be predicted, the parent node of a node to be predicted is used for storing the chapter information of the chapter one level above the chapter corresponding to that node, and a child node of a node to be predicted is used for storing the chapter information of a chapter one level below the chapter corresponding to that node;
the aggregation tree acquisition module is used for acquiring a pre-trained aggregation tree, wherein the aggregation tree is obtained by fusing, using a preset clustering algorithm, a second single-text chapter tree corresponding to an unlabeled document and a third single-text chapter tree corresponding to a labeled document; the preset clustering algorithm is used for clustering nodes with the same node characteristics to obtain aggregation nodes; and at least one aggregation node in the aggregation tree has corresponding label information; and
the tree fusion module is used for fusing the first single-text chapter tree with the aggregation tree using the preset clustering algorithm, so as to assign the label information of the aggregation tree to the first single-text chapter tree and obtain the label information of each node to be predicted in the first single-text chapter tree, wherein the label information is used for locating the chapter position of the chapter to be predicted.
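The single-text chapter tree the modules above operate on pairs each chapter with its parent (the chapter one level above) and children (its sub-chapters). The patent does not specify how the chapter analysis module parses a document, so the following stack-based construction from (level, heading) pairs is only one common way to build such a tree; all names are illustrative.

```python
# Sketch of chapter parsing: build a single-text chapter tree from
# (level, heading) pairs so that each node's parent is the chapter
# one level above it. Stack-based algorithm, an assumption — the
# patent leaves the parser unspecified.

class ChapterNode:
    def __init__(self, level, text):
        self.level = level          # hierarchy feature (heading depth)
        self.text = text            # chapter information
        self.parent = None
        self.children = []

def build_chapter_tree(headings):
    """headings: list of (level, text), e.g. [(1, "1 Overview"), (2, "1.1 Scope")]."""
    root = ChapterNode(0, "ROOT")
    stack = [root]
    for level, text in headings:
        node = ChapterNode(level, text)
        # pop until the top of the stack is one level shallower
        while stack[-1].level >= level:
            stack.pop()
        node.parent = stack[-1]
        stack[-1].children.append(node)
        stack.append(node)
    return root
```

Each node to be predicted is then a `ChapterNode`, and the tree fusion module compares these nodes against the aggregation tree's nodes.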
9. The apparatus of claim 8, wherein the aggregation tree acquisition module is configured to:
converting a first number of unlabeled documents into the second single-text chapter tree;
fusing the second single-text chapter tree using the preset clustering algorithm to obtain a pre-trained aggregation tree;
converting a second number of the labeled documents into a third single-text chapter tree, wherein nodes in the third single-text chapter tree comprise the label information, and the second number is smaller than the first number; and
fusing the third single-text chapter tree into the pre-trained aggregation tree using the preset clustering algorithm, so as to assign the label information of the third single-text chapter tree to the pre-trained aggregation tree and obtain the trained aggregation tree.
10. An information locating device, comprising:
a processor; and
a memory coupled to the processor, the memory being used for providing the processor with instructions for performing the following processing steps:
acquiring a document to be predicted;
performing chapter analysis on the document to be predicted to obtain a first single-text chapter tree corresponding to the document to be predicted, wherein the first single-text chapter tree comprises at least one node to be predicted, each node to be predicted is used for storing chapter information of a corresponding chapter to be predicted, the parent node of a node to be predicted is used for storing the chapter information of the chapter one level above the chapter corresponding to that node, and a child node of a node to be predicted is used for storing the chapter information of a chapter one level below the chapter corresponding to that node;
acquiring a pre-trained aggregation tree, wherein the aggregation tree is obtained by fusing, using a preset clustering algorithm, a second single-text chapter tree corresponding to an unlabeled document and a third single-text chapter tree corresponding to a labeled document; the preset clustering algorithm is used for clustering nodes with the same node characteristics to obtain aggregation nodes; and at least one aggregation node in the aggregation tree has corresponding label information; and
fusing the first single-text chapter tree with the aggregation tree using the preset clustering algorithm, so as to assign the label information of the aggregation tree to the first single-text chapter tree and obtain the label information of each node to be predicted in the first single-text chapter tree, wherein the label information is used for locating the chapter position of the chapter to be predicted.
CN202111642081.1A 2021-11-04 2021-12-29 Information positioning method, storage medium and device Pending CN114385776A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111302197 2021-11-04
CN2021113021970 2021-11-04

Publications (1)

Publication Number Publication Date
CN114385776A true CN114385776A (en) 2022-04-22

Family

ID=81200654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111642081.1A Pending CN114385776A (en) 2021-11-04 2021-12-29 Information positioning method, storage medium and device

Country Status (1)

Country Link
CN (1) CN114385776A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115757823A (en) * 2022-11-10 2023-03-07 魔方医药科技(苏州)有限公司 Data processing method and device, electronic equipment and storage medium
CN115757823B (en) * 2022-11-10 2024-03-05 魔方医药科技(苏州)有限公司 Data processing method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109726274B (en) Question generation method, device and storage medium
CN110633458A (en) Method and device for generating referee document
CN111831911A (en) Query information processing method and device, storage medium and electronic device
CN106682343B (en) Formal verification method of adjacency matrix based on graph
CN109325201A (en) Generation method, device, equipment and the storage medium of entity relationship data
CN112541359B (en) Document content identification method, device, electronic equipment and medium
CN108446316A (en) Recommendation method, apparatus, electronic equipment and the storage medium of associational word
CN112948575B (en) Text data processing method, apparatus and computer readable storage medium
CN110837581A (en) Method, device and storage medium for video public opinion analysis
CN112650858A (en) Method and device for acquiring emergency assistance information, computer equipment and medium
CN113282762A (en) Knowledge graph construction method and device, electronic equipment and storage medium
CN105653547A (en) Method and device for extracting keywords of text
CN113094512A (en) Fault analysis system and method in industrial production and manufacturing
CN110765276A (en) Entity alignment method and device in knowledge graph
CN114385776A (en) Information positioning method, storage medium and device
CN114049174A (en) Method and device for commodity recommendation, electronic equipment and storage medium
CN107368923B (en) Scenic spot heat prediction method and device
CN111274813A (en) Language sequence marking method, device storage medium and computer equipment
CN113392212A (en) Service knowledge graph construction method and device, electronic equipment and storage medium
CN112417996A (en) Information processing method and device for industrial drawing, electronic equipment and storage medium
CN113434627A (en) Work order processing method and device and computer readable storage medium
CN116523041A (en) Knowledge graph construction method, retrieval method and system for equipment field and electronic equipment
CN112836057A (en) Knowledge graph generation method, device, terminal and storage medium
CN111339760A (en) Method and device for training lexical analysis model, electronic equipment and storage medium
CN115861606A (en) Method and device for classifying long-tail distribution documents and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination