CN118170836A - File knowledge extraction method and device based on structure priori knowledge - Google Patents

File knowledge extraction method and device based on structure priori knowledge

Info

Publication number
CN118170836A
CN118170836A (application number CN202410592269.7A)
Authority
CN
China
Prior art keywords
feature
features
model
information
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410592269.7A
Other languages
Chinese (zh)
Other versions
CN118170836B (en)
Inventor
马兵
尹旭
王玉石
许然然
王明华
王永婷
张晓良
董航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Energy Shuzhiyun Technology Co ltd
Original Assignee
Shandong Energy Shuzhiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Energy Shuzhiyun Technology Co ltd
Priority to CN202410592269.7A
Publication of CN118170836A
Application granted
Publication of CN118170836B
Legal status: Active
Anticipated expiration

Classifications

    • G (Physics) › G06 (Computing; calculating or counting)
    • G06F16/254: Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F18/213: Feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/253: Fusion techniques of extracted features
    • G06F40/295: Named entity recognition
    • G06N3/006: Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N3/0455: Auto-encoder networks; encoder-decoder networks
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/084: Backpropagation, e.g. using gradient descent


Abstract

The embodiment of the invention provides a method and a device for extracting archival knowledge based on structure priori knowledge, relating to the technical field of data processing. The multi-feature information is fused on the basis of its feature correlations, and feature extraction and data dimension reduction are then performed, which strengthens the model's handling of long-distance dependencies in the data and improves its generalization capability. Structure priori knowledge of deep feature representation is further introduced to construct a label prediction model that performs entity recognition on the key information, so that the model better understands the structure and features of the data, makes more effective use of the dependency relationships among entities and their context information, and significantly improves the accuracy and robustness of entity recognition.

Description

File knowledge extraction method and device based on structure priori knowledge
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for extracting archival knowledge based on structure priori knowledge.
Background
In the present digital age, a large amount of document and archive information needs to be effectively managed and utilized, and especially in the field of Chinese archives, traditional information extraction methods face a great challenge due to the complexity of Chinese itself and the diversity of archive contents. The Chinese characters have complex structures and contain abundant radical and font information, which is very critical for understanding the semantics of the Chinese characters, but the information is often not fully utilized in the traditional archival knowledge extraction method. In addition, the conventional method often has unsatisfactory effects when processing long-distance dependency relationships, complex data structures and complex logic relationships among entities, resulting in lower accuracy and efficiency of information extraction.
The prior art has the following technical problems: (1) it may not fully utilize the structural and glyph characteristics of text data, particularly when processing Chinese documents rich in structural and semantic information, which limits the accuracy of entity recognition; (2) when dealing with complex data structures and long-range dependency problems, it may fail to effectively fuse features from different sources, limiting the generalization capability of the model; (3) it may fail to take full advantage of structural prior knowledge to enhance the accuracy and robustness of entity identification.
Disclosure of Invention
In view of the above, the present invention aims to provide a method and a device for extracting archival knowledge based on structure priori knowledge, which can enhance accuracy of entity identification.
In a first aspect, an embodiment of the present invention provides a method for extracting archival knowledge based on structure priori knowledge, where the method includes: acquiring a target document; extracting multiple features of the target document to obtain multi-feature information in the target document, the multi-feature information comprising structural features and font features; inputting the multi-feature information into a pre-constructed feature fusion model, and carrying out feature fusion on the multi-feature information based on its feature correlation to generate fusion features; performing feature extraction and data dimension reduction on the fusion features to obtain key information in the fusion features; and carrying out entity recognition on the key information through a pre-constructed label prediction model based on structure priori knowledge, and outputting the key entity information in the target document.
In a second aspect, an embodiment of the present invention further provides an archive knowledge extraction device based on structure priori knowledge, where the device includes: a data acquisition module for acquiring a target document; a feature extraction module for extracting multiple features of the target document to obtain multi-feature information in the target document, the multi-feature information comprising structural features and font features; a feature fusion module for inputting the multi-feature information into a pre-constructed feature fusion model and carrying out feature fusion on the multi-feature information based on its feature correlation to generate fusion features; a preprocessing module for performing feature extraction and data dimension reduction on the fusion features to obtain key information in the fusion features; and an output module for carrying out entity identification on the key information through a pre-constructed label prediction model based on structure priori knowledge and outputting the key entity information in the target document.
The embodiment of the invention has the following beneficial effects: according to the archive knowledge extraction method and device based on the structure priori knowledge, semantic information of Chinese characters can be more accurately captured by combining the detailed extraction of the structural features and the font features of the Chinese characters, and rich and accurate feature representation is provided for entity identification. After the multi-feature information is obtained, feature fusion is carried out based on feature correlation, so that the processing capacity of the model on complex data structures and the processing effect on long-distance dependence problems in sequence data can be enhanced, and the generalization capacity of the model is improved. In addition, the label prediction model is constructed by introducing the structure priori knowledge of deep feature representation, so that the model can better understand the structure and the features of the data, and therefore, the embodiment of the invention can more effectively utilize the dependency relationship among the entities and the context information thereof, and the accuracy and the robustness of entity identification are obviously improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for training an archive knowledge extraction model based on structure priori knowledge according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method for training an archive knowledge extraction model based on structure priori knowledge according to an embodiment of the present invention;
FIG. 3 is a block diagram of a multi-feature extraction model according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an archive knowledge extraction model training device based on structure priori knowledge according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purposes of clarity, technical solutions, and advantages of the embodiments of the present disclosure, the following description describes embodiments of the present disclosure with specific examples, and other advantages and effects of the present disclosure will be readily apparent to those skilled in the art from the disclosure herein. It will be apparent that the described embodiments are merely some, but not all embodiments of the present disclosure. The disclosure may be embodied or practiced in other different specific embodiments, and details within the subject specification may be modified or changed from various points of view and applications without departing from the spirit of the disclosure. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.
In order to solve the technical problems, the embodiment of the invention provides a method and a device for extracting archival knowledge based on structure priori knowledge, which can enhance the accuracy of entity identification.
Example 1
For the understanding of the present embodiment, first, a detailed description is given of a method for extracting archival knowledge based on structure priori knowledge disclosed in the embodiment of the present invention, fig. 1 shows a flowchart of a method for extracting archival knowledge based on structure priori knowledge provided in the embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:
Step S101, a target document is acquired.
Step S102, multi-feature extraction is carried out on the target document, and multi-feature information in the target document is obtained.
A document contains words such as person names and place names, and these words denote different entities. To acquire the entity information in a document, the embodiment of the invention performs multi-feature extraction on the document content in advance, so that entities can be identified on the basis of the extracted multi-feature information. The archive subjected to knowledge extraction is a Chinese archive, and Chinese characters generally have component and radical structures.
Specifically, the character pattern and the structural characteristics represent the external shape of the Chinese character, the combination form of radicals, the position relation of the radicals and the overall construction mode, and the characteristics are helpful for revealing the semantic and functional attributes of the words, thereby being helpful for the accuracy of entity identification. By extracting the structural features and the font features of each word in the document, the understanding capability of the model on the document content can be enhanced by utilizing the vision and the structural attributes of Chinese characters, especially the semantic understanding is enhanced when Chinese text is processed, and the accuracy and the robustness of the model are improved. Specifically, semantics contained in the document can be determined according to the text content combination, and further, entities in the document can be determined.
Step S103, inputting the multi-feature information into a pre-constructed feature fusion model, and carrying out feature fusion on the multi-feature information based on the feature correlation of the multi-feature information to generate fusion features.
The extracted multi-feature information is expressed as high-dimensional feature vectors, and the vectors can comprehensively express various attributes of the text. For example, a glyph feature vector may include encodings of stroke numbers, stroke types, and correlations; the structural feature vectors may then include the type, number of the encoded radicals and their layout information in Chinese characters.
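As a concrete illustration of such encodings, the sketch below builds toy glyph and structural feature vectors. The field layout (stroke count plus a per-type stroke histogram, and one-hot radical and layout ids) is an assumption for illustration only, not the patent's actual scheme.

```python
import numpy as np

# Illustrative, hand-rolled encodings for the two feature families above.
STROKE_TYPES = ["horizontal", "vertical", "left-falling", "right-falling", "hook"]
RADICALS = {"氵": 0, "木": 1, "口": 2}          # toy radical vocabulary
LAYOUTS = {"left-right": 0, "top-bottom": 1}    # toy layout vocabulary

def glyph_vector(stroke_count, stroke_hist):
    """Glyph features: total stroke number + per-type stroke histogram."""
    v = np.zeros(1 + len(STROKE_TYPES))
    v[0] = stroke_count
    for t, n in stroke_hist.items():
        v[1 + STROKE_TYPES.index(t)] = n
    return v

def struct_vector(radical, layout):
    """Structural features: one-hot radical id + one-hot layout id."""
    v = np.zeros(len(RADICALS) + len(LAYOUTS))
    v[RADICALS[radical]] = 1.0
    v[len(RADICALS) + LAYOUTS[layout]] = 1.0
    return v

# 河 (river): 8 strokes, water radical 氵, left-right layout
# (the stroke-type counts below are illustrative values)
g = glyph_vector(8, {"horizontal": 2, "vertical": 1})
s = struct_vector("氵", "left-right")
print(g.shape, s.shape)  # (6,) (5,)
```

In practice such vectors would be learned embeddings rather than hand-built one-hots, but the shape of the information carried (strokes on one side, radical identity and layout on the other) is the same.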
The embodiment of the invention combines the various extracted features, captures the dynamic correlations among them, enhances the model's capacity to process complex data structures, and effectively addresses the long-distance dependency problem in sequence data. Beyond improving the effect of information extraction and classification tasks, extracting and fusing multiple features markedly improves the ability to process Chinese documents and remedies the shortcomings of the prior art in handling the complexity and diversity of Chinese.
In specific implementation, the embodiment of the invention processes the new sample by using the trained model to realize archival knowledge extraction. In one embodiment, entity information such as key participants, dates, places, etc. is extracted from the meeting summary document. First, multi-feature extraction is performed on a new meeting summary document. Further, feature fusion is performed using a trained feature fusion model.
And step S104, carrying out feature extraction and data dimension reduction on the fusion features to obtain key information in the fusion features.
Step S105, entity recognition is carried out on the key information through a pre-constructed label prediction model, and the key entity information in the target document is output.
In order to improve the recognition accuracy, the embodiment of the invention also performs feature extraction and data dimension reduction on the fusion features so as to reduce the computational complexity and simultaneously retain key information. In particular implementations, the document may be processed through a pre-trained model to obtain high-dimensional feature vectors. Further, the feature vector is reduced in dimension by using a trained feature dimension reduction model, so that the computational complexity is reduced while key information is reserved.
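The patent does not name a specific dimension-reduction technique at this point. As one hedged illustration, the sketch below reduces stand-in fused feature vectors with PCA computed via an SVD, which retains the directions of greatest variance while shrinking the vector length.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                        # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # shape: (n_samples, k)

rng = np.random.default_rng(0)
fused = rng.normal(size=(100, 64))                 # stand-in fused feature vectors
reduced = pca_reduce(fused, 8)                     # keep 8 components
print(reduced.shape)                               # (100, 8)
```

Any method with the same contract (high-dimensional vectors in, compact vectors preserving key information out), such as a learned bottleneck layer, could fill this role.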
And further, performing label prediction on the feature vector subjected to dimension reduction by using the trained model, and performing entity identification and classification. The label prediction model based on the structure priori knowledge is used for carrying out entity identification and classification based on labels, wherein the labels comprise specific entity categories in the text, such as names of people. Further, according to the model prediction result, key entity information in the document, such as a participant list, conference date and place and the like, is output.
The embodiment of the invention introduces the structure priori knowledge of deep feature representation to construct the label prediction model, so that the model can better understand the structure and the features of the data. Furthermore, by combining the detailed extraction of the structural features and the font features of the Chinese characters, multi-feature information is obtained, and after fusion information and key information are obtained, the semantic information of the Chinese characters can be more accurately captured, and rich and accurate feature representation is provided for entity identification.
In summary, the archive knowledge extraction method based on the structure priori knowledge provided by the embodiment of the invention can more effectively utilize the dependency relationship among the entities and the context information thereof, and particularly remarkably improves the accuracy and the robustness of entity identification when processing complex entity relationship and rich context information.
Example two
Further, the prior art may lack the ability to search for and extract features at multiple granularities, and may lack an efficient bi-directional enhancement mechanism to optimize the feature extraction process. In addition, during feature compression and dimension reduction, the prior art may fail to efficiently retain critical interaction information, resulting in information loss in the high-dimensional feature space and potentially affecting processing efficiency. Therefore, on the basis of the foregoing embodiment, the embodiment of the present invention further provides another archive knowledge extraction method based on structure priori knowledge. Fig. 2 shows a flowchart of this method; as shown in fig. 2, the method includes the following steps:
step S201, a target document is acquired.
Step S202, inputting the target document into a pre-constructed multi-feature extraction model, and performing multi-feature extraction on the target document through the multi-feature extraction model to obtain multi-feature information in the target document.
In specific implementation, the embodiment of the invention performs multi-feature extraction through a pre-constructed multi-feature extraction model to determine the character pattern features and the structural features in the document. The multi-feature extraction model comprises a structural feature extraction model and a font feature extraction model; the structural feature extraction model is used for extracting structural features of the characters in the target document, and the font feature extraction model is used for extracting font features of the characters in the target document.
In the aspect of structural feature extraction, the invention identifies and extracts the radicals and structural features in Chinese characters through a deep learning model. Compared with the traditional method, the method not only identifies the basic components of the Chinese characters, but also further analyzes the construction modes of the Chinese characters, such as the combination form and the relative position relation of the components, which are deep features difficult to reach by the traditional method. The fusion of the components and the structural features can improve the understanding depth of Chinese characters and the semantics thereof, enhance the adaptability and the accuracy of the model under specific data sets and situations, and particularly can remarkably improve the accuracy of information extraction in the processing of Chinese texts. Based on the method, the semantic information of the Chinese character can be captured more accurately, and richer and more accurate feature representation is provided for subsequent entity identification.
In the aspect of character pattern feature extraction, the invention processes Chinese character documents through an improved convolutional neural network model, and in one embodiment, a VGG16 network (one type of convolutional neural network) architecture is adopted as a basis, but an input layer of the convolutional neural network is specially designed to adapt to the specificity of Chinese characters. The method can effectively extract the high-dimensional characteristics of the Chinese character patterns, the characteristics not only comprise strokes and structures of the Chinese characters, but also refine the characteristics to microscopic characteristics such as bending degree, thickness change and the like of the strokes, enriches the characteristic set and provides a basis for the accuracy of archive knowledge extraction.
In particular, fig. 3 shows a block diagram of a multi-feature extraction model that is capable of processing each word to obtain a respective corresponding feature vector. The method for constructing the multi-feature extraction model comprises the following steps:
1) And acquiring a Chinese file collected in advance, and collecting Chinese character data in the Chinese file.
2) And labeling the Chinese character data according to the Chinese character characteristics of the Chinese character data to obtain a data label.
The method first collects data from Chinese archives and labels the collected data according to Chinese character characteristics in order to construct the multi-feature extraction model. The Chinese character features include the radicals and glyphs of the character data, and data labels can be generated from features such as radicals and glyphs; for example, characters sharing a radical can be grouped and labeled accordingly, and the label prediction model then determines, from the data label, the entity category (such as a place name) that the labeled character corresponds to. In one embodiment, the Chinese archive is a meeting summary document, and the labels used to train the multi-feature extraction model are obtained by collecting and annotating the Chinese character data in that document.
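A minimal sketch of such radical-based labeling, using a tiny hypothetical radical table (河 and 湖 share the water radical 氵; 村 and 林 share the wood radical 木) and a made-up label scheme; the patent's actual label set is not disclosed at this level of detail.

```python
# Toy labeling pass: group characters by radical and emit coarse data labels.
RADICAL_OF = {"河": "氵", "湖": "氵", "村": "木", "林": "木"}

def label_chars(chars):
    """Map each character to its radical and a radical-derived data label."""
    labels = {}
    for c in chars:
        rad = RADICAL_OF.get(c)
        labels[c] = {"radical": rad, "label": f"RAD_{rad}"}
    return labels

labels = label_chars("河湖村")
print(labels["河"]["label"])  # RAD_氵
```

Characters carrying the water radical often appear in geographic names, which is the kind of structural cue the label prediction model can exploit when assigning entity categories such as places.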
3) Splitting and extracting structural features of the Chinese character data according to the radical structure by utilizing a first convolutional neural network, and determining structural feature representation; and capturing the character form detail characteristics of the Chinese character data through the VGG16 network.
After marking the data in the document, the embodiment of the invention respectively processes the structural features and the font features by using different models, then splices the structural features and the font features into a training sample, trains a third model based on the training sample, and constructs a multi-feature extraction model based on the trained third model.
A-determining a structural feature representation by:
In specific implementation, the convolutional neural network is utilized to split and extract structural features of each Chinese character according to the radical structure. The convolutional neural network learns the combination mode and the structural features of the radicals of the Chinese characters, converts the features into a vector form and generates a structural feature representation for each Chinese character.
Specifically, given a set of Chinese archival data D = {d_1, d_2, …, d_N}, where each d_i represents an individual document and each document contains a series of Chinese characters c_1, c_2, …, the radical set R_j = {r_1, r_2, …, r_m} of each Chinese character c_j is obtained by structural decomposition, where each r_k is a radical of c_j.
Further, the object of structural feature extraction is to map each radical r_k to a high-dimensional feature space, which can be expressed as:
F_struct(r_k) = σ(W_s · r_k + b_s)
where F_struct(r_k) is the structural feature representation of r_k, W_s and b_s are weight and bias parameters in the convolutional neural network, and σ is the Sigmoid nonlinear activation function.
Further, W_s and b_s are the weight and bias obtained by convolutional-layer learning. The update of W_s relies on gradient descent and can be expressed as:
W_s ← W_s − η_W · ∂L/∂W_s
where η_W is the learning rate of the weight, L is the loss function, and ∂L/∂W_s is the gradient of L with respect to W_s. In one embodiment, η_W is set to 0.01. The update of b_s likewise follows gradient descent and can be expressed as:
b_s ← b_s − η_b · ∂L/∂b_s
where ∂L/∂b_s is the gradient of L with respect to b_s and η_b is the learning rate of the bias.
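A toy numeric illustration of one such gradient-descent step on a sigmoid feature map, assuming a squared-error loss on a single radical encoding (the loss choice and the tiny dimensions are assumptions for illustration; the learning rate follows the embodiment's value of 0.01):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
r = rng.normal(size=3)              # encoded radical r_k (toy dimension)
t = np.array([0.2, 0.8])            # toy target feature representation
W = rng.normal(size=(2, 3))         # W_s
b = np.zeros(2)                     # b_s
eta_W = eta_b = 0.01                # learning rates, as in the embodiment

f = sigmoid(W @ r + b)              # forward pass: F_struct(r_k)
loss_before = 0.5 * np.sum((f - t) ** 2)

delta = (f - t) * f * (1 - f)       # dL/dz for L = 0.5 * ||f - t||^2
W = W - eta_W * np.outer(delta, r)  # W_s <- W_s - eta_W * dL/dW_s
b = b - eta_b * delta               # b_s <- b_s - eta_b * dL/db_s

loss_after = 0.5 * np.sum((sigmoid(W @ r + b) - t) ** 2)
print(loss_before, loss_after)      # the single step reduces the loss
```

In the patent's setting these updates run over the full training set inside a deep-learning framework; the arithmetic per parameter is the same.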
B-capturing the character form detail characteristics of the Chinese character data by the following steps:
In a specific implementation, the VGG16 network is utilized to process Chinese characters and extract glyph features. The VGG16 network is characterized by its depth and simplicity: it comprises 16 weighted layers, namely 13 convolutional layers and 3 fully connected layers, interleaved with 5 max-pooling layers. VGG16 uses small 3x3 convolution kernels for feature extraction, with nonlinear activation by the ReLU function applied throughout the convolutional layers. After each convolutional block, VGG16 performs spatial downsampling using a 2x2 max-pooling layer to extract the dominant features and reduce the number of parameters.
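A minimal numpy illustration of the two building blocks named above, a 3x3 convolution with ReLU and a 2x2 max pool. This is not VGG16 itself, only the block mechanics; valid padding is used here for brevity, whereas VGG16 pads to preserve spatial size.

```python
import numpy as np

def conv3x3(img, kernel):
    """3x3 convolution (valid padding) followed by ReLU, as in a VGG conv layer."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i+3, j:j+3] * kernel)
    return np.maximum(out, 0)        # ReLU nonlinearity

def maxpool2x2(x):
    """2x2 max pooling with stride 2, the VGG downsampling step."""
    h, w = x.shape
    return x[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

img = np.random.default_rng(2).random((8, 8))   # stand-in glyph bitmap
feat = maxpool2x2(conv3x3(img, np.ones((3, 3))))
print(feat.shape)                               # (3, 3)
```

Stacking many such conv blocks and pools is what lets the real network progressively abstract stroke-level detail into the high-dimensional glyph features described below.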
The network captures the detail features of Chinese character glyphs through deep convolution and pooling operations and encodes them into high-dimensional feature vectors. In one embodiment, the glyph features of a Chinese character c_j are extracted through the VGG16 network; for a given Chinese character document, glyph feature extraction can be expressed as:
F_glyph(c_j) = VGG16(x_j)
where x_j represents the character c_j as presented to the VGG16 network through its input layer, and F_glyph(c_j) is the glyph feature representation of c_j.
4) And splicing the structural feature representation and the font detail feature into a comprehensive feature vector, training the second convolutional neural network model through the comprehensive feature vector, and optimizing model parameters of the second convolutional neural network model by adopting an Adam optimizer.
And determining the context features corresponding to the Chinese character data according to the structural feature representation and the glyph detail features, wherein the context features represent the association between the structural feature representation and the glyph detail features. Let F_context be the context features of the document, generated jointly by the structural features and the glyph features, which can be expressed as:

F_context = W_c · [F_struct ; F_glyph] + b_c

where W_c and b_c are a weight matrix and a bias vector respectively, and [· ; ·] denotes feature concatenation.
Further, a relevance function rel(·) evaluates the relevance of the structural feature representation and of the glyph detail features to the context feature; based on the relevance, the feature weights corresponding to the structural feature representation and the glyph detail features are calculated. The dynamic weights of the structural features and the glyph features are obtained by normalizing the relevance scores, which can be expressed as:

α_struct = exp(rel(F_struct, F_context)) / (exp(rel(F_struct, F_context)) + exp(rel(F_glyph, F_context)))
α_glyph = 1 − α_struct

where α_struct and α_glyph are the dynamic weights of the structural features and the glyph features respectively. The function rel(·), used to evaluate the relevance of a feature to the context, can be expressed as:

rel(F_x, F_context) = v_rᵀ · tanh(W_r · [F_x ; F_context] + b_r)

where W_r and b_r are weight and bias parameters in the convolutional neural network, v_r is a weight matrix that projects the hidden representation to a scalar score, and F_x is the feature being evaluated.
Further, the structural feature representation and the glyph detail features are spliced into a comprehensive feature vector based on the feature weights. Specifically, this is realized by a splicing function Concat(·) based on a dynamic feature correction mechanism, which takes the interaction and complementarity of the two features into account. By evaluating the context information in the document, the weights of the structural features and the glyph features in the splicing process are dynamically adjusted, so that the feature weights can be adaptively adjusted according to the specific content of the document and the importance of the features, thereby optimizing the feature representation. This can be expressed as:

F_concat = Concat(α_struct · F_struct, α_glyph · F_glyph)

Further, the calculation of α_struct is based on the context feature F_context of the document.
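The dynamic weighting of the structural and glyph features against the context feature can be sketched as follows; the additive form of the relevance function and the normalisation of the two scores into α_struct and α_glyph are illustrative assumptions:

```python
import numpy as np

def relevance(f, ctx, W, b, v):
    # additive relevance: v^T tanh(W [f; ctx] + b) -> scalar score (assumed form)
    return float(v @ np.tanh(W @ np.concatenate([f, ctx]) + b))

rng = np.random.default_rng(1)
d = 8
F_struct, F_glyph = rng.random(d), rng.random(d)
W_c, b_c = rng.standard_normal((d, 2 * d)), rng.standard_normal(d)
F_context = W_c @ np.concatenate([F_struct, F_glyph]) + b_c

W_r, b_r = rng.standard_normal((d, 2 * d)), rng.standard_normal(d)
v_r = rng.standard_normal(d)
scores = np.array([relevance(F_struct, F_context, W_r, b_r, v_r),
                   relevance(F_glyph, F_context, W_r, b_r, v_r)])
s = scores - scores.max()
alpha = np.exp(s) / np.exp(s).sum()          # normalised dynamic weights
F_concat = np.concatenate([alpha[0] * F_struct, alpha[1] * F_glyph])
print(F_concat.shape)  # (16,)
```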
After the comprehensive feature vector is obtained, training the second convolutional neural network model through the comprehensive feature vector, wherein the specific steps are as follows:
The comprehensive feature vector is input into another convolutional neural network model (namely, the second convolutional neural network model) to learn the mapping between Chinese characters and their semantic information. Specifically, given a comprehensive feature vector F_concat, a convolutional neural network model M is trained to predict the labels of Chinese characters (i.e., the data labels described above). The goal of model training is to minimize the difference between the predicted labels and the real labels, expressed using a cross-entropy loss function:

L = − Σ_i y_i · log(ŷ_i)

where y_i is the real label, ŷ_i is the label predicted by the model M for Chinese character i, and the summation over i traverses all training samples.
Further, the predicted label ŷ is calculated by a preset softmax function, which can be expressed as:

ŷ_k = exp(z_k) / Σ_{j=1}^{K} exp(z_j)

where z_k is the raw score (logit) for class k that is input to the softmax function, and K is the total number of categories.
Further, an Adam optimizer is adopted for parameter optimization of the model, with the following update rule:

θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε)

where θ_{t+1} are the model parameters at iteration t+1 and θ_t are the model parameters at iteration t. η is the learning rate, preset manually. m̂_t and v̂_t are the bias-corrected first-order and second-order moment estimates respectively, and ε is a small constant added to maintain numerical stability. In one embodiment, η is set to 0.01 and ε is set to 0.001.
Further, the calculations of m̂_t and v̂_t are based on the moment estimates m_t and v_t, which can be expressed as:

m_t = β₁ · m_{t−1} + (1 − β₁) · g_t
v_t = β₂ · v_{t−1} + (1 − β₂) · g_t²
m̂_t = m_t / (1 − β₁ᵗ),  v̂_t = v_t / (1 − β₂ᵗ)

where m_t is the first-order moment estimate at iteration t and m_{t−1} is that at iteration t−1; β₁ is the exponential decay rate of the first-order moment estimate and β₂ is that of the second-order moment estimate; g_t = ∇_θ L_t is the gradient of the loss function L_t at iteration t; and v_t is the second-order moment estimate at iteration t.
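The Adam update described above can be sketched as a standalone step function; the quadratic toy objective f(θ) = θ² and the hyperparameter values are illustrative only:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad              # first-order moment estimate
    v = b2 * v + (1 - b2) * grad ** 2         # second-order moment estimate
    m_hat = m / (1 - b1 ** t)                 # bias corrections
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimise f(theta) = theta^2, whose gradient is 2*theta
theta, m, v = np.array([2.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(abs(float(theta[0])) < 0.1)  # True
```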
5) And constructing a multi-feature extraction model based on the trained second convolutional neural network model.
In summary, the constructed multi-feature extraction model is obtained; multi-feature extraction is carried out on the target document through this model, and the obtained features constitute the multi-feature information.
Then, feature fusion is carried out on the multi-feature information. The invention provides a Transformer algorithm based on cracked features, which is improved by means of Chebyshev polynomials to optimize the feature fusion process, improve the efficiency and accuracy of the algorithm, enhance the model's ability to process complex data structures, and effectively address the long-distance dependency problem in sequence data. Specifically, refer to the following steps S203 to S206.
And step S203, performing feature cracking processing on the multi-feature information to obtain a sub-feature set with fine granularity.
In order to capture the correlations among features more carefully, the embodiment of the invention performs feature cracking processing on the structural features and the glyph features, decomposing them into sub-feature sets of finer granularity. The structural features F_struct and the glyph features F_glyph are subdivided by the cracking process to capture finer-grained information. Let the cracking operation be Crack(·); the cracked features are expressed as:

F′_struct = Crack(F_struct),  F′_glyph = Crack(F_glyph)

In one embodiment, the objective of the cracking process Crack(·) is to subdivide the original features into smaller sub-feature sets to increase the fineness of feature processing. The specific computational process can be expressed as:

Crack(F) = M_crack · F

where F is the original feature vector and M_crack is a cracking matrix used to map the original feature vector into a higher-dimensional feature space. In this embodiment, the cracking matrix M_crack is constructed from the statistical properties of the features; specifically, principal component analysis extracts the principal directions of variation of the features.
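One possible reading of the cracking step, sketched below, derives the cracking matrix M_crack from principal component analysis (the right singular vectors of the centred data) and regroups the projected coordinates into sub-feature sets:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((100, 8))                     # a batch of original feature vectors

# Cracking matrix from PCA: rows are principal directions of variation
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
M_crack = Vt                                 # (8, 8) orthonormal cracking matrix

f = X[0]
sub_features = (M_crack @ f).reshape(4, 2)   # regroup projection into 4 sub-feature sets
print(sub_features.shape)  # (4, 2)
```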
Step S204, capturing the dynamic correlations corresponding to the sub-feature sets by adopting a Transformer structure to obtain the integrated feature.
The method adopts a Transformer structure to process the cracked features; its self-attention mechanism effectively captures the dynamic correlations among features, so that long-range feature information can be effectively integrated. Specifically, let the encoding function of the Transformer be Encode(·); the features after Transformer processing are expressed as:

F_trans = Encode([F′_struct ; F′_glyph])

The Transformer encoding function Encode(·) comprises the self-attention mechanism and a feedforward neural network, where the weights in the self-attention mechanism are calculated as:

Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V

where Q, K and V are the query, key and value matrices respectively, and d_k is the dimension of the key vectors.
Further, the query Q, key K and value V are calculated as:

Q = F′ · W_Q,  K = F′ · W_K,  V = F′ · W_V

where W_Q, W_K and W_V are learnable weight matrices used to map the feature representation F′ into the query, key and value spaces. d_k, the dimension of the key vectors, is used to scale the dot product and prevent the gradient-vanishing problem caused by overly large dot products.
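The scaled dot-product self-attention at the core of the Transformer encoder can be sketched as:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Q, K, V projections of the input sequence X
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # scaled dot product
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)           # row-wise softmax
    return A @ V, A

rng = np.random.default_rng(3)
n, d = 5, 8                                      # 5 sub-features of dimension 8
X = rng.random((n, d))
out, A = self_attention(X, *(rng.standard_normal((d, d)) for _ in range(3)))
print(out.shape, A.shape)  # (5, 8) (5, 5)
```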
Step S205, to reduce the distortion and information loss of the integrated feature in the high-dimensional space, optimizing the integrated feature by using a Chebyshev polynomial to obtain an optimized feature.
Specifically, a Chebyshev polynomial T_n(x) is used to optimize the features output by the Transformer, which effectively reduces the distortion and information loss of the features in the high-dimensional space. T_n(x) is defined by the recurrence:

T_0(x) = 1,  T_1(x) = x,  T_n(x) = 2x · T_{n−1}(x) − T_{n−2}(x)

Further, T_n(·) is applied to the transformed structural features and glyph features to achieve feature optimization; the optimized features are expressed as:

F_opt,struct = T_n(F_trans,struct),  F_opt,glyph = T_n(F_trans,glyph)
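The Chebyshev recurrence can be sketched and checked against its closed forms; applying T_n element-wise to features scaled into [−1, 1], where the polynomials remain bounded, is an illustrative assumption:

```python
import numpy as np

def chebyshev(x, n):
    # T_0(x) = 1, T_1(x) = x, T_n(x) = 2x*T_{n-1}(x) - T_{n-2}(x)
    t0, t1 = np.ones_like(x), x
    for _ in range(n - 1):
        t0, t1 = t1, 2 * x * t1 - t0
    return t1 if n >= 1 else t0

f = np.linspace(-1.0, 1.0, 5)     # feature values scaled into [-1, 1]
print(np.allclose(chebyshev(f, 2), 2 * f ** 2 - 1))  # True: T_2(x) = 2x^2 - 1
```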
And S206, calculating importance scores corresponding to the optimized features and the multi-feature information respectively, and fusing the optimized features and the multi-feature information based on the importance scores to obtain fused features.
Specifically, assume that the features F_opt,struct, F_opt,glyph and F_trans have importance scores s_struct, s_glyph and s_trans respectively. Taking the dynamic weight of the structural features α_struct as an example, the calculation method may be expressed as:

α_struct = exp(s_struct) / (exp(s_struct) + exp(s_glyph) + exp(s_trans))

The optimized features F_opt,struct and F_opt,glyph and the integrated feature F_trans obtained after stitching are fused to obtain the fusion feature; the fusion function Fuse(·) can be expressed as:

F_fusion = Fuse(F_opt,struct, F_opt,glyph, F_trans)

where Fuse(·) is a weighting function whose weights are the dynamic weights; that is, through dynamic weight distribution, the proportion of each feature in the fusion feature can be adaptively adjusted according to its importance. The final fusion feature is:

F_fusion = α_struct · F_opt,struct + α_glyph · F_opt,glyph + α_trans · F_trans

where the importance scores s_struct, s_glyph and s_trans are obtained from feature-based principal component analysis, and α_struct, α_glyph and α_trans are the dynamic weights of the features F_opt,struct, F_opt,glyph and F_trans respectively.
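The importance-driven fusion can be sketched as a weighted sum whose weights are a normalisation of the importance scores; the scores below are illustrative stand-ins for the PCA-derived values:

```python
import numpy as np

def fuse(features, scores):
    # dynamic weights from importance scores, then a weighted sum of the features
    w = np.exp(scores) / np.exp(scores).sum()
    return sum(wi * fi for wi, fi in zip(w, features)), w

rng = np.random.default_rng(4)
F_opt_struct, F_opt_glyph, F_trans = (rng.random(8) for _ in range(3))
F_fused, w = fuse([F_opt_struct, F_opt_glyph, F_trans], np.array([0.5, 0.2, 0.3]))
print(F_fused.shape, round(float(w.sum()), 6))  # (8,) 1.0
```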
And S207, performing feature extraction and data dimension reduction on the fusion features to obtain key information in the fusion features.
The fusion feature is determined by fusing the finer-granularity components of the multi-feature information according to their dynamic correlations; it can effectively integrate long-range feature information and handle the long-distance dependency problem in sequence data. By determining the key information in the fusion feature, the accuracy of entity recognition can be improved. Specifically, in the embodiment of the invention, feature extraction is performed on the fusion feature through a pre-constructed feature extraction model, and the extracted features are then reduced in dimension through a feature dimension reduction model to determine the key information.
A-constructing a feature extraction model by the following steps:
In a specific implementation, the feature extraction model can be trained with the fusion features corresponding to the training samples. The invention provides a neural network algorithm based on a multi-granularity whale optimization algorithm, which extracts richer and more meaningful features from the data in a bidirectionally enhanced manner, enabling more accurate entity recognition. The invention improves the whale optimization algorithm so that it can search the feature space at multiple granularities, capturing data features at different levels; this allows the algorithm to understand the data structure more comprehensively and to extract features more conducive to entity recognition. In addition, on the basis of multi-granularity feature extraction, the invention adopts a bidirectional enhancement mechanism that adjusts the feature extraction process from two directions: first, the semantic understanding of the features is deepened through a deep learning model; second, the search strategy of the whale optimization algorithm is guided through a back-propagation optimization algorithm. This improves the accuracy of feature extraction, and the bidirectional interplay enhances the adaptive capacity of the algorithm.
In specific implementation, the method comprises the following steps a1-a4:
a1 Using the improved whale optimization algorithm to perform preliminary feature extraction on a preset training sample set on a plurality of granularities, and searching for an optimal feature subset in a feature space by simulating social behaviors and predation strategies of whales.
Specifically, the preliminary feature extraction steps are as follows:
(1) Initialization: generate the initial whale population positions X_i, i = 1, 2, …, N, where N is the size of the whale population. Each whale position X_i represents a solution in the feature space, i.e., the selection of a set of features.
(2) Fitness calculation: for each whale position X_i, calculate its fitness f(X_i). The fitness evaluation is based on classifier performance under the selected features; in one embodiment, the fitness value is the F1 score.
(3) Position update: update the whale positions according to the current optimal solution X*. The position update formula simulates whale predation behavior and random search, specifically:

D = |C · X* − X(t)|,  X(t+1) = X* − A · D

where A and C are coefficients computed iteratively by the algorithm, and D is the distance between the current whale position and the target whale (the optimal solution). Further, the coefficients A and C are calculated dynamically as the algorithm iterates, which can be expressed as:

A = 2a · r − a,  C = 2r

where a is a coefficient that decreases linearly from 2 to 0 with the iteration number, and r is a random number in the range [0, 1].
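The encircling-prey position update with the coefficients A = 2a·r − a and C = 2r can be sketched as follows; the quadratic fitness function is only a stand-in for the F1-based evaluation named above:

```python
import numpy as np

def woa_step(X, X_best, t, T_max, rng):
    # encircling-prey update: A = 2a*r - a, C = 2r, with a decaying linearly 2 -> 0
    a = 2 * (1 - t / T_max)
    r1, r2 = rng.random(X.shape), rng.random(X.shape)
    A, C = 2 * a * r1 - a, 2 * r2
    D = np.abs(C * X_best - X)
    return X_best - A * D

rng = np.random.default_rng(5)
pop = rng.random((10, 6))                       # 10 whales, 6-dim feature selection
fitness = lambda x: -np.sum((x - 0.5) ** 2)     # stand-in for an F1-based fitness
best = pop[np.argmax([fitness(x) for x in pop])]
pop = np.array([woa_step(x, best, t=1, T_max=50, rng=rng) for x in pop])
print(pop.shape)  # (10, 6)
```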
a2) Inputting the optimal feature subset into a deep learning model for semantic deepening to obtain enhanced features.
a3) Calculating the loss corresponding to the enhanced features, and adjusting the search strategy of the whale optimization algorithm according to the loss.
Specifically, the features extracted preliminarily are input into a deep learning model for semantic deepening, and the searching strategy of a whale optimization algorithm is adjusted by using feedback of the model. The deep learning model and the whale optimization algorithm interact, learn each other and jointly optimize the feature set.
Specifically, the steps for optimizing the feature set are as follows:
(1) Semantic enhancement is performed on the extracted features using a preset deep neural network DNN(·), namely:

F_enh = DNN(F)

where F_enh is the feature representation enhanced by the neural network.
(2) And adjusting a searching strategy of the whale optimization algorithm according to feedback of the deep learning model.
Specifically, the gradient information of the loss function L is used to direct the update of the whale positions, namely:

X(t+1) = X(t) − η · ∇_X L(ỹ, F_enh)

where ỹ is the label corrected by dynamic label correction, η is the learning rate, and ∇_X L is the gradient of the loss function L with respect to the whale position X. In one embodiment, the loss function L is a cross-entropy loss.
The dynamically corrected label ỹ is obtained by a dynamic label correction module that adopts an adaptive adjustment mechanism. The weight of each label is dynamically adjusted according to the model's performance feedback during training; the uncertainty of the original labels is taken into account, and the label distribution is optimized through real-time model feedback, thereby reducing the negative influence of erroneous labels on training and enhancing the robustness of the model.
Specifically, let Y be the original training label set and let the model's predicted output be Ŷ. The calculation of the corrected label ỹ can be expressed as:

ỹ = (1 − λ) · y + λ · ŷ

where λ is an adaptive adjustment coefficient, which can be expressed as:

λ = γ · (1 − Acc)

where γ is a preset scaling factor used to control the intensity of label correction; in one embodiment, the scaling factor γ is set to 0.01. Acc is the accuracy of the model on the validation set in the current iteration, used to evaluate the model's current performance.
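Under the assumption that the corrected label blends the original label with the model prediction, ỹ = (1 − λ)·y + λ·ŷ with λ = γ·(1 − Acc), the dynamic label correction can be sketched as:

```python
import numpy as np

def correct_labels(y, y_pred, acc, gamma=0.01):
    # lambda = gamma * (1 - acc): correct more when validation accuracy is low (assumed form)
    lam = gamma * (1 - acc)
    return (1 - lam) * y + lam * y_pred

y = np.array([[1.0, 0.0], [0.0, 1.0]])          # one-hot training labels
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])     # model softmax outputs
y_corr = correct_labels(y, y_pred, acc=0.8)
print(np.allclose(y_corr.sum(axis=1), 1.0))  # True: corrected labels stay normalised
```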
Further, the calculation of the gradient information ∇_X L can be expressed as:

∇_X L = ∂L(F_enh, y) / ∂X

where L(F_enh, y) is the loss function computed from the enhanced features F_enh and the real labels y; in one embodiment, it is a cross-entropy loss.
Further, in order to ensure that the feature set finally used for entity recognition contains rich semantic information and is highly discriminative, the embodiment of the invention also fuses and selects the bidirectionally enhanced and optimized features, so as to determine the feedback of the deep learning model based on the selected features.
Specifically, the bidirectionally enhanced features F_enh and the original features F_orig (i.e., the optimal feature subset obtained in step a1) are fused to obtain the final feature representation F_final:

F_final = β · F_enh + (1 − β) · F_orig

where β is a fusion weight, preset manually, used to balance the contributions of the enhanced features and the original features. In one embodiment, β is set to 0.3.
Further, feature selection is performed on F_final, selecting a certain proportion of features to reduce the feature dimension and improve model performance. The importance scores can be calculated by a preset random forest algorithm under cross-validation. In one embodiment, the top 50% of features ranked by importance score are selected.
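The fusion-and-selection step can be sketched as follows; the tanh network standing in for the deep model and the use of per-feature variance in place of the random-forest importance score are illustrative simplifications:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 50, 10
F_orig = rng.random((n, d))
F_enh = np.tanh(F_orig @ rng.standard_normal((d, d)))   # stand-in for DNN-enhanced features
beta = 0.3
F_final = beta * F_enh + (1 - beta) * F_orig            # fusion with preset weight

# keep the top 50% of features; variance stands in for the random-forest importance
importance = F_final.var(axis=0)
keep = np.argsort(importance)[::-1][: d // 2]
F_selected = F_final[:, np.sort(keep)]
print(F_selected.shape)  # (50, 5)
```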
a4) When the loss reaches the preset threshold, constructing the feature extraction model based on the whale optimization algorithm and the deep learning model.
In summary, a feature extraction model is constructed to perform feature extraction on the fused features through the model.
B-constructing a characteristic dimension reduction model by the following steps:
The invention provides a self-encoding neural network algorithm based on potential interactive compression. By adopting a potential interactive compression mechanism, the self-encoder is allowed to learn and retain key interaction information among different features while compressing them, which increases the model's ability to capture nonlinear relations among features and thus provides richer information for the classifier.
Specifically, the method comprises the following steps b1-b4:
b1) Defining the encoder and decoder of the custom self-encoder.
A custom self-encoder structure is defined, comprising an encoder part z = f_enc(x; θ_e) and a decoder part x̂ = f_dec(z; θ_d), where x is the input feature vector, z is the potential representation vector, and θ_e and θ_d are the parameters of the encoder and decoder respectively. The encoder is used to map the high-dimensional input features to a low-dimensional potential space, and the decoder restores the low-dimensional representation to the original space.
Further, a potential interaction learning loss function is set to encourage the model to learn interaction information between features in the potential space. The potential interaction learning loss function comprises a reconstruction error and a regularization term, where the reconstruction error is computed as a weighted sum based on feature importance. The loss function optimized during training can be expressed as:

L = L_recon(x, x̂) + λ · Ω(W)

where L_recon(x, x̂) is the reconstruction error between the input x and the reconstructed output x̂, Ω(W) is the L2 regularization term on the interaction weights w_ij, and λ is the regularization coefficient.
In one embodiment, the reconstruction error L_recon is computed in a weighted manner, considering that different features may differ in importance:

L_recon(x, x̂) = Σ_i w_i · (x_i − x̂_i)²

where w_i is the importance weight of feature i, adaptively adjusted according to the behavior of the feature in the training data. The regularization term Ω(W) is used to prevent model overfitting and to ensure that the values of the interaction weights w_ij do not grow too large; it can be expressed as:

Ω(W) = Σ_{i,j} w_ij²
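The potential interaction learning loss, a weighted reconstruction error plus an L2 regulariser on the interaction weights, can be sketched as:

```python
import numpy as np

def latent_interaction_loss(x, x_hat, w_feat, W_int, lam=1e-3):
    recon = np.sum(w_feat * (x - x_hat) ** 2)   # importance-weighted reconstruction error
    reg = lam * np.sum(W_int ** 2)              # L2 penalty on interaction weights w_ij
    return recon + reg

rng = np.random.default_rng(7)
x, x_hat = rng.random(8), rng.random(8)
w_feat = np.full(8, 1 / 8)                      # per-feature importance weights
W_int = rng.standard_normal((4, 4))             # latent interaction weights
print(latent_interaction_loss(x, x_hat, w_feat, W_int) > 0)  # True
```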
b2 A preset training sample set is obtained, and the encoder and decoder of the self-encoder are trained through the training sample set.
Specifically, the output of the feature extraction model may be used to train the self-encoder model. During encoding, the input feature vector x is encoded into the potential representation z, which can be expressed as z = f_enc(x; θ_e); i.e., the encoder f_enc maps the high-dimensional input to the low-dimensional potential space through a series of learned nonlinear transformations.
b3) Optimizing the parameters of the self-encoder through the preset potential interaction learning loss function, introducing a feature interaction term in the potential space, and reconstructing the features based on the feature interaction term.
In a specific implementation, parameter optimization is performed based on the potential interaction learning loss function, and the embodiment of the invention introduces an interaction term I(z) in the potential space to enhance the model's ability to capture complex relationships between features, which can be expressed as:

I(z) = Σ_{i<j} w_ij · z_i · z_j

where w_ij represents the weight of the interaction between feature z_i and feature z_j. Further, the weight w_ij, a parameter measuring the interaction strength between features z_i and z_j in the potential representation z, is obtained through an adaptive learning mechanism that considers the correlation between features and their contribution to the task.
In one embodiment, w_ij is implemented through a preset neural network g(·), which takes z_i and z_j as input and outputs w_ij; this can be expressed as:

w_ij = g(z_i, z_j; θ_g)

where g is a parameterized network and θ_g are the network parameters, learned through training to effectively evaluate the importance of interactions between features.
Further, during decoding, the potential representation z is reconstructed into x̂ by the decoder f_dec while taking the interaction term I(z) into account, which can be expressed as:

x̂ = f_dec(z; θ_d) + I(z)
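The interaction-augmented reconstruction can be sketched under two assumptions: a pairwise product form I(z) = Σ_{i<j} w_ij·z_i·z_j and an additive contribution of I(z) to the decoder output; all weights below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(8)
d_in, d_z = 8, 4
W_enc, W_dec = rng.standard_normal((d_z, d_in)), rng.standard_normal((d_in, d_z))

x = rng.random(d_in)
z = np.tanh(W_enc @ x)                            # encoder: high-dim -> latent

# pairwise interaction term I(z) = sum_{i<j} w_ij * z_i * z_j
W_int = rng.standard_normal((d_z, d_z))
i, j = np.triu_indices(d_z, k=1)
interaction = np.sum(W_int[i, j] * z[i] * z[j])

x_hat = W_dec @ z + interaction                   # decoder output augmented with I(z)
print(x_hat.shape)  # (8,)
```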
b4 Building a feature dimension reduction model based on the trained self-encoder.
In summary, a feature dimension reduction model is constructed; the trained encoder part performs dimension reduction processing on the input features to obtain a compressed feature representation z, which serves as the key information used by the label prediction model for entity recognition.
Further, the embodiment of the invention also calculates the recalibration weight and the potential dimension task correlation measure of the key information; adjusting the key information according to the recalibration weight and the potential dimension task correlation measurement to obtain recalibration information; and determining the recalibration information as key information of the fusion characteristics.
In a specific implementation, the recalibration weight of each feature is calculated by preset neurons, which can be expressed as:

w = σ(W_r · z + b_r)

where W_r and b_r are the weight and bias parameters of the preset neurons, and σ is the sigmoid activation function, used to limit the weights to the range (0, 1).
Further, w is used to adjust the potential representation z to obtain the recalibrated representation z′, which can be expressed as:

z′ = Λ · (w ⊙ z)

where ⊙ denotes the element-wise Hadamard product, and Λ is a diagonal matrix whose diagonal elements λ_i represent the weights of the i-th potential dimension, dynamically calculated by:

λ_i = exp(relevance(z_i) / τ) / Σ_j exp(relevance(z_j) / τ)

Here relevance(z_i) denotes a metric of the relevance of the i-th potential dimension to the final task, and τ is an adjustable scale parameter used to control the decoupling strength. In one embodiment, the relevance(·) function is implemented through an auxiliary network or task performance feedback.
Based on this, the obtained feature z′ is the dimension-reduced feature, namely the key information.
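The recalibration of the key information can be sketched as follows; the stand-in relevance scores and the softmax-with-temperature form of the dimension weights λ_i are assumptions:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(9)
d = 6
z = rng.random(d)                                  # latent representation from the encoder
W_r, b_r = rng.standard_normal((d, d)), rng.standard_normal(d)

w = sigmoid(W_r @ z + b_r)                         # recalibration weights in (0, 1)
relevance = rng.random(d)                          # stand-in task-relevance scores
tau = 0.5                                          # temperature / decoupling strength
lam = np.exp(relevance / tau) / np.exp(relevance / tau).sum()
z_recal = np.diag(lam) @ (w * z)                   # recalibrated key information
print(z_recal.shape)  # (6,)
```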
And step S208, carrying out entity recognition on the key information through a pre-constructed label prediction model, and outputting the key entity information in the target document.
The invention provides a label prediction model based on structure priori knowledge, which combines a deep learning framework of the structure priori knowledge and a conditional random field model to form a structured deep learning model. Compared with the traditional conditional random field, the structured deep learning model can more effectively utilize the dependency relationship among the entities and the context information thereof, and the accuracy and the robustness of entity identification are remarkably improved in a structured mode. Wherein the training process of the model first includes learning deep feature representations of the document using the self-encoder. The structured deep learning model then uses these features and structure prior knowledge to predict tags for entities in the document.
In specific implementation, the embodiment of the invention constructs the label prediction model through the following steps:
1) And acquiring a preset training sample set.
The training sample set comprises a document sample and a sample label corresponding to the document sample, wherein the sample label is used for representing an entity corresponding to the document sample, such as a name, a place, time and the like. Specifically, the tag prediction model may be trained by the above-described feature after dimension reduction.
2) And learning the deep feature representation of the training sample set through a self-encoder, and merging the deep feature representation into a conditional random field model to perform structured entity label prediction.
In specific implementation, the embodiment of the invention designs a specific model structure and a training strategy based on the structure priori knowledge of the archive file so as to explicitly simulate the logic relationship and the dependency among the entities and perform the fusion of the structure priori knowledge.
For the dimension-reduced features, the features are fused into a conditional random field model to predict structured entity labels; the conditional random field layer considers the extracted feature sequence X = (x_1, …, x_T) and outputs the entity label sequence Y = (y_1, …, y_T).
Specifically, the conditional probability of a label sequence Y = (y_1, …, y_T) given an input sequence X is defined as:

P(Y | X) = (1 / Z(X)) · exp( Σ_t ψ(y_{t−1}, y_t, X, t) )

where Z(X) is a normalization factor and ψ is a potential function that captures the dependency between neighboring labels and the input features.
The potential function can be further expanded as:

ψ(y_{t−1}, y_t, X, t) = W_{y_{t−1}, y_t} · f(X, t) + b_{y_{t−1}, y_t}

where W_{y_{t−1}, y_t} and b_{y_{t−1}, y_t} are the transition-related weight and bias for transferring from label y_{t−1} to label y_t, and f(X, t) represents the feature vector extracted from the input at position t.
In one embodiment, the feature function f(X, t) combines the word embedding vector and the position encoding, and can be expressed as:

f(X, t) = [e_t ; p_t]

where e_t denotes the embedding vector of the word at position t, p_t denotes the encoding vector of position t, and [· ; ·] denotes vector concatenation.
Further, for a particular state transition weight W_{y_{t−1}, y_t} and bias b_{y_{t−1}, y_t}, the update strategy is:

W ← W − η · ∂L_CRF/∂W,  b ← b − η · ∂L_CRF/∂b

where η is the learning rate and L_CRF is the negative log-likelihood loss of the conditional random field layer, reflecting the difference between the real label sequence and the model's predicted sequence.
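The conditional probability P(Y | X) with its normalization factor Z(X) can be sketched for a small linear-chain CRF, computing Z(X) with the forward algorithm; the random emission scores stand in for the feature-function terms ψ:

```python
import numpy as np

def crf_log_z(emissions, transitions):
    # forward algorithm: log partition function Z(X) of a linear-chain CRF
    alpha = emissions[0]
    for t in range(1, len(emissions)):
        alpha = emissions[t] + np.log(np.exp(alpha[:, None] + transitions).sum(axis=0))
    return np.log(np.exp(alpha).sum())

def crf_log_prob(tags, emissions, transitions):
    # log P(Y | X): path score minus log Z(X)
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score - crf_log_z(emissions, transitions)

rng = np.random.default_rng(10)
T, K = 4, 3                                   # sequence length, number of entity labels
emissions = rng.standard_normal((T, K))       # per-position feature scores
transitions = rng.standard_normal((K, K))     # label-transition weights
print(crf_log_prob([0, 1, 1, 2], emissions, transitions) <= 0.0)  # True
```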
3) In the framework of a structured deep learning model, the parameters of the self-encoder and the conditional random field model are jointly optimized in combination with the reconstruction loss of the self-encoder and the log likelihood of the correct tag sequence in the conditional random field model.
Under the framework of the structured deep learning model, the self-encoder and the conditional random field model parameters are jointly optimized, ensuring the accuracy of entity label prediction while reflecting the dependency relationships among entities. The self-encoder here corresponds to the feature dimension reduction model described above.
Specifically, the training objective combines the reconstruction loss of the deep learning framework and the log-likelihood of the correct label sequence in the conditional random field model, and can be expressed as:

L_total = (1/N) · Σ_{i=1}^{N} [ L_recon(x_i, x̂_i) − λ · log P(Y_i | X_i) ]

where N is the number of training samples, λ is a regularization parameter balancing the two loss parts, and Y_i and X_i are the real label sequence and input sequence of the i-th instance respectively.
In one embodiment, the deep learning framework is the self-encoder network used by the data dimension reduction model corresponding to step S214. For the encoder weights W_enc and the decoder weights W_dec in the self-encoder, the invention adopts a gradient-descent-based optimization method combined with the back-propagation algorithm to update the parameters.
Specifically, the update to the encoder weights can be expressed as:

W_enc ← W_enc − η · ∂L/∂W_enc

where η is the learning rate and L is the loss function, expressed as the mean squared error between the input x and the reconstructed output x̂; the update of the decoder weights W_dec follows a similar procedure.
4) And performing performance evaluation on the self-encoder and the conditional random field model, and constructing a label prediction model based on the self-encoder and the conditional random field model meeting the performance evaluation requirements.
The performance of the model is evaluated through cross-validation and independent test sets, so that the model is ensured to have good generalization capability on unseen archives.
In summary, according to the archive knowledge extraction method based on the structure priori knowledge provided by the embodiment of the invention, through combining the extraction of the structural features and the font features of the Chinese characters, the deep learning model and the VGG16 network are adopted to accurately extract the radicals, the structural features and the high-dimensional features of the font of the Chinese characters, so that the semantic information of the Chinese characters can be more comprehensively captured, and more rich and accurate feature representation is provided for entity identification. By combining the detailed extraction of the structural features and the font features of the Chinese characters, the invention can more accurately capture the semantic information of the Chinese characters and provide rich and accurate feature representation for entity identification. Particularly, when a Chinese document with a complex structure is processed, the accuracy of entity identification can be remarkably improved.
In addition, through the innovative application of the feature fusion model, the invention adopts a Transformer algorithm based on cracked features and uses Chebyshev polynomials to improve and optimize the feature fusion process. This not only improves the algorithm's ability to handle complex data structures, but also effectively addresses the long-distance dependency problem in sequence data and enhances the model's ability to capture the interaction and complementarity between features, thereby improving the generalization ability of the model.
The invention also improves the whale optimization algorithm, so that the whale optimization algorithm can search the feature space on a plurality of granularities and capture the data features of different layers. By combining a bidirectional enhancement mechanism, the semantic understanding of the features is deepened, the searching and extracting strategies of the features are optimized, so that the algorithm can more comprehensively understand the data structure, the accuracy of feature extraction and the self-adaptive capacity of a model are improved, and the features which are more beneficial to entity identification are extracted from the data.
Furthermore, the embodiment of the invention allows the self-encoder to learn and retain key interaction information among different features while compressing the features by adopting a potential interaction compression mechanism, thereby effectively reducing information loss in a high-dimensional feature space and improving processing efficiency. Moreover, the capture capability of the model on nonlinear relations among the features is increased, and richer information is provided for the classifier.
Furthermore, the embodiment of the invention also combines the deep learning framework of the structure priori knowledge and the conditional random field model to form a structured deep learning model, which can effectively utilize the dependency relationship among the entities and the context information thereof, and particularly remarkably improve the accuracy and the robustness of entity identification in a structuring manner when processing complex entity relationship and rich context information.
Further, on the basis of the above method embodiment, the embodiment of the present invention further provides a structure priori knowledge-based archival knowledge extraction device, and fig. 4 shows a schematic structural diagram of the structure priori knowledge-based archival knowledge extraction device provided by the embodiment of the present invention, as shown in fig. 4, the structure priori knowledge-based archival knowledge extraction device includes: a data acquisition module 100 for acquiring a target document; the feature extraction module 200 is configured to perform multi-feature extraction on the target document to obtain multi-feature information in the target document; the multi-feature information includes structural features and glyph features; the feature fusion module 300 is configured to input the multi-feature information into a pre-constructed feature fusion model, perform feature fusion on the multi-feature information based on feature correlation of the multi-feature extraction features, and generate fusion features; the preprocessing module 400 is used for extracting features of the fusion features and reducing the dimension of the data to obtain key information in the fusion features; the output module 500 is configured to perform entity identification on the key information through a pre-constructed label prediction model, and output key entity information in the target document; the tag prediction model is constructed based on a priori knowledge of the structure incorporating the deep feature representation.
The archive knowledge extraction device based on structure priori knowledge provided by the embodiment of the present invention has the same technical features as the archive knowledge extraction method based on structure priori knowledge provided by the above embodiment, so it can solve the same technical problems and achieve the same technical effects.
Further, on the basis of the above embodiment, the embodiment of the present invention further provides another archive knowledge extraction device based on structure priori knowledge, where the feature extraction module 200 is further configured to: input the target document into a pre-constructed multi-feature extraction model, and perform multi-feature extraction on the target document through the multi-feature extraction model to obtain multi-feature information in the target document. The multi-feature extraction model includes a structural feature extraction model and a font feature extraction model; the structural feature extraction model is used for extracting structural features of the characters in the target document, and the font feature extraction model is used for extracting font features of the characters in the target document.
The feature extraction module 200 is further configured to: obtain a pre-collected Chinese archive and collect the Chinese character data in it; label the Chinese character data according to its Chinese character features to obtain data labels, where the Chinese character features include the radicals and fonts of the Chinese character data; split the Chinese character data according to the radical structure and extract structural features using a first convolutional neural network to determine a structural feature representation; capture font detail features of the Chinese character data through a VGG16 network; splice the structural feature representation and the font detail features into a comprehensive feature vector, train a second convolutional neural network model with the comprehensive feature vector and the data labels, and optimize the model parameters of the second convolutional neural network model with an Adam optimizer; and construct the multi-feature extraction model based on the trained second convolutional neural network model.
The above feature extraction module 200 is further configured to: determine a context feature corresponding to the Chinese character data according to the structural feature representation and the font detail features; evaluate, by a function (the formula is not reproduced in the text), the relevance of the structural feature representation and of the font detail features to the context feature; calculate feature weights corresponding to the structural feature representation and the font detail features based on the relevance; and splice the structural feature representation and the font detail features into a comprehensive feature vector based on the feature weights.
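The relevance-weighted splicing described above can be sketched as follows. The evaluation function is not reproduced in the text, so a softmax over dot-product relevance scores is assumed here; all function names are illustrative, not taken from the embodiment:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of relevance scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relevance(feature, context):
    # Hypothetical relevance measure: dot product with the context vector.
    return sum(f * c for f, c in zip(feature, context))

def weighted_concat(structural, glyph, context):
    # Score each feature against the context, turn the scores into weights,
    # scale each vector by its weight, and splice them into one vector.
    w_s, w_g = softmax([relevance(structural, context),
                        relevance(glyph, context)])
    return [w_s * v for v in structural] + [w_g * v for v in glyph]
```

With this choice the two weights always sum to one, so the comprehensive vector preserves the total scale of the two inputs.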
The feature fusion module 300 is further configured to: perform feature splitting processing on the multi-feature information to obtain a fine-grained sub-feature set; capture the dynamic correlations corresponding to the sub-feature set with a Transformer structure to obtain an integrated feature; optimize the integrated feature with Chebyshev polynomials, based on the distortion and information loss of the integrated feature in a high-dimensional space, to obtain an optimized feature; and calculate importance scores for the optimized feature and the multi-feature information respectively, fusing the two based on the importance scores to obtain the fusion features.
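The Chebyshev-polynomial optimization step can be sketched as follows. The three-term recurrence is standard; the coefficient choice and the `chebyshev_refine` name are illustrative assumptions, and the features are assumed to be pre-scaled to [-1, 1]:

```python
def chebyshev_basis(x, degree):
    # Chebyshev polynomials of the first kind:
    # T_0(x) = 1, T_1(x) = x, T_n(x) = 2x*T_{n-1}(x) - T_{n-2}(x).
    basis = [1.0, x]
    for _ in range(2, degree + 1):
        basis.append(2 * x * basis[-1] - basis[-2])
    return basis[:degree + 1]

def chebyshev_refine(feature, coeffs):
    # Re-express each component of the integrated feature as a truncated
    # Chebyshev expansion, damping high-order distortion terms.
    degree = len(coeffs) - 1
    return [sum(c * t for c, t in zip(coeffs, chebyshev_basis(v, degree)))
            for v in feature]
```

Truncating the expansion at a low degree acts as a smooth, bounded re-parameterization of the feature space.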
The preprocessing module 400 is further configured to: calculate a recalibration weight and a potential-dimension task relevance metric for the key information; adjust the key information according to the recalibration weight and the potential-dimension task relevance metric to obtain recalibrated information; and determine the recalibrated information as the key information of the fusion features.
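The recalibration step can be sketched in the style of squeeze-and-excitation gating. The exact adjustment rule is not given in the text, so the sigmoid gate and the multiplicative combination below are assumptions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def recalibrate(key_info, raw_weights, task_relevance):
    # Each latent dimension of the key information is rescaled by a gate
    # combining its recalibration weight (squashed through a sigmoid) with
    # that dimension's task relevance score.
    gates = [sigmoid(w) * r for w, r in zip(raw_weights, task_relevance)]
    return [g * v for g, v in zip(gates, key_info)]
```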
Further, in the embodiment of the present invention, feature extraction is performed on the fusion features through a pre-constructed feature extraction model, and the preprocessing module 400 is further configured to: perform preliminary feature extraction on a preset training sample set at multiple granularities using an improved whale optimization algorithm, which searches the feature space for an optimal feature subset by simulating the social behaviors and predation strategies of whales; input the optimal feature subset into a deep learning model for semantic deepening to obtain enhanced features; calculate the loss corresponding to the enhanced features and adjust the search strategy of the whale optimization algorithm according to the loss; and, once the loss reaches a preset threshold, construct the feature extraction model based on the whale optimization algorithm and the deep learning model.
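The whale-optimization search can be sketched as follows. Only the "encircling prey" update of the standard algorithm is shown; the spiral bubble-net move, the random-search phase, and whatever improvements the embodiment applies are omitted, and all parameter values are illustrative:

```python
import random

def woa_search(fitness, dim, n_whales=10, iters=60, seed=1):
    # Minimize `fitness` over R^dim with a toy whale optimization loop.
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
           for _ in range(n_whales)]
    best = list(min(pop, key=fitness))            # elitist copy of the leader
    for t in range(iters):
        a = 2.0 - 2.0 * t / iters                 # a shrinks linearly 2 -> 0
        for whale in pop:
            for d in range(dim):
                A = a * (2.0 * rng.random() - 1.0)
                C = 2.0 * rng.random()
                # Encircling prey: move each whale toward the current leader.
                whale[d] = best[d] - A * abs(C * best[d] - whale[d])
        candidate = min(pop, key=fitness)
        if fitness(candidate) < fitness(best):    # keep the best ever seen
            best = list(candidate)
    return best
```

In the embodiment the fitness would score a candidate feature subset; here any scalar objective works.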
Further, in the embodiment of the present invention, data dimension reduction is performed on the fusion features through a pre-constructed feature dimension reduction model, and the preprocessing module 400 is further configured to: customize an encoder and a decoder of a self-encoder; obtain a preset training sample set and train the encoder and decoder of the self-encoder with it; optimize the parameters of the self-encoder through a preset potential interaction learning loss function, introducing feature interaction items into the latent space so that features are reconstructed based on the feature interaction items, where the potential interaction learning loss function includes a reconstruction error, weighted by feature importance, and a regularization term; and construct the feature dimension reduction model based on the trained self-encoder.
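The described loss can be sketched as follows. The text only states that the reconstruction error is importance-weighted and that a regularization term and latent feature interaction items exist; interpreting the interaction items as pairwise products of latent coordinates is an assumption:

```python
def potential_interaction_loss(x, x_hat, z, importance, lam=0.01):
    # Importance-weighted squared reconstruction error over the input x and
    # its reconstruction x_hat, plus an L2 regularizer over pairwise products
    # of latent coordinates z (the assumed "feature interaction items").
    recon = sum(w * (a - b) ** 2 for w, a, b in zip(importance, x, x_hat))
    pairs = [z[i] * z[j]
             for i in range(len(z)) for j in range(i + 1, len(z))]
    reg = lam * sum(p * p for p in pairs)
    return recon + reg
```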
The output module 500 is further configured to obtain a preset training sample set; the training sample set comprises a document sample and a sample label corresponding to the document sample, wherein the sample label is used for representing an entity corresponding to the document sample; learning deep feature representation of a training sample set through a self-encoder, and merging the deep feature representation into a conditional random field model to predict a structured entity tag; under the framework of a structured deep learning model, combining the reconstruction loss of the self-encoder and the log likelihood of a correct tag sequence in a conditional random field model to jointly optimize the parameters of the self-encoder and the conditional random field model; performing performance evaluation on the self-encoder and the conditional random field model; a label prediction model is constructed based on the self-encoder and conditional random field model that meet performance evaluation requirements.
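The joint optimization of the self-encoder reconstruction loss and the conditional random field log-likelihood can be sketched as follows, using a minimal linear-chain CRF; the function names and the combination weight `beta` are illustrative assumptions, not taken from the text:

```python
import math

def crf_log_likelihood(emissions, transitions, tags):
    # Linear-chain CRF: log p(tags | x), where the sequence score sums
    # emission and transition terms and is normalised by the partition
    # function, computed with the forward algorithm in the log domain.
    n_tags = len(emissions[0])
    score = emissions[0][tags[0]]
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
        alpha = [emissions[t][j] + math.log(sum(
                     math.exp(alpha[i] + transitions[i][j])
                     for i in range(n_tags)))
                 for j in range(n_tags)]
    log_z = math.log(sum(math.exp(a) for a in alpha))
    return score - log_z

def joint_loss(reconstruction_error, emissions, transitions, tags, beta=1.0):
    # Joint objective: self-encoder reconstruction error minus beta times
    # the log-likelihood of the correct tag sequence under the CRF.
    return reconstruction_error - beta * crf_log_likelihood(
        emissions, transitions, tags)
```

Minimizing this single scalar jointly updates both the self-encoder (through the reconstruction term) and the CRF parameters (through the likelihood term).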
The embodiment of the invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the method shown in the figures 1 to 3.
The embodiments of the present invention also provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor performs the steps of the method shown in fig. 1 to 3 described above.
The embodiment of the present invention further provides a schematic structural diagram of an electronic device, as shown in fig. 5, where the electronic device includes a processor 51 and a memory 50, where the memory 50 stores computer executable instructions that can be executed by the processor 51, and the processor 51 executes the computer executable instructions to implement the methods shown in fig. 1 to 3.
In the embodiment shown in fig. 5, the electronic device further comprises a bus 52 and a communication interface 53, wherein the processor 51, the communication interface 53 and the memory 50 are connected by the bus 52.
The memory 50 may include a high-speed random access memory (RAM), and may further include a non-volatile memory, such as at least one magnetic disk memory. The communication connection between the system network element and at least one other network element is implemented via at least one communication interface 53 (which may be wired or wireless), using the Internet, a wide area network, a local network, a metropolitan area network, etc. The bus 52 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, etc., or an AMBA (Advanced Microcontroller Bus Architecture) on-chip bus, where AMBA defines three types of buses: the APB (Advanced Peripheral Bus), the AHB (Advanced High-performance Bus), and the AXI (Advanced eXtensible Interface) bus. The bus 52 may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one bi-directional arrow is shown in FIG. 5, but this does not mean that there is only one bus or only one type of bus.
The processor 51 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 51 or by instructions in the form of software. The processor 51 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc. The steps of the method disclosed in the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor 51 reads the information in the memory and, in combination with its hardware, performs the method shown in any of the foregoing figs. 1 to 3.
The embodiment of the present invention further provides a computer program product of the archive knowledge extraction method and device based on structure priori knowledge, including a computer-readable storage medium storing program code; the program code includes instructions for executing the method described in the foregoing method embodiments. For the specific implementation, reference may be made to the method embodiments, which are not repeated here.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding process in the foregoing method embodiment for the specific working process of the above-described system, which is not described herein again.
Finally, it should be noted that the above embodiments are only specific implementations of the present invention, used to illustrate its technical solution rather than to limit it, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, easily conceive of changes, or make equivalent substitutions of some of the technical features; such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. The archive knowledge extraction method based on the structure priori knowledge is characterized by comprising the following steps of:
Acquiring a target document;
Extracting multiple features of the target document to obtain multiple feature information in the target document; the multi-feature information comprises structural features and font features;
Inputting the multi-feature information into a pre-constructed feature fusion model, and carrying out feature fusion on the multi-feature information based on the feature correlation of the multi-feature information to generate fusion features;
Performing feature extraction and data dimension reduction on the fusion features to obtain key information in the fusion features;
Performing entity identification on the key information through a pre-constructed label prediction model, and outputting key entity information in the target document; the label prediction model is constructed based on structure priori knowledge incorporating a deep feature representation.
2. The method of claim 1, wherein the step of performing multi-feature extraction on the target document to obtain multi-feature information in the target document comprises:
Inputting the target document into a pre-constructed multi-feature extraction model, and performing multi-feature extraction on the target document through the multi-feature extraction model to obtain multi-feature information in the target document;
The multi-feature extraction model comprises a structural feature extraction model and a font feature extraction model; the structural feature extraction model is used for extracting structural features of characters in the target document, and the font feature extraction model is used for extracting font features of the characters in the target document.
3. The method according to claim 2, wherein the method for constructing the multi-feature extraction model comprises:
Acquiring a Chinese file collected in advance, and collecting Chinese character data in the Chinese file;
labeling the Chinese character data according to the Chinese character characteristics of the Chinese character data to obtain a data label; the Chinese character features comprise radicals and fonts of the Chinese character data;
Splitting the Chinese character data according to a radical structure and extracting structural features by using a first convolutional neural network to determine structural feature representation; capturing the character form detail characteristics of the Chinese character data through a VGG16 network;
Splicing the structural feature representation and the font detail feature into a comprehensive feature vector;
Training a second convolutional neural network model through the comprehensive feature vector and the data tag, and optimizing model parameters of the second convolutional neural network model by adopting an Adam optimizer;
And constructing a multi-feature extraction model based on the trained second convolutional neural network model.
4. A method according to claim 3, characterized in that the step of stitching the structural feature representation and the font detail feature into a comprehensive feature vector comprises:
Determining the context characteristics corresponding to the Chinese character data according to the structural characteristic representation and the font detail characteristics;
Evaluating, by a function (the formula is not reproduced in the text), the relevance of the structural feature representation and of the font detail feature, respectively, to the contextual feature;
Calculating feature weights respectively corresponding to the structural feature representation and the font detail features based on the correlation;
And splicing the structural feature representation and the font detail feature into a comprehensive feature vector based on the feature weight.
5. The method according to claim 1, wherein the step of inputting the multi-feature information into a pre-built feature fusion model, performing feature fusion on the multi-feature information based on the feature correlation of the multi-feature information, and generating fusion features comprises:
Performing feature splitting processing on the multi-feature information to obtain a fine-grained sub-feature set;
Capturing the dynamic correlation corresponding to the sub-feature set by adopting a Transformer structure to obtain an integrated feature;
optimizing the integrated feature by using a chebyshev polynomial based on the distortion and information loss of the integrated feature in a high-dimensional space to obtain an optimized feature;
And calculating importance scores corresponding to the optimized features and the multi-feature information respectively, and fusing the optimized features and the multi-feature information based on the importance scores to obtain fusion features.
6. The method according to claim 1, wherein the method further comprises:
Calculating recalibration weight and potential dimension task correlation measurement of the key information;
Adjusting the key information according to the recalibration weight and the potential dimension task correlation measurement to obtain recalibration information;
and determining the recalibration information as key information of the fusion characteristic.
7. The method according to claim 1, wherein the feature extraction is performed on the fused features by a pre-built feature extraction model, and the method for constructing the feature extraction model comprises:
preliminary feature extraction is carried out on a preset training sample set on a plurality of granularities by using an improved whale optimization algorithm, and an optimal feature subset is searched in a feature space by simulating social behaviors and predation strategies of whales;
inputting the optimal feature subset into a deep learning model for semantic deepening to obtain enhanced features;
Calculating the loss corresponding to the enhancement features, and adjusting the searching strategy of the whale optimization algorithm according to the loss;
And constructing a feature extraction model based on the whale optimization algorithm and the deep learning model until the loss reaches a preset threshold.
8. The method according to claim 1, wherein the data dimension reduction is performed on the fused features by a pre-constructed feature dimension reduction model, and the method for constructing the feature dimension reduction model comprises the following steps:
Customizing an encoder and a decoder of a self-encoder;
Acquiring a preset training sample set, and training an encoder and the decoder of the self-encoder through the training sample set;
Optimizing parameters of the self-encoder through a preset potential interaction learning loss function, and introducing a feature interaction item into a potential space so as to reconstruct features based on the feature interaction item; the potential interactive learning loss function comprises a reconstruction error and a regularization term, wherein the reconstruction error is obtained by weighting calculation based on the importance of the features;
and constructing a feature dimension reduction model based on the trained self-encoder.
9. The method of claim 1, wherein the method of constructing the tag prediction model comprises:
Acquiring a preset training sample set; the training sample set comprises a document sample and a sample label corresponding to the document sample, wherein the sample label is used for representing an entity corresponding to the document sample;
Learning a deep feature representation of the training sample set through a self-encoder, and merging the deep feature representation into a conditional random field model to perform structured entity tag prediction;
Under the framework of a structured deep learning model, combining the reconstruction loss of the self-encoder and the log likelihood of a correct tag sequence in a conditional random field model to jointly optimize parameters of the self-encoder and the conditional random field model;
Performing a performance evaluation on the self-encoder and the conditional random field model;
a label prediction model is constructed based on the self-encoder and conditional random field model that meet performance evaluation requirements.
10. An archival knowledge extraction device based on structure priori knowledge, the device comprising:
The data acquisition module is used for acquiring a target document;
the feature extraction module is used for extracting multiple features of the target document to obtain multiple feature information in the target document; the multi-feature information comprises structural features and font features;
The feature fusion module is used for inputting the multi-feature information into a pre-constructed feature fusion model, and carrying out feature fusion on the multi-feature information based on the feature correlation of the multi-feature information to generate fusion features;
The preprocessing module is used for carrying out feature extraction and data dimension reduction on the fusion features to obtain key information in the fusion features;
the output module is used for carrying out entity identification on the key information through a pre-constructed label prediction model and outputting the key entity information in the target document; the label prediction model is constructed based on structure priori knowledge incorporating a deep feature representation.
CN202410592269.7A 2024-05-14 2024-05-14 File knowledge extraction method and device based on structure priori knowledge Active CN118170836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410592269.7A CN118170836B (en) 2024-05-14 2024-05-14 File knowledge extraction method and device based on structure priori knowledge

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410592269.7A CN118170836B (en) 2024-05-14 2024-05-14 File knowledge extraction method and device based on structure priori knowledge

Publications (2)

Publication Number Publication Date
CN118170836A true CN118170836A (en) 2024-06-11
CN118170836B CN118170836B (en) 2024-09-13

Family

ID=91360808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410592269.7A Active CN118170836B (en) 2024-05-14 2024-05-14 File knowledge extraction method and device based on structure priori knowledge

Country Status (1)

Country Link
CN (1) CN118170836B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118568263A (en) * 2024-07-31 2024-08-30 山东能源数智云科技有限公司 Electronic archive intelligent classification method and device based on deep learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method
CN114528411A (en) * 2022-01-11 2022-05-24 华南理工大学 Automatic construction method, device and medium for Chinese medicine knowledge graph
WO2022134164A1 (en) * 2020-12-24 2022-06-30 科大讯飞股份有限公司 Translation method, apparatus and device, and storage medium
CN115098637A (en) * 2022-06-29 2022-09-23 中译语通科技股份有限公司 Text semantic matching method and system based on Chinese character shape-pronunciation-meaning multi-element knowledge
CN115687634A (en) * 2022-09-06 2023-02-03 华中科技大学 Financial entity relationship extraction system and method combining priori knowledge
CN115858825A (en) * 2023-03-02 2023-03-28 山东能源数智云科技有限公司 Equipment fault diagnosis knowledge graph construction method and device based on machine learning
CN117290489A (en) * 2023-11-24 2023-12-26 烟台云朵软件有限公司 Method and system for quickly constructing industry question-answer knowledge base


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Span-Based Fine-Grained Entity-Relation Extraction via Sub-Prompts Combination", APPLIED SCIENCES-BASEL, vol. 13, no. 2, 10 February 2023 (2023-02-10) *
ZHANG Xinyi; FENG Shimin; DING Enjie: "An entity recognition and relation extraction model for coal mines", Journal of Computer Applications, no. 08, 15 July 2020 (2020-07-15) *


Also Published As

Publication number Publication date
CN118170836B (en) 2024-09-13

Similar Documents

Publication Publication Date Title
CN107516110B (en) Medical question-answer semantic clustering method based on integrated convolutional coding
CN110413785B (en) Text automatic classification method based on BERT and feature fusion
CN111191526B (en) Pedestrian attribute recognition network training method, system, medium and terminal
CN108959482B (en) Single-round dialogue data classification method and device based on deep learning and electronic equipment
CN118170836B (en) File knowledge extraction method and device based on structure priori knowledge
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN107220506A (en) Breast cancer risk assessment analysis system based on deep convolutional neural network
CN111127146B (en) Information recommendation method and system based on convolutional neural network and noise reduction self-encoder
CN113220886A (en) Text classification method, text classification model training method and related equipment
CN109389151A (en) A kind of knowledge mapping treating method and apparatus indicating model based on semi-supervised insertion
Wu et al. Optimized deep learning framework for water distribution data-driven modeling
Veness et al. Online learning with gated linear networks
CN112699243B (en) Method for rolling network text based on French chart method and medium for classifying cases and documents
CN111241271B (en) Text emotion classification method and device and electronic equipment
CN117851921B (en) Equipment life prediction method and device based on transfer learning
CN112347245A (en) Viewpoint mining method and device for investment and financing field mechanism and electronic equipment
CN117351550A (en) Grid self-attention facial expression recognition method based on supervised contrast learning
CN109508640A (en) Crowd emotion analysis method and device and storage medium
CN112766339A (en) Trajectory recognition model training method and trajectory recognition method
Cao et al. A dual attention model based on probabilistically mask for 3D human motion prediction
CN116956228A (en) Text mining method for technical transaction platform
CN113408721A (en) Neural network structure searching method, apparatus, computer device and storage medium
CN117115600A (en) No-reference image quality evaluation method and device and electronic equipment
CN111325027B (en) Sparse data-oriented personalized emotion analysis method and device
Liu et al. STDNet: Rethinking Disentanglement Learning With Information Theory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant