CN114462384A

CN114462384A - Metadata automatic generation device for digital object modeling

Info

Publication number: CN114462384A
Application number: CN202210380242.2A
Authority: CN
Inventors: 黄罡; 杨婧如; 姜海鸥; 景翔; 柳熠; 蔡华谦; 郭京申; 刁兴春
Original assignee: Beijing Big Data Advanced Technology Research Institute; Peking University
Current assignee: Beijing Big Data Advanced Technology Research Institute; Peking University
Priority date: 2022-04-12
Filing date: 2022-04-12
Publication date: 2022-05-10
Anticipated expiration: 2042-04-12
Also published as: CN114462384B

Abstract

The invention discloses an automatic metadata generation device for digital object modeling and relates to the technical field of digital objects. The device supports automatically extracting metadata that conforms to a relevant standard (such as the Dublin Core standard) from the data resource description documents or original data resource files used for digital object modeling, forming the metadata part of a digital object so that the digital object can be modeled automatically. The device is provided with an automatic classification module, a keyword extraction module, an abstract extraction module, a data attribute extraction module, a time information extraction module and a region information extraction module, which automatically generate the common attribute metadata of the digital object, and with an other metadata extraction module, which automatically generates other extended attribute metadata of the digital object.

Description

Metadata automatic generation device for digital object modeling

Technical Field

The invention relates to the technical field of digital objects, in particular to a metadata automatic generation device for modeling of a digital object.

Background

The digital object architecture describes the data resources of the internet in a unified way through digital objects, standardizes data interaction through two protocols, the Digital Object Interface Protocol (DOIP) and the Digital Object Identifier Resolution Protocol (DO-IRP, or IRP), and, based on an open software architecture formed by three core systems, interconnects data that are heterogeneous, stored in different locations and held by different owners. The three core systems are a digital object repository system, a digital object registry system and a digital object identification system. The basic model of the digital object architecture is the digital object model, in which each digital object consists of three parts: an identifier, metadata and a data ontology. The metadata is a structured description of the content and attribute characteristics of the data ontology, and is used to discover, evaluate and manage digital objects. The digital object registry system is mainly responsible for managing the metadata of accessed resources and for providing functions such as metadata search, classification and cataloging. After a resource is accessed, metadata must be filled in or generated for it to complete the modeling of the digital object, so generating metadata is a critical task in digital object modeling.

However, on the one hand, to make digital objects easier to discover, many fields require extensive metadata to be filled in when a digital object is modeled. On the other hand, as digital object systems are adopted by more business departments and fields, the volume of data grows sharply and the need to model digital objects in batches keeps rising, so the demand for automatically generating metadata for digital object modeling grows by the day. Meanwhile, different service scenarios and service requirements adopt different metadata standards. Together, these place new demands on the generality, extensibility and adaptability of automatic metadata generation technology for digital object modeling.

Disclosure of Invention

Embodiments of the present invention provide an apparatus for automatically generating metadata for modeling a digital object, so as to implement automatic generation of a metadata portion for modeling a digital object, thereby overcoming one or more of the above-mentioned problems.

In order to solve the above problems, an embodiment of the present invention discloses an automatic metadata generation device for digital object modeling, including:

the automatic classification module is used for extracting, from the original data of the digital object and based on a predetermined standard, the metadata whose metadata item is "type"; the original data is a data resource description document and/or an original data resource file received in advance for modeling the digital object;

the keyword extraction module is used for extracting, from the original data, the metadata whose metadata item is "subject";

the abstract extraction module is used for extracting, from the original data, the metadata whose metadata item is "description";

the data attribute extraction module is used for extracting, from the original data, the metadata whose metadata items are "format" and "date";

the time information extraction module is used for extracting, from the original data, the time-type metadata whose metadata item is "coverage";

the region information extraction module is used for extracting, from the original data, the region-type metadata whose metadata item is "coverage";

and the other metadata extraction module is used for extracting, from the original data, the metadata of other metadata items.

Optionally, the metadata whose metadata item is "type" includes a category name of the digital object; the automatic classification module comprises:

the text vectorization submodule is used for mapping each pre-obtained user-defined classification option and the digital object description abstract in the original data into a unified vector space, and generating embeddings of the same dimension for the classification options and the description abstract;

and the similarity calculation submodule is used for calculating the cosine similarity between the embedding of each classification option and the embedding of the digital object description abstract, and taking the classification option with the highest cosine similarity as the category name of the digital object.

Optionally, the keyword extraction module includes:

a first keyword extraction sub-module, configured to extract, from the original data, a plurality of first candidate keywords for describing the subject of the digital object based on the term frequency-inverse document frequency (TF-IDF) algorithm, and calculate a weight of each of the first candidate keywords;

a second keyword extraction sub-module, configured to extract, from the original data, a plurality of second candidate keywords for describing the subject of the digital object based on the TextRank text ranking algorithm, and calculate a weight of each of the second candidate keywords;

and the keyword calculation sub-module is used for computing a weighted average of the weights of the first candidate keywords and the weights of the second candidate keywords, and taking the K keywords with the largest resulting weights as the metadata describing the subject of the digital object.

Optionally, the region information extracting module includes:

the data set construction submodule is used for acquiring geographic information from a pre-selected geographic information service application interface through a crawler technology and constructing a geographic information data set;

the part-of-speech recognition submodule is used for segmenting the text in the original data and then recognizing target words with the part-of-speech being place names and transliterated place names from the segmented text;

and the semantic matching submodule is used for performing semantic matching on the target words and the plurality of geographic information in the geographic information data set, determining the region information of the digital object from the target words, and taking the region information as the region metadata of the digital object.

Optionally, the other metadata extraction module includes a semantic function extraction submodule and a custom rule extraction submodule, where:

the semantic function extraction submodule is used for extracting, from the original data, first information that is semantically similar to a target other metadata item, and taking the first information as the metadata of the target other metadata item;

and the custom rule extraction submodule is used for extracting, from the original data and based on a rule defined in advance by the user, second information whose semantic or structural characteristics are similar to those of the target other metadata item, and taking the second information as the metadata of the target other metadata item.

Optionally, the semantic function extracting sub-module includes:

a key-value format document extraction unit for extracting a key-value format document from the original data;

and the semantic similarity calculation unit is used for calculating, for the key-value format document, the semantic similarity between each key in the document and the Chinese name, English name, alias and definition of the target other metadata item, taking the key whose semantic similarity is the largest and exceeds a preset threshold as the item matching the target other metadata item, and taking the value corresponding to that key as the metadata of the target other metadata item.
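The matching rule of the semantic similarity calculation unit (take the key with the largest similarity, provided it exceeds a preset threshold) can be sketched as follows. This is only an illustration under stated assumptions: the patent does not name a similarity function here, so character-level similarity from Python's standard `difflib` is swapped in for true semantic similarity, and the function names, the sample document and the threshold of 0.6 are hypothetical.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for semantic similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_metadata_item(item_names, document, threshold=0.6):
    """Return the value of the best-matching key, or None if nothing passes.

    item_names: names of the target metadata item (Chinese name, English
                name, aliases); hypothetical input shape.
    document:   a key-value format document represented as a dict.
    """
    best_key, best_score = None, 0.0
    for key in document:
        # A key is scored by its best match against any name of the item.
        score = max(string_similarity(name, key) for name in item_names)
        if score > best_score:
            best_key, best_score = key, score
    # The winning key must also exceed the preset threshold.
    if best_key is None or best_score <= threshold:
        return None
    return document[best_key]


# Hypothetical key-value document extracted from a description file.
DOCUMENT = {"Author": "Zhang San", "Lang": "zh", "Size": "12 MB"}

creator_value = match_metadata_item(["创建者", "creator", "author"], DOCUMENT)
rights_value = match_metadata_item(["权限", "rights"], DOCUMENT)
```

In a real deployment the stand-in `string_similarity` would be replaced by a semantic measure, for example cosine similarity between sentence embeddings, without changing the selection logic.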

Optionally, the key-value format document extracting unit includes:

a direct extraction subunit, configured to directly extract the key-value format document in the original data;

the table extraction subunit is used for identifying the row names and/or column names of the tables in the data resource related description document, using the identified row names and/or column names as keys, and using the cell contents corresponding to the row names and/or column names as values to obtain a key-value format document;

and the unstructured text extraction subunit is used for segmenting the unstructured text in the data resource related description document, and then matching a plurality of key-value pairs in the segmented unstructured text by utilizing a semantic template to obtain a key-value format document.

Optionally, the custom rule extraction sub-module includes:

the semantic feature extraction unit is used for determining, according to a user-defined semantic feature extraction rule, that the value of the target other metadata item is the value corresponding to a key in the original data that is a target word, extracting the value corresponding to the target word from the original data, and using it as the metadata of the target other metadata item;

the structural feature extraction unit is used for determining, according to a user-defined visual feature extraction rule and character feature extraction rule, that the value of the target other metadata item is the value corresponding to a key in the original data that consists of target characters and has a target font format, extracting that value from the original data, and using it as the metadata of the target other metadata item;

and the knowledge extraction unit is used for generating a knowledge extraction rule from a knowledge base uploaded by the user, and extracting the specified metadata item information from the original data based on the knowledge extraction rule.

Optionally, the apparatus further comprises:

the extended metadata item self-adapting module is used for providing an operable interface of a configuration file of the digital object for a user, acquiring metadata items and related definitions added or extended by the user in the operable interface, and storing the added or extended metadata items and related definitions in the form of the configuration file.

Optionally, the apparatus further comprises:

an extensible metadata storage module, which includes a metadata schema storage table and a metadata storage table that are associated through a foreign key on the metadata item identifier; the metadata schema storage table is used for storing the metadata items of digital objects, and in it the metadata item identifier and the metadata item parent identifier use Huffman prefix coding; the metadata storage table is used for storing the metadata of each metadata item of the digital object;

the extensible metadata storage module is used for writing a target metadata item newly added to the digital object into the metadata schema storage table, and automatically filling the metadata of the target metadata item into the metadata schema storage table through the metadata item identifier of the target metadata item.

The embodiment of the invention has the following advantages:

The invention provides an automatic metadata generation device for digital object modeling. The device supports automatically extracting metadata that conforms to a relevant standard (such as the Dublin Core standard) from the data resource description documents or original data resource files used for digital object modeling, forming the metadata part of a digital object so that the digital object can be modeled automatically. The device can automatically generate both the common attribute metadata and other extended attribute metadata of a digital object. Common attribute metadata is generated by the automatic classification module, keyword extraction module, abstract extraction module, data attribute extraction module, time information extraction module and region information extraction module. Other extended attribute metadata is generated by the other metadata extraction module, which comprises a semantic-based metadata extraction module (i.e., the semantic function extraction submodule) and a custom-rule-based metadata extraction module (i.e., the custom rule extraction submodule). The device can therefore meet the requirements for generality, extensibility and adaptability of automatic metadata generation technology for digital object modeling, and enables automatic batch modeling of digital objects.

Furthermore, the device also comprises an extended metadata item self-adapting module and an extensible metadata storage module, so that, through the extended metadata item self-adapting module, the device supports the addition of digital object metadata items by users and their adaptive storage, the partitioning of digital object metadata for distributed storage, and the conversion of digital object metadata into an XML description form.

Drawings

FIG. 1 is a functional block diagram of an apparatus for automatically generating metadata for modeling digital objects according to an embodiment of the present invention;

FIG. 2 is a diagram of a text vectorization model in an automatic classification module according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of geographic information matching performed by the region information extraction module according to an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating the semantic function extraction sub-module performing metadata extraction according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

To solve the above technical problem, the present invention provides an automatic metadata generation device for digital object modeling. Referring to FIG. 1, which shows a functional block diagram of the device, the device may include:

the abstract extraction module is used for extracting metadata items from the original data as descriptive metadata;

and the other metadata extraction module is used for extracting the metadata of other metadata items from the original data.

The invention provides an automatic metadata generation device for digital object modeling that supports extracting common attribute metadata of digital objects conforming to a relevant standard (such as the Dublin Core standard) from the data resource description documents and/or original data resource files that users provide for digital object modeling. The device can parse data resource description documents in json, txt, csv, xml and pdf formats and extract metadata from them. Two output forms are available: relational database tables, and json and xml documents; metadata documents in json and xml formats can be read automatically by computer systems.

Because the Dublin Core standard is the most widely applied and generally recognized metadata standard, the following takes it as the predetermined standard and describes how each functional module of the device (namely the automatic classification module, keyword extraction module, abstract extraction module, data attribute extraction module, time information extraction module, region information extraction module and other metadata extraction module) is implemented. As shown in Table 1, the Dublin Core standard contains the following metadata items:

TABLE 1

Metadata item	Definition	Note
Title (Title)	Name given to the resource	Generally the name under which the resource object is formally published
Creator (Creator)	Entity primarily responsible for creating the resource content	Identified by the creator's name
Subject (Subject)	Description of the subject of the resource content	The subject of a particular resource may be described with keywords, classification numbers, or the like
Description (Description)	Description of the resource content	May include an abstract, a table of contents, a text description of graphics, and the like
Publisher (Publisher)	Entity responsible for making the resource available	May be an individual, an organization or a service
Other responsible parties (Contributor)	Other entities that contributed to the resource content	May be an individual, an organization or a service
Date (Date)	Time associated with an event in the resource lifecycle	The date should relate to the creation or publication of the resource
Type (Type)	Nature or type of the resource content	Includes terms describing the general category, function, genre or aggregation level of the resource content
Format (Format)	Physical or digital representation of the resource	Includes the media type or size of the resource; may be used to determine the software, hardware or other equipment needed to display or operate the resource
Identifier (Identifier)	Unambiguous identification of the resource within a given scope	A string or number conforming to a formal identification system is recommended
Source (Source)	Reference to the source of the current resource	The current resource may be derived wholly or partly from the resource identified by this element
Language (Language)	Language in which the resource content is described	RFC 3066 is recommended for the value of this element
Association (Relation)	Reference to a related resource	It is recommended to identify the referenced resource with a string or number conforming to a formal identification system
Coverage (Coverage)	Extent or scope of the resource content	Spatial location, time interval or administrative region
Rights (Rights)	Information about rights held in or over the resource	Includes intellectual property rights (IPR), copyright and other property rights

The functional modules correspond to the Dublin Core standard as follows: 1) the automatic classification module extracts the metadata whose metadata item is "type"; 2) the keyword extraction module extracts the metadata whose metadata item is "subject"; 3) the abstract extraction module extracts the metadata whose metadata item is "description"; 4) the data attribute extraction module extracts the metadata whose metadata items are "format" and "date"; 5) the time information extraction module extracts the time-type metadata whose metadata item is "coverage"; 6) the region information extraction module extracts the region-type metadata whose metadata item is "coverage"; 7) the other metadata extraction module can extract the remaining metadata in the Dublin Core standard, that is, the other metadata items may be title, creator, other responsible parties, identifier, source, language, association and rights. The other metadata extraction module may of course also extract metadata outside the Dublin Core standard, for example metadata for a newly added metadata item such as "privacy level".

In the device of the present invention, the functional modules have no ordering dependency on one another; each corresponds only to its metadata items in the Dublin Core standard.

In one embodiment of the present invention, the metadata whose metadata item is "type" includes a category name of the digital object; the automatic classification module may include:

As shown in Table 1, in the Dublin Core standard the metadata whose metadata item is "type" is the nature or type of the content of the digital object, which can be understood as the category of the digital object. For example, under an industry classification system, a digital object may be classified as "energy", "raw materials", "industry", "optional consumption", "major consumption", "medical health", "financial property", "information technology", "telecommunication services", "utilities", and so on.

The automatic classification module of this embodiment adopts an unsupervised multi-class classification method, whose framework is shown in FIG. 2. Each user-defined classification option and the digital object description abstract in the original data are input into a pre-trained text vectorization model. The model first maps each classification option and the description abstract into a unified vector space and generates embeddings of the same dimension; the cosine similarity between each classification option embedding and the description abstract embedding is then calculated, and the option with the highest similarity is taken as the "category name" of the digital object.

It should be noted that a conventional BERT model (Bidirectional Encoder Representations from Transformers, a pre-trained language representation model) determines whether two sentences are semantically similar by concatenating them and passing them through the model together, which is poorly suited to comparing many sentences. Finding the two most similar sentences in a set of n sentences requires n × (n − 1)/2 comparisons, each of which must be passed through the BERT model, which is very expensive. The conventional way of representing a single sentence with BERT is to take the output of the first [CLS] token, or to average all outputs; experiments show that this often yields poor-quality embeddings. The text vectorization model of this embodiment is implemented with SBert (Sentence-BERT, a sentence vector generation model). SBert fine-tunes the pre-trained BERT using the twin (siamese) network structure shown in FIG. 2 and updates the model parameters, so that the adjusted model represents the semantics of a sentence well and the cosine similarity of the generated sentence embeddings can be computed directly: the more similar two sentences are semantically, the closer their embedding vectors lie in the vector space.
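The selection step of the automatic classification module can be sketched as follows. This is a minimal illustration, not the patented implementation: the three-dimensional vectors are toy stand-ins for SBert sentence embeddings, and the names `cosine_similarity`, `classify`, `OPTIONS` and `ABSTRACT` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def classify(option_embeddings, abstract_embedding):
    """Return the option most similar to the abstract, plus all scores."""
    scores = {
        name: cosine_similarity(vector, abstract_embedding)
        for name, vector in option_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (assumption): in practice these would come from the
# pre-trained text vectorization model and have hundreds of dimensions.
OPTIONS = {
    "energy": [0.9, 0.1, 0.0],
    "information technology": [0.1, 0.9, 0.2],
    "utilities": [0.2, 0.1, 0.9],
}
ABSTRACT = [0.2, 0.8, 0.1]

best_option, similarity_scores = classify(OPTIONS, ABSTRACT)
```

Because each option and the abstract are embedded once and compared with a cheap vector operation, adding more classification options does not require further passes through the language model for the abstract.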

In an embodiment of the present invention, the keyword extraction module may include:

a first keyword extraction sub-module, configured to extract a plurality of first candidate keywords for describing a subject of the digital object from the original data based on a term frequency-inverse document frequency algorithm (TF-IDF) and calculate a weight of each of the first candidate keywords;

the second keyword extraction sub-module is used for extracting, from the original data, a plurality of second candidate keywords for describing the subject of the digital object based on the TextRank text ranking algorithm, and calculating the weight of each second candidate keyword;

As shown in Table 1, the Dublin Core standard defines the metadata item "subject" as a description of the subject of the resource content, where the subject of a particular resource may be described with keywords, classification numbers and the like. Where the "subject" metadata in the common attribute metadata of a digital object consists of keywords, the keyword extraction module may be configured to extract a user-specified number of keywords from the title and abstract describing the target digital object. In the related art, keywords and their weights are generally determined using only TF-IDF, a keyword extraction algorithm based on statistical features, or only TextRank, a keyword extraction algorithm based on a word graph model. In practice, however, application environments are complex, and the same keyword extraction method performs differently on different types of text, such as long and short texts. This embodiment therefore takes a weighted average of the weights produced by the two algorithms to compensate for the shortcomings of either algorithm alone.

That is, the first keyword extraction sub-module uses TF-IDF to extract a plurality of first candidate keywords describing the subject of the digital object from the original data and calculates the weight of each; the second keyword extraction sub-module uses TextRank to extract a plurality of second candidate keywords describing the subject of the digital object from the original data and calculates the weight of each; finally, the keyword calculation sub-module computes a weighted average weight for every keyword among the first and second candidate keywords and selects the K keywords with the largest weights. Since both TF-IDF and TextRank belong to the prior art, their implementation principles are not described again here.
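The fusion performed by the keyword calculation sub-module can be sketched as follows. This is a minimal sketch under assumptions the patent does not state: both weight sets are taken to be already normalized to a comparable scale, a keyword proposed by only one algorithm is given weight 0 for the other, and the mixing coefficient `alpha`, the function name and the sample weights are hypothetical.

```python
def fuse_keyword_weights(tfidf_weights, textrank_weights, alpha=0.5, top_k=3):
    """Weighted average of two keyword weightings, returning the top K.

    alpha is the share given to the TF-IDF weight; (1 - alpha) goes to
    the TextRank weight. A keyword missing from one weighting counts as
    0.0 there (an assumption, not specified in the source).
    """
    candidates = set(tfidf_weights) | set(textrank_weights)
    fused = {
        word: alpha * tfidf_weights.get(word, 0.0)
        + (1.0 - alpha) * textrank_weights.get(word, 0.0)
        for word in candidates
    }
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical, pre-normalized weights from the two sub-modules.
TFIDF = {"metadata": 0.8, "digital object": 0.6, "registry": 0.3}
TEXTRANK = {"metadata": 0.6, "modeling": 0.5, "digital object": 0.4}

top_keywords = fuse_keyword_weights(TFIDF, TEXTRANK, alpha=0.5, top_k=2)
```

Raising `alpha` favors the statistical signal, which may suit long texts with rich term statistics, while lowering it favors the graph-based signal; the source does not specify how the coefficient is chosen.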

As shown in Table 1, the Dublin Core standard defines the metadata item "description" as a description of the resource content. Metadata for "description" therefore describes the resource content of the target digital object, for example an abstract, a table of contents or a text description of graphics, and can be summarized as abstract information: by extracting abstract information describing the digital object from the original data, the invention obtains the metadata whose metadata item is "description". In an embodiment of the present invention, the abstract extraction module may adopt an unsupervised extractive abstract generation method, that is, a method that directly selects several important sentences from the original data and then orders and recombines them to form an abstract. Specifically, the following steps may be used. First, the existing sentences in the original document are lengthened: a portion of phrases or words is randomly sampled from a corpus and shuffled, then added to the existing sentences to form longer sentences. Second, the long sentences are compressed with an encoder-decoder framework that uses an RNN decoder, $h_t = \mathrm{RNN}(h_{t-1}, x_t, T_{dec} - t)$, where $T_{dec}$ is the specified length of the abstract information, $h_t$ is the hidden state at the $t$-th step (layer) of the encoder-decoder framework, $h_{t-1}$ is the hidden state at the preceding step, and $x_t$ is the external input, typically the embedding of the previously decoded token. Third, a loss function is constructed so that the sentences compressed in the second step are as close as possible to the sentences of the first step. In a typical RNN encoder-decoder architecture, the final hidden state of the encoder is used as the initial hidden state of the decoder, i.e. $h_0^{dec} = h_{T_{enc}}^{enc}$. In this embodiment, a fully connected layer is trained to produce the initial state, $h_0^{dec} = f(h_{T_{enc}}^{enc}, s)$, where $h_0^{dec}$ is the initial hidden state of the decoder, $h_{T_{enc}}^{enc}$ is the final hidden state of the encoder, $f$ is the fully connected layer function and $s$ is the pre-trained InferSent sentence embedding.

As shown in Table 1, the Dublin Core standard defines the metadata item "format" as the physical or digital representation of the resource, such as its type, format, size, video duration and definition, or number of database records, and defines the metadata item "date" as a time associated with an event in the resource lifecycle. Both can be summarized as "data attribute" information of the digital object metadata. In an embodiment of the present invention, the data attribute extraction module may use the os module in Python to obtain metadata such as file type, format and size from the original data, the multimedia processing tool ffmpeg to obtain metadata such as video duration and definition, and pymysql and psycopg2 to obtain metadata such as the number of database records.
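The file-level part of the data attribute extraction can be sketched with the Python standard library as follows. The function name and the returned field names are hypothetical, and using the file's last-modification time as the "date" value is an assumption; the source does not say which lifecycle event is recorded. Video and database attributes (ffmpeg, pymysql, psycopg2) are outside this sketch.

```python
import mimetypes
import os
from datetime import datetime, timezone


def extract_file_attributes(path):
    """Collect basic 'format' and 'date' attributes for a local file."""
    stat = os.stat(path)
    _, extension = os.path.splitext(path)
    # guess_type infers the media type from the file name only; it may
    # return None for unknown extensions.
    media_type, _ = mimetypes.guess_type(path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "format": extension.lstrip(".").lower(),
        "media_type": media_type,
        "size_bytes": stat.st_size,
        # Assumption: last-modification date stands in for the 'date' item.
        "date": modified.date().isoformat(),
    }
```

A call such as `extract_file_attributes("dataset.csv")` would return a dictionary that can be written directly into the "format" and "date" metadata fields; the file name here is only an example.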

As shown in Table 1, the Dublin Core standard defines the metadata item "coverage" as the extent or scope of the resource content, which may be a spatial location, a time interval or an administrative region. The time-type metadata can therefore be represented by time information. In an embodiment of the present invention, the time information extraction module may extract time information from Chinese and English text in the original data and normalize the results into date-time form for use as the time-type metadata. Specifically, the datefinder module in Python can be used to extract time information from English text, and regular expressions can be used to match time information in Chinese text.
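The regular-expression branch for Chinese text can be sketched as follows. The pattern covers only the full "YYYY年M月D日" form and is an illustrative assumption; the source does not list the expressions actually used, and the function and constant names are hypothetical.

```python
import re
from datetime import date

# Matches dates written as, for example, 2022年4月12日 or 2022年04月12日.
CHINESE_DATE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def extract_chinese_dates(text):
    """Return the dates found in the text, normalized to ISO 8601 strings."""
    results = []
    for year, month, day in CHINESE_DATE.findall(text):
        try:
            results.append(date(int(year), int(month), int(day)).isoformat())
        except ValueError:
            # Skip matches that are not real calendar dates (e.g. month 13).
            continue
    return results


sample_dates = extract_chinese_dates("数据采集于2021年3月5日，并于2022年04月12日发布。")
```

Normalizing every match to one ISO 8601 form means downstream storage and comparison do not need to know which surface pattern produced the value; additional patterns (year-month only, ranges, relative expressions) would need their own expressions.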

In an embodiment of the present invention, the region information extracting module includes:

The pre-selected geographic information service application programming interfaces (APIs) may be Nominatim, Baidu, Amap (Gaode), Google, and the like. First, the data set construction submodule of this embodiment may use a stand-alone geographic information extraction method and use crawler technology to crawl geographic information for international and domestic addresses from these APIs, forming a geographic information data set that covers 2,790,951 regions and cities in 230 countries and includes the geographic coordinates of each country, region, and city, where domestic addresses are accurate to the village and town level and foreign addresses are accurate to the city level.

Secondly, as shown in Fig. 3, to improve matching efficiency, the part-of-speech recognition submodule of this embodiment first segments the text into words and then uses part-of-speech recognition to identify target words whose part of speech is place name or transliterated place name. Finally, the semantic matching submodule may use an algorithm model based on semantic matching to match the region information that conforms to the definition of a region information entity, thereby obtaining the region metadata of the digital object.
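The final matching step, in which candidate place-name words are compared against the geographic information data set, might look roughly like the sketch below. It is not the semantic matching model of the embodiment: character-level similarity from `difflib` is swapped in as a stand-in, and the miniature gazetteer, the function name `match_region`, and the threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical miniature gazetteer: name -> (latitude, longitude).
GAZETTEER = {
    "Beijing": (39.9042, 116.4074),
    "Shanghai": (31.2304, 121.4737),
    "Paris": (48.8566, 2.3522),
}


def match_region(candidate, gazetteer=GAZETTEER, threshold=0.8):
    """Return (name, coordinates) of the best gazetteer match, or None."""
    best_name, best_score = None, 0.0
    for name in gazetteer:
        # Stand-in similarity (assumption): ratio of matching characters.
        score = SequenceMatcher(None, candidate.lower(), name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score >= threshold:
        return best_name, gazetteer[best_name]
    return None
```

Restricting the candidates to words already tagged as place names keeps the number of comparisons against a multi-million-entry data set manageable, which is the efficiency argument made above.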

As shown in Table 1, the metadata of the other metadata items in the Dublin Core standard, such as title, creator, contributor (other responsible persons), identifier, source, language, relation, and rights, can be extracted by the semantic function extraction submodule and the custom rule extraction submodule. That is, the other metadata extraction module may include a semantic function extraction submodule and a custom rule extraction submodule, wherein:

and the custom rule extraction submodule is used for extracting second information similar to the semantic or structural characteristics of other target metadata items from the original data based on a rule pre-defined by a user, and taking the second information as metadata of the other target metadata items.

In an embodiment of the present invention, the semantic function extracting sub-module includes:

a key-value format document extracting unit for extracting a key-value format document from the raw data;

Specifically, the key-value format document extraction unit may include: a direct extraction subunit, configured to directly extract the key-value format document in the original data; the table extraction subunit is used for identifying the row names and/or column names of the tables in the data resource related description document, using the identified row names and/or column names as keys, and using the cell contents corresponding to the row names and/or column names as values to obtain a key-value format document; and the unstructured text extraction subunit is used for segmenting the unstructured text in the data resource related description document, and then matching a plurality of key-value pairs in the segmented unstructured text by utilizing a semantic template to obtain a key-value format document.

For key-value format documents such as json, xml, and csv, as shown in Fig. 4, this embodiment may extract the key-value pairs directly with the direct extraction subunit, then use the semantic similarity calculation unit to calculate the semantic similarity between each of the Chinese name, English name, alias, and definition of a metadata item and each key in the data resource related description document, take the key with the largest similarity whose similarity value is greater than a threshold as the item matched with the metadata item, and take the value corresponding to that key as the value of the metadata item. For a table in a pdf document, as shown in Fig. 4, this embodiment may use the table extraction subunit to first parse the table, identify its row names/column names and use them as keys, use the cells corresponding to the row names/column names as values, and then perform key-metadata item matching by the same method. For unstructured text such as txt paragraphs and pdf paragraphs, as shown in Fig. 4, this embodiment may use the unstructured text extraction subunit to first segment the text into words, then use a semantic template to match the value corresponding to a key within the sentence containing a candidate key, and perform key-metadata item matching by the same method over the key-value pairs for which the semantic template successfully returns a value.
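The key-metadata item matching rule (largest similarity, and above a threshold) can be sketched as follows. The similarity function here is a character-overlap stand-in chosen only so the example is self-contained; the embodiment uses a semantic similarity calculation that is not reproduced here, and the names `similarity`, `match_item`, and the threshold value are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for a semantic similarity model (assumption): character overlap.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_item(item_names, document, threshold=0.7):
    """Return the value whose key best matches any name of the metadata item.

    item_names: names, aliases or definitions of one metadata item.
    document:   key-value document already extracted from the raw data.
    """
    best_key, best_score = None, 0.0
    for key in document:
        score = max(similarity(name, key) for name in item_names)
        if score > best_score:
            best_key, best_score = key, score
    # Accept only the best key, and only if it clears the threshold.
    if best_key is not None and best_score > threshold:
        return document[best_key]
    return None
```

Because the table and unstructured-text subunits both produce the same key-value form, a single matching routine of this kind can serve all three input types.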

In an embodiment of the present invention, the custom rule extraction sub-module may include:

the semantic feature extraction unit is used for determining, according to a user-defined semantic feature extraction rule, that the value corresponding to the other target metadata item is the value corresponding to a key in the original data that is a target word, extracting the value corresponding to the target word from the original data, and using the value as metadata of the other target metadata item;

The custom rule extraction submodule allows a user to customize extraction rules based on semantic features or structural features for original or newly added metadata items, or to customize knowledge extraction rules according to a knowledge base uploaded by the user. The semantic feature extraction unit extracts metadata from the original data based on a custom semantic feature extraction rule. For example, for a data resource related description document in json format, the following semantic feature extraction rule may be formulated: the value of the metadata item "creator" is the value corresponding to the key "development unit" in the description document. The "creator" information can then be extracted using this rule.
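The "creator" example above can be expressed as a small sketch. The rule table `SEMANTIC_RULES` and the function `apply_semantic_rules` are hypothetical names; the real rule syntax of the embodiment is the template syntax described separately and is not parsed here.

```python
import json

# Hypothetical rule table: metadata item -> key whose value should be taken.
SEMANTIC_RULES = {"creator": "development unit"}


def apply_semantic_rules(json_text, rules=SEMANTIC_RULES):
    """Apply 'item takes the value of key' rules to a json description document."""
    document = json.loads(json_text)
    # Items whose key is absent from the document are simply left out.
    return {item: document[key] for item, key in rules.items() if key in document}
```

For instance, a description document containing `"development unit": "Peking University"` would yield `{"creator": "Peking University"}`.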

The structural feature extraction unit extracts metadata from the original data based on a custom structural feature extraction rule. Structural feature extraction rules may include visual feature and text feature extraction rules, covering position coordinates, color, font size, word spacing, bolding, hue, and the like, which are used to extract metadata information from the original data. For example, for a data resource related description document in pdf table form, the following structural feature extraction rule may be formulated: the value in the bold-font cell whose row name is "unit name" is the value of the metadata item "name". The "name" information can then be extracted using this rule.

A knowledge extraction rule is an extraction rule generated from a knowledge base uploaded by the user and is implemented by a knowledge extraction unit. For example, suppose the knowledge base uploaded by the user includes the two knowledge triples (employee A, work unit, company A) and (employee A, colleague, employee B). From these it follows that employee A works at company A and that employee A and employee B are colleagues, so the information that the work unit of employee B is company A can be extracted according to the knowledge extraction rule. This embodiment allows the user to upload one or more knowledge bases in triple format and provides a rule template to guide the user in formulating knowledge extraction rules.
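The colleague example can be worked through in a few lines. This is a single-pass sketch of one hard-coded rule ("colleagues share a work unit"), with the function name `infer_work_units` chosen for illustration; it is not the rule engine of the embodiment and does not chain inferences transitively.

```python
def infer_work_units(triples):
    """Infer (person, 'work unit', company) triples from colleague relations."""
    work_unit = {s: o for s, p, o in triples if p == "work unit"}
    inferred = []
    for s, p, o in triples:
        if p != "colleague":
            continue
        # Colleagues share a work unit; propagate in both directions.
        for known, unknown in ((s, o), (o, s)):
            if known in work_unit and unknown not in work_unit:
                inferred.append((unknown, "work unit", work_unit[known]))
    return inferred
```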

To parse user-defined rules, as shown in Table 2, an embodiment of the present invention defines a set of extraction rule templates. Guided by the templates, the user may define extraction rules based on semantic features or structural features, or generate knowledge extraction rules from an uploaded knowledge base. Table 2 lists part of the syntax defined in the template.

TABLE 2

Syntax	Meaning
metadata item abbreviated name: [rule 1 content, rule 1 label; rule 2 content, rule 2 label; ……]	Syntax for defining the extraction rules of a specific metadata item. A rule label is +1 or -1, indicating respectively that content conforming to the rule is or is not the value of the metadata item.
key	Indicates a key
value	Indicates a value
==	Indicates that the text on both sides of the symbol is identical
has_en()	Indicates that the text in the parentheses contains English letters
has_zh()	Indicates that the text in the parentheses contains Chinese characters
has_digit()	Indicates that the text in the parentheses contains digits
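The predicates listed in Table 2 and the +1/-1 rule labels could be realized along the following lines. How several labeled rules are combined is not stated in the text; the `accept` function below assumes that any matching -1 rule rejects the candidate and that at least one +1 rule must match, which is an assumption made only for this sketch.

```python
import re


def has_en(text):
    return re.search(r"[A-Za-z]", text) is not None


def has_zh(text):
    # CJK Unified Ideographs basic block (assumption: sufficient for this sketch).
    return re.search(r"[\u4e00-\u9fff]", text) is not None


def has_digit(text):
    return re.search(r"\d", text) is not None


def accept(key, value, rules):
    """rules: list of (predicate(key, value), label) with label +1 or -1.

    Assumed semantics: a matching -1 rule rejects; otherwise the candidate
    is accepted if at least one +1 rule matched.
    """
    accepted = False
    for predicate, label in rules:
        if predicate(key, value):
            if label == -1:
                return False
            accepted = True
    return accepted
```

A rule set such as `[(lambda k, v: k == "security level", +1), (lambda k, v: has_digit(v), -1)]` would then accept the pair ("security level", "secret") and reject ("security level", "level 3").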

In an embodiment of the present invention, the apparatus may further include:

and the extended metadata item self-adapting module is used for providing an operable interface of the configuration file of the digital object for a user, acquiring metadata items and related definitions added or extended by the user in the operable interface, and storing the added or extended metadata items and related definitions in the form of the configuration file.

Extended metadata items are metadata items newly added by a business department when it finds that the common attribute metadata based on the Dublin Core standard cannot meet its specific business scenarios and requirements. The extended metadata item self-adapting module of this embodiment supports adaptive extraction of the extended metadata formulated by a business department: the metadata standard that the tool follows (such as the Dublin Core standard) is supplied in the form of a configuration file, and when the existing standard needs to be extended, the user only needs to add the extended metadata items and their related definitions to the configuration file through the extended metadata item self-adapting module. For example, a security unit may need to add a metadata item such as "security level". The semantic feature extraction unit or the structural feature extraction unit of the apparatus may be used to extract extended metadata items.
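Extending the standard through the configuration file can be pictured with a small sketch. The json layout (a top-level "items" list) and the function `add_extended_item` are hypothetical; the field names nodeName and enName are borrowed from the metadata item columns used elsewhere in this description, and the actual configuration format of the tool is not specified.

```python
import json


def add_extended_item(config_text, item):
    """Return a new configuration text with one extended metadata item added."""
    config = json.loads(config_text)
    names = {existing["nodeName"] for existing in config["items"]}
    if item["nodeName"] in names:
        # Refuse to redefine an item that the schema already contains.
        raise ValueError(f"metadata item already defined: {item['nodeName']}")
    config["items"].append(item)
    return json.dumps(config, ensure_ascii=False, indent=2)
```

Adding `{"nodeName": "securityLevel", "enName": "security level"}` to a configuration that holds only the common attribute items would leave those items untouched and append the new one.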

In the present invention, all functional modules of the apparatus may be pluggable: the automatic classification module, the keyword extraction module, the abstract extraction module, the data attribute extraction module, the time information extraction module, the region information extraction module, and the other metadata extraction module are all pluggable modules that are independent of one another. In addition, the extended metadata item self-adapting module lets the user configure functional parameters through a configuration file, which makes metadata extraction more flexible.

In addition, to address the relational database table output form of the metadata information, the extensibility requirement of the metadata schema, and the growth in metadata volume that comes with wider use of the registry system, the invention also provides a scheme for storing metadata in a relational database that supports extension of the metadata schema (a metadata schema is a set of metadata item definitions) and distributed storage of metadata. Specifically, the apparatus of the present invention further comprises:

the extensible metadata storage module is used for writing a target metadata item newly added to the digital object currently into a metadata mode storage table, and automatically filling metadata of the target metadata item into the metadata mode storage table through a metadata item identifier of the target metadata item.

The extensible metadata storage module adopts an extensible metadata storage strategy, realized through the table structure design of a metadata mode storage table and a metadata storage table, which are associated through a foreign key on the metadata item identifier. When a new metadata item is added, only one record for that item needs to be written to the metadata mode storage table, and the metadata records corresponding to the item are added to the metadata storage table, without changing any table structure. In short, extensibility means that no table structure needs to be modified when a new metadata item is added, which brings advantages such as support for metadata item extension and for distributed storage of metadata.

In implementation, the tool uses a relational table ItemDO as the metadata mode storage table to store the original metadata items of a predetermined standard (such as the Dublin Core standard) as well as newly added target metadata items. For example, based on the definition of metadata items in the ISO/IEC 11179 standard adopted by the Dublin Core standard, one implementable schema of the table is: (SDID, ItemDOID, PID, nodeType, nodeName, zhName, enName, alias, zhDec, enDec, dataType, minOccurs, maxOccurs, remarks), where each column is defined as shown in Table 3:

TABLE 3

Column name	Meaning
SDID	ID of the version of the metadata schema to which the metadata item belongs, indicating whether the item is a common attribute metadata item or belongs to one of the versions of extended metadata.
ItemDOID	Metadata item identification.
PID	Identification of the metadata entity containing the metadata item, i.e. the metadata item parent identification.
nodeType	Indicates whether the metadata item contains child items, i.e. whether it is a metadata entity; 1 means yes and 0 means no.
nodeName	Abbreviated name of the metadata item.
zhName	Chinese name of the metadata item.
enName	English name of the metadata item.
alias	Alias of the metadata item.
zhDec	Chinese definition of the metadata item.
enDec	English definition of the metadata item.
minOccurs	Indicates whether the metadata item is optional; 1 means mandatory and 0 means optional.
maxOccurs	Maximum number of occurrences of the metadata item; -1 means unlimited.
remarks	Remarks on the metadata item.

For the storage of metadata, the tool uses a relational table MetaDO as the metadata storage table to store the metadata of each digital object. Because the definition information of metadata items is already stored in the metadata mode storage table (ItemDO), it does not need to be stored again, and only the value information of the metadata needs to be stored in the database. The metadata storage table is therefore designed as (MetaDOID, ItemDOID, PID, Value), where each column is defined as shown in Table 4:

TABLE 4

Column name	Meaning
MetaDOID	Digital object identification, or metadata identification of the digital object
ItemDOID	Metadata item identification
PID	Metadata item parent identification, i.e. metadata entity identification
Value	Value of the metadata item

This table structure design meets the extensibility requirement of the metadata schema: when a metadata item needs to be added, only one record needs to be added to the metadata mode storage table, whose SDID column maintains the version information of extended metadata items, and the structure of the metadata storage table does not need to be modified. Here the SDID is the ID of the version of the metadata schema to which the metadata item belongs. For example, all metadata items in the metadata schema containing the common attributes (the Dublin Core standard) have an SDID of 0, and all metadata items in a confidentiality metadata schema specified by a security unit have an SDID of 1.
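The two-table design can be tried out with an in-memory SQLite database. The sketch below is illustrative only: it keeps a subset of the ItemDO columns, uses made-up identifiers such as 'i1' and 'do-001', and relies on SQLite rather than whatever relational database the tool actually targets. It shows that adding an extended item (SDID 1) requires only new rows, not a change of table structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ItemDO (
    SDID INTEGER, ItemDOID TEXT PRIMARY KEY, PID TEXT,
    nodeType INTEGER, nodeName TEXT
);
CREATE TABLE MetaDO (
    MetaDOID TEXT, ItemDOID TEXT REFERENCES ItemDO(ItemDOID),
    PID TEXT, Value TEXT
);
""")
# A common-attribute item (SDID 0) and a later extension (SDID 1):
# both are plain row inserts, no ALTER TABLE is needed.
conn.execute("INSERT INTO ItemDO VALUES (0, 'i1', NULL, 0, 'creator')")
conn.execute("INSERT INTO ItemDO VALUES (1, 'i2', NULL, 0, 'securityLevel')")
conn.execute("INSERT INTO MetaDO VALUES ('do-001', 'i1', NULL, 'Peking University')")
conn.execute("INSERT INTO MetaDO VALUES ('do-001', 'i2', NULL, 'internal')")


def metadata_of(conn, do_id):
    """Join values with item definitions for one digital object."""
    rows = conn.execute(
        "SELECT i.nodeName, m.Value FROM MetaDO m "
        "JOIN ItemDO i ON i.ItemDOID = m.ItemDOID "
        "WHERE m.MetaDOID = ? ORDER BY i.ItemDOID", (do_id,))
    return dict(rows.fetchall())
```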

The above table structure is similar to that of a triple table, except that the storage scheme adds to both tables a column storing the parent of a metadata item, namely the metadata entity identification, and thereby builds an index over the metadata item structure. This index serves the following purposes: 1) it optimizes query efficiency for basic schema queries: the main drawback of triple table storage is that queries generate a large number of self-joins, which hurts query efficiency, and by building a schema structure index with Huffman prefix codes, the number of self-joins can be reduced for parent-child relationship queries based on schema classification; 2) it provides richer data partitioning schemes for distributed storage: the triple table storage structure supports distributed storage under certain data partitioning schemes, such as partitioning by data type or by relationship type, and the schema structure index adds further schema-based partitioning schemes, such as partitioning by metadata entity or by metadata entities of a specific level; 3) it supports metadata sharing and interoperation with other external systems in the form of XML documents: through the schema structure index, XML Schema documents and metadata XML documents conforming to the metadata schema definition can be generated from the relational tables.
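Why prefix-coded identifiers reduce self-joins can be seen from a toy example. The text does not give the actual encoding, so the codes below are invented for illustration and are not produced by a Huffman construction; the only property relied on is that a child's code begins with its parent's code, so that all descendants of an entity can be found with one prefix test instead of repeated joins on the parent column.

```python
# Hypothetical prefix-coded metadata item identifiers: a child's code
# starts with its parent's code (codes are invented for this sketch).
ITEMS = {
    "0": "coverage",
    "00": "spatial",
    "01": "temporal",
    "010": "startDate",
    "011": "endDate",
    "1": "creator",
}


def descendants(code, items=ITEMS):
    """Names of all items below `code`, found by prefix test alone."""
    return sorted(name for c, name in items.items()
                  if c.startswith(code) and c != code)
```

In a relational table the same test would correspond to a prefix condition on the identifier column, and the same prefixes could serve as partition keys for the entity-based partitioning schemes mentioned above.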

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. "and/or" means that either or both of them can be selected. Also, the terms "include", "including" or any other variations thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device including a series of elements includes not only those elements but also other elements not explicitly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The metadata automatic generation device for digital object modeling provided by the invention is described in detail, a specific example is applied in the text to explain the principle and the implementation mode of the invention, and the description of the above embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. An apparatus for automatically generating metadata for modeling a digital object, the apparatus comprising:

the keyword extraction module is used for extracting, from the original data, metadata whose metadata item is a subject;

2. The apparatus for automatically generating metadata for modeling oriented to digital objects as claimed in claim 1, wherein the metadata whose metadata item is a type includes a category name of the digital object; the automatic classification module comprises:

the text vectorization submodule is used for mapping each pre-obtained classification option defined by a user and the digital object description abstract in the original data into a unified vector space to generate embedding of a plurality of classification options with the same dimension;

3. The apparatus for automatically generating metadata for modeling oriented to digital objects according to claim 1, wherein said keyword extraction module comprises:

a first keyword extraction sub-module, configured to extract, from the raw data, a plurality of first candidate keywords for describing a topic of the digital object based on a word frequency-inverse document frequency algorithm and calculate a weight of each of the first candidate keywords;

a second keyword extraction sub-module, configured to extract, from the original data, a plurality of second candidate keywords for describing a subject of the digital object based on a text sorting algorithm and calculate a weight of each of the second candidate keywords;

4. The apparatus for automatically generating metadata for modeling based on digital objects according to claim 1, wherein the region information extraction module comprises:

the part-of-speech recognition submodule is used for segmenting the text in the original data and then recognizing target words with parts of speech being place names and transliterated place names from the segmented text;

and the semantic matching sub-module is used for performing semantic matching on the target words and the plurality of geographic information in the geographic information data set, determining the region information of the digital object from the target words, and taking the region information as the region metadata of the digital object.

5. The apparatus according to claim 1, wherein the other metadata extraction module comprises a semantic function extraction sub-module and a custom rule extraction sub-module, wherein:

and the custom rule extraction submodule is used for extracting second information similar to the semantics or structural features of other target metadata items from the original data based on a rule pre-defined by a user, and taking the second information as metadata of the other target metadata items.

6. The apparatus for automatically generating metadata for modeling digital objects according to claim 5, wherein said semantic function extracting sub-module comprises:

a key-value format document extraction unit for extracting a key-value format document from the raw data;

and the semantic similarity calculation unit is used for calculating, for the key-value format document, the semantic similarity between each of the Chinese name, English name, alias, and definition of the target other metadata item and each key in the key-value format document, taking the key with the maximum semantic similarity whose semantic similarity is greater than a preset threshold as the item matched with the target other metadata item, and taking the value corresponding to the key as the metadata of the target other metadata item.

7. The apparatus of claim 6, wherein the key-value format document extracting unit comprises:

a direct extraction subunit, configured to directly extract a key-value format document in the original data;

a table extraction subunit, configured to identify, for a table in the data resource related description document, a row name and/or a column name of the table, use the identified row name and/or column name as a key, and use cell content corresponding to the row name and/or column name as a value to obtain the key-value format document;

and the unstructured text extraction subunit is used for segmenting the unstructured text in the data resource related description document, and then matching a plurality of key-value pairs in the segmented unstructured text by utilizing a semantic template to obtain the key-value format document.

8. The apparatus for automatically generating metadata for modeling oriented to digital objects according to claim 5, wherein said custom rule extraction submodule comprises:

the semantic feature extraction unit is used for determining, according to a user-defined semantic feature extraction rule, that the value corresponding to the target other metadata item is the value corresponding to a key in the original data that is a target word, extracting the value corresponding to the target word from the original data, and using the value as metadata of the target other metadata item;

the structural feature extraction unit is used for determining, according to a user-defined visual feature extraction rule and a user-defined text feature extraction rule, that the value corresponding to the other target metadata item is the value corresponding to a key in the original data that is a target text and has a target font format, extracting the value corresponding to the target text and the target font format from the original data, and using the extracted value as metadata of the other target metadata item;

9. The apparatus for automatically generating metadata for modeling digital objects according to claim 1, wherein said apparatus further comprises:

the extended metadata item self-adapting module is used for providing an operable interface of the configuration file of the digital object for a user, acquiring metadata items and related definitions added or extended by the user in the operable interface, and storing the added or extended metadata items and related definitions in the form of the configuration file.

10. The apparatus for automatically generating metadata for modeling digital objects according to claim 1, wherein said apparatus further comprises:

an extensible metadata storage module comprising a metadata mode storage table and a metadata storage table that are associated by a foreign key on the metadata item identifier, the metadata mode storage table being used for storing metadata items of the digital object; in the metadata mode storage table, the metadata item identifier and the metadata item parent identifier adopt Huffman prefix coding; the metadata storage table is used for storing metadata of each metadata item of the digital object;

the extensible metadata storage module is used for writing a target metadata item newly added currently for the digital object into the metadata mode storage table, and automatically filling metadata of the target metadata item into the metadata mode storage table through a metadata item identifier of the target metadata item.