CN113535968A - Method and device for extracting key attributes of data - Google Patents

Method and device for extracting key attributes of data

Info

Publication number
CN113535968A
Authority
CN
China
Prior art keywords
attribute
key
data
name
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010312666.6A
Other languages
Chinese (zh)
Inventor
邸志惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202010312666.6A
Publication of CN113535968A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology

Abstract

The invention discloses a method and a device for extracting key attributes of data, and relates to the technical field of computers. One embodiment of the method comprises: extracting structured attribute information from the description information of the data, wherein the structured attribute information comprises attribute names and attribute values corresponding to the attribute names; normalizing the attribute names by performing similarity calculation on the attribute values, and determining key attribute names from the normalized attribute names; constructing a key attribute knowledge graph according to each key attribute name and its corresponding attribute values; and extracting the key attributes of the data based on the key attribute knowledge graph. In this embodiment, the key attribute knowledge graph constructed by processing the structured attribute information enables efficient key attribute extraction from unstructured information without relying on grammatical structure or labeled data, which greatly improves the efficiency and accuracy of extracting the key attributes of the data.

Description

Method and device for extracting key attributes of data
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for extracting key attributes of data.
Background
Data comes in many types, and the key attributes that distinguish one type from another differ. Solving the matching problem for all data in a big-data scenario with a single algorithm model is therefore almost unrealistic; instead, the key information that distinguishes the data must be extracted according to the characteristics of each data type. How to extract the key attributes of multi-type data has thus become a problem that needs to be solved.
At present, key attribute extraction methods existing in the field of data search or lookup mainly include a method based on pattern matching, a method based on statistics (traditional machine learning and deep learning), a method based on weak supervision/corpus expansion, and the like.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
The existing key attribute extraction methods in the field of data search or lookup cannot efficiently extract the key attributes of unstructured information and depend on grammatical structure and labeled data, so both the efficiency and the accuracy of extracting the key attributes of data are low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for extracting a key attribute of data, which can perform efficient key attribute extraction on unstructured information based on a key attribute knowledge graph constructed by processing structured attribute information, and do not need to rely on a syntactic structure and labeled data, thereby greatly improving efficiency and accuracy of extracting the key attribute of data.
To achieve the above object, according to an aspect of the embodiments of the present invention, a method for extracting a key attribute of data is provided.
A method for extracting key attributes of data comprises the following steps: extracting structured attribute information from the description information of the data, wherein the structured attribute information comprises attribute names and attribute values corresponding to the attribute names; normalizing the attribute names by performing similarity calculation on the attribute values, and determining key attribute names from the normalized attribute names; constructing a key attribute knowledge graph according to each key attribute name and its corresponding attribute values; and extracting the key attributes of the data based on the key attribute knowledge graph.
Optionally, normalizing the attribute names by performing similarity calculation on the attribute values, and determining the key attribute names from the normalized attribute names, includes: performing similarity calculation on the attribute values, and determining a group of attribute names whose attribute-value similarity meets a set threshold as a synonymous attribute name set; for each synonymous attribute name set, selecting one attribute name from the set as the normalized attribute name, and merging and de-duplicating the attribute values corresponding to each attribute name in the set to obtain the attribute values corresponding to the normalized attribute name; and determining the key attribute names according to the importance to the data of the attribute values corresponding to each normalized attribute name.
Optionally, before constructing the key attribute knowledge graph according to each key attribute name and its corresponding attribute values, the method further includes: for each key attribute name, if the number of attribute values corresponding to the key attribute name is less than a set number threshold, expanding the attribute values according to the attribute values corresponding to the key attribute name in an external knowledge base.
Optionally, before constructing the key attribute knowledge graph according to each key attribute name and the attribute value corresponding to the key attribute name, the method further includes: for each key attribute name, if the attribute value corresponding to the key attribute name is a composite attribute value, splitting the composite attribute value, and constructing a replacement value set of the composite attribute value.
Optionally, before constructing the key attribute knowledge graph according to each key attribute name and its corresponding attribute values, the method further includes: based on an external knowledge base, acquiring a stop-word library, a synonym library, a hypernym-hyponym relation library and a key attribute library for each key attribute name and its corresponding attribute values, wherein the key attribute library comprises the key attribute names, their corresponding attribute values and the replacement value sets of composite attribute values.
Optionally, constructing the key attribute knowledge graph according to each key attribute name and its corresponding attribute values includes: constructing the key attribute knowledge graph according to the stop-word library, the synonym library, the hypernym-hyponym relation library and the key attribute library of each key attribute name and its corresponding attribute values.
Optionally, extracting the key attributes of the data based on the key attribute knowledge graph includes: extracting structured attribute information from the description information of the data, deleting invalid attribute information from the structured attribute information based on the stop-word library to obtain first attribute information, and extracting first key attributes from the first attribute information based on the synonym library and the key attribute library; extracting unstructured attribute information from the description information of the data, and matching the unstructured attribute information based on the synonym library and the key attribute library to obtain second key attributes, wherein the first key attributes and the second key attributes comprise key attribute names and corresponding attribute values; and extracting the key attributes of the data from the first key attributes and the second key attributes based on the composite attribute values and their replacement value sets in the key attribute library, the synonym library and the hypernym-hyponym relation library.
Optionally, after extracting the key attributes of the data based on the key attribute knowledge graph, the method further includes: analyzing the effectiveness and uniqueness of the extracted key attributes of the data so as to optimize the key attribute knowledge graph.
Optionally, if the description information of the data is picture data, performing text recognition on the picture data to obtain text description information, and extracting data key attributes from the text description information.
According to another aspect of the embodiment of the invention, an extraction device for the key attributes of the data is provided.
A device for extracting key attributes of data comprises: an information extraction module, used for extracting structured attribute information from the description information of the data, wherein the structured attribute information comprises attribute names and attribute values corresponding to the attribute names; an attribute processing module, used for normalizing the attribute names by performing similarity calculation on the attribute values and determining key attribute names from the normalized attribute names; a graph construction module, used for constructing a key attribute knowledge graph according to each key attribute name and its corresponding attribute values; and an attribute extraction module, used for extracting the key attributes of the data based on the key attribute knowledge graph.
According to another aspect of the embodiment of the invention, an electronic device for extracting key attributes of data is provided.
An electronic device for extracting key attributes of data comprises: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for extracting key attributes of data provided by the embodiments of the invention.
According to yet another aspect of embodiments of the present invention, a computer-readable medium is provided.
A computer readable medium, on which a computer program is stored, which when executed by a processor implements the method for extracting key attributes of data provided by embodiments of the present invention.
One embodiment of the above invention has the following advantages or benefits: structured attribute information is extracted from the description information of the data, wherein the structured attribute information comprises attribute names and corresponding attribute values; the attribute names are normalized by performing similarity calculation on the attribute values, and key attribute names are determined from the normalized attribute names; a key attribute knowledge graph is constructed according to each key attribute name and its corresponding attribute values; and the key attributes of the data are extracted based on the key attribute knowledge graph. The key attribute knowledge graph constructed by processing the structured attribute information thus enables efficient key attribute extraction from unstructured information without relying on grammatical structure or labeled data, which greatly improves the efficiency and accuracy of extracting the key attributes of the data.
Further effects of the above optional implementations are described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a method for extracting key attributes of data according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a key attribute knowledge graph building process according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of a data key attribute extraction process according to an embodiment of the invention;
FIG. 4 is a schematic diagram of the main modules of an apparatus for extracting key attributes of data according to an embodiment of the present invention;
FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The data query problem is similar to the search problem. For example, when a certain commodity or a certain class of commodities needs to be queried from the massive number of commodities on an e-commerce platform, the commodities the user wants must be found as accurately and comprehensively as possible, and there is little tolerance for wrong results or incomplete data.
Structuring the key attribute information of commodities is a very important step in the commodity matching task, and the efficiency and quality of that information strongly affect the matching result. Structuring the key attribute information of a commodity is, in essence, extracting it. Although many information extraction methods exist in the search field (including entity extraction, relation extraction and time extraction), no single algorithm can currently solve the problem of cross-domain information extraction: each domain has to be labeled separately and retrained, so the deployment cost of a single-category project is high. The existing key attribute extraction methods in the field of data search or lookup have the following problems:
1. Methods based on pattern matching extract key attributes from the logical grammar of the text data, and cannot extract key attributes from text that is a stack of phrases without grammatical structure;
2. Methods based on statistics need a large amount of labeled data to train a deep learning model, and therefore incur a high text labeling cost;
3. Methods based on weak supervision/corpus expansion need to introduce a large amount of knowledge similar to the features of the text data; this similar knowledge must itself be constructed, and introducing it reduces the accuracy of the extracted key features.
In order to solve the problems in the prior art, the invention provides a method for extracting key attributes of data that can extract key attributes from text consisting of stacked phrases without grammatical structure, and that does not depend on grammar or syntax-tree information or on a large amount of labeled data. Taking the extraction of key commodity information on an e-commerce platform as an example, the text description information of a commodity comprises both structured and unstructured information. The invention builds a high-quality commodity key attribute knowledge graph from the low-quality structured commodity information of the e-commerce platform, and uses it to efficiently extract key attributes from the unstructured commodity information (such as commodity titles and detailed commodity descriptions).
FIG. 1 is a schematic diagram of the main steps of a method for extracting key attributes of data according to an embodiment of the present invention. As shown in FIG. 1, the method for extracting key attributes of data according to the embodiment of the present invention mainly includes the following steps S101 to S104.
Step S101: extracting structured attribute information from the description information of the data, wherein the structured attribute information comprises attribute names and attribute values corresponding to the attribute names;
Step S102: normalizing the attribute names by performing similarity calculation on the attribute values, and determining key attribute names from the normalized attribute names;
Step S103: constructing a key attribute knowledge graph according to each key attribute name and its corresponding attribute values;
Step S104: extracting the key attributes of the data based on the key attribute knowledge graph.
According to the steps from S101 to S104, the key attribute name is obtained by extracting and processing the structured attribute information, then the key attribute knowledge graph is constructed based on the key attribute name and the attribute value, and the key attribute of the data is extracted based on the key attribute knowledge graph, so that the efficient key attribute extraction of the unstructured information of the data can be realized, the dependence on a syntactic structure, labeled data and the like is not required, and the efficiency of extracting the key attribute of the data is greatly improved.
According to an embodiment of the present invention, structured attribute information refers to attribute information from which attribute names and attribute values are directly available. For example, from the data item "Brand: A", the attribute name "brand" and the attribute value "A" can be obtained directly. A large amount of structured attribute information can be obtained by processing data extracted from a big data platform, which yields a large number of attribute names and their corresponding attribute values for subsequent processing.
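As a minimal sketch of this step, assuming the structured information arrives as "name: value" text lines (the function name `parse_structured_attributes` and the input format are illustrative assumptions, not part of the original disclosure), attribute names and their values could be collected as follows:

```python
from collections import defaultdict

def parse_structured_attributes(lines):
    """Collect attribute name -> set of attribute values from "name: value" lines.

    Hypothetical helper: the "name: value" line format is an assumption.
    """
    attributes = defaultdict(set)
    for line in lines:
        # Accept both the ASCII colon and the full-width colon used in Chinese text.
        normalized = line.replace("：", ":")
        name, sep, value = normalized.partition(":")
        name, value = name.strip(), value.strip()
        if sep and name and value:  # skip lines that are not name/value pairs
            attributes[name].add(value)
    return dict(attributes)

records = ["Brand: A", "Color：red", "Brand: B", "no separator here"]
parsed = parse_structured_attributes(records)
```

Lines without a separator are ignored here; a production system would more likely route them to the unstructured path.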
According to an embodiment of the present invention, normalizing the attribute names by performing similarity calculation on the attribute values and determining the key attribute names from the normalized attribute names in step S102 may specifically include:
performing similarity calculation on the attribute values, and determining a group of attribute names whose attribute-value similarity meets a set threshold as a synonymous attribute name set;
for each synonymous attribute name set, selecting one attribute name from the set as the normalized attribute name, and merging and de-duplicating the attribute values corresponding to each attribute name in the set to obtain the attribute values corresponding to the normalized attribute name;
and determining the key attribute names according to the importance to the data of the attribute values corresponding to each normalized attribute name.
A large number of attribute names and their corresponding attribute values are obtained in step S101, and in order to determine the key attribute, the key attribute name and its corresponding attribute value need to be determined. When determining the key attribute names, different data providers may have different names for the same attribute, so in order to extract the key attribute names more accurately and reasonably, normalization processing may be performed on the attribute names according to the similarity degree between the attribute values of each attribute name, so as to normalize the similar attribute names into the same attribute name.
Specifically, a similarity algorithm may be used to calculate the degree of similarity between attribute values, for example: calculating the proportion of identical character strings shared by two attribute values, calculating the edit distance between the pinyin of the texts, or calculating the cosine similarity or the Hamming distance between the vectors corresponding to the two attribute values. After the similarity between the attribute values is obtained, a group of attribute names whose similarity meets a set threshold can be determined as a synonymous attribute name set, so that a large number of attribute names can be divided into a plurality of synonymous attribute name sets according to the similarity of their attribute values.
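The grouping described above can be sketched as follows. This is only an illustration under stated assumptions: the similarity measure is a Jaccard overlap of the two value sets (one way to realize "the proportion of identical strings"), the threshold of 0.5 is arbitrary, and names are merged transitively with a small union-find; none of these choices is prescribed by the text.

```python
from itertools import combinations

def value_overlap(values_a, values_b):
    """Jaccard overlap of two attribute-value sets (an assumed similarity measure)."""
    a, b = set(values_a), set(values_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_synonymous_names(attributes, threshold=0.5):
    """Group attribute names whose value sets overlap by at least `threshold`."""
    parent = {name: name for name in attributes}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(attributes, 2):
        if value_overlap(attributes[a], attributes[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in attributes:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=lambda g: sorted(g))

attributes = {
    "brand": {"A", "B", "C"},
    "trademark": {"A", "B"},
    "color": {"red", "blue"},
}
groups = group_synonymous_names(attributes)
```

With these inputs, "brand" and "trademark" share two of three distinct values and end up in one set, while "color" stays alone. Edit distance or vector cosine similarity could be substituted for `value_overlap` without changing the grouping logic.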
For each synonymous attribute name set, the normalized attribute name may be chosen arbitrarily from the set, may be the attribute name that appears most frequently, or may be selected according to another set rule; the invention is not limited in this regard. The attribute values corresponding to the normalized attribute name then need to be determined. Each attribute name in the synonymous attribute name set has its own attribute values, and these values are similar but not identical, so the attribute values corresponding to all attribute names in the set can be merged and de-duplicated to obtain the attribute values corresponding to the normalized attribute name. In this way, the normalized attribute name for each synonymous attribute name set and its corresponding attribute values (possibly several) are obtained.
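A minimal sketch of this normalization, assuming the "most frequent name" rule is chosen and that name frequencies are available in a hypothetical `name_counts` mapping:

```python
def normalize_group(group, attributes, name_counts):
    """Pick one representative name for a synonymous set and merge its values.

    Assumption: the most frequent name is chosen; ties are broken
    alphabetically so the result is deterministic.
    """
    canonical = min(group, key=lambda n: (-name_counts.get(n, 0), n))
    merged = set()
    for name in group:
        merged |= set(attributes[name])  # set union removes duplicate values
    return canonical, merged

attributes = {"brand": ["A", "B"], "trademark": ["B", "C"]}
name_counts = {"brand": 120, "trademark": 15}
canonical, values = normalize_group({"brand", "trademark"}, attributes, name_counts)
```

Here "brand" is kept as the normalized name and the merged, de-duplicated value set contains A, B and C.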
Then, the key attribute name can be determined from the normalized attribute names. When the key attribute name is determined, the importance of the attribute value corresponding to the attribute name after normalization processing to the data is determined. Specifically, when calculating the importance of the attribute value to the data, the importance may be calculated according to the frequency of occurrence of the attribute value in all attribute values before deduplication, or the importance of different attribute values may be set in combination with a priori knowledge, or the like. According to the importance of each attribute value, the attribute name after normalization processing corresponding to the attribute value with higher importance can be selected as the key attribute name.
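One way to realize the frequency-based importance is sketched below. The aggregation rule (summing the frequency share of all values under a name) and the `top_k` cut-off are assumptions made for illustration; the text leaves the exact scoring open and also allows prior knowledge to be used instead.

```python
from collections import Counter

def rank_key_attribute_names(observations, top_k=2):
    """Rank normalized attribute names by the frequency share of their values.

    `observations` is a list of (attribute_name, attribute_value) pairs
    collected before de-duplication.
    """
    pair_counts = Counter(observations)
    total = sum(pair_counts.values())
    importance = Counter()
    for (name, _value), count in pair_counts.items():
        importance[name] += count / total
    # Highest score first; alphabetical order breaks ties.
    ranked = sorted(importance.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _score in ranked[:top_k]]

observations = (
    [("brand", "A")] * 5
    + [("brand", "B")] * 3
    + [("color", "red")] * 4
    + [("gift box", "yes")]
)
key_names = rank_key_attribute_names(observations)
```

In this toy data "brand" and "color" account for 12 of the 13 observations and are selected as key attribute names.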
After the key attribute name and the attribute value corresponding to the key attribute name are determined, a key attribute knowledge graph can be constructed.
In one embodiment of the invention, before the key attribute knowledge graph is constructed according to each key attribute name and its corresponding attribute values, the attribute values can be expanded when there are too few of them. Specifically, for each key attribute name, if the number of attribute values corresponding to the key attribute name is less than a set number threshold, the attribute values are expanded according to the attribute values corresponding to the key attribute name in the external knowledge base. At present, attribute values are mainly expanded from existing industry knowledge graphs and search results; specifically, the expanded values can be obtained by web crawling.
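The threshold-based expansion can be sketched as below. The external knowledge base is modeled as a plain dictionary, which is an assumption; in practice it would be an industry knowledge graph or crawled search results, and `min_values` stands in for the set number threshold.

```python
def expand_attribute_values(key_attributes, external_kb, min_values=3):
    """Add values from an external knowledge base when a key attribute has too few.

    `external_kb` is a hypothetical mapping: attribute name -> iterable of values.
    """
    expanded = {}
    for name, values in key_attributes.items():
        values = set(values)
        if len(values) < min_values:
            values |= set(external_kb.get(name, ()))
        expanded[name] = values
    return expanded

key_attributes = {"brand": {"A"}, "color": {"red", "blue", "green"}}
external_kb = {"brand": ["A", "B", "C"], "color": ["black"]}
expanded = expand_attribute_values(key_attributes, external_kb)
```

Only "brand" is below the threshold, so only its value set is extended; "color" is left unchanged.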
In another embodiment of the present invention, before the key attribute knowledge graph is constructed according to each key attribute name and its corresponding attribute values, the attribute values may be split at a finer granularity. Specifically, for each key attribute name, if an attribute value corresponding to the key attribute name is a composite attribute value, the composite attribute value is split, and a replacement value set of the composite attribute value is constructed. Each key attribute name may have multiple attribute values, some of which may be composite. A composite attribute value is an attribute value that contains sub-attribute values and can be split into them. For example, the attribute values corresponding to the attribute name "washing machine type" include "roller pulsator washing machine", "roller washing machine" and "pulsator washing machine"; "roller pulsator washing machine" is a composite attribute value and needs to be split into "roller washing machine" and "pulsator washing machine". The invention sets a splitting rule for attribute values that determines whether an attribute value under a key attribute name is composite and how to split it. The attribute values obtained by splitting a composite attribute value form its replacement value set.
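The text does not spell out the splitting rule, so the following is only a hypothetical one for illustration: a value is treated as composite when its word set is exactly covered by two or more other values of the same attribute, each of which uses a proper subset of its words.

```python
def build_replacement_sets(values):
    """Map each composite attribute value to its replacement value set.

    Hypothetical rule: a value is composite if at least two other values
    are proper word-subsets of it and together cover all of its words.
    """
    token_sets = {v: frozenset(v.split()) for v in values}
    replacements = {}
    for value, tokens in token_sets.items():
        parts = [o for o, ot in token_sets.items() if o != value and ot < tokens]
        covered = set()
        for part in parts:
            covered |= token_sets[part]
        if len(parts) >= 2 and covered == tokens:
            replacements[value] = sorted(parts)
    return replacements

values = [
    "roller pulsator washing machine",
    "roller washing machine",
    "pulsator washing machine",
]
replacements = build_replacement_sets(values)
```

For the washing machine example this yields a single entry that maps the composite value to the two simple values. A word-based rule like this would need adapting for unsegmented Chinese text.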
In another embodiment of the present invention, before the key attribute knowledge graph is constructed according to each key attribute name and its corresponding attribute values, a stop-word library, a synonym library, a hypernym-hyponym relation library and a key attribute library may further be obtained for each key attribute name and its corresponding attribute values based on the external knowledge base, where the key attribute library includes the key attribute names, their corresponding attribute values, and the replacement value sets of composite attribute values. The synonym library of a key attribute name comprises the synonymous attribute name set corresponding to that key attribute name and the synonyms of the key attribute name provided in the knowledge base; the stop-word library, the hypernym-hyponym relation library and the synonym library of attribute values can be derived from the external knowledge base.
According to the embodiment of the invention, constructing the key attribute knowledge graph according to each key attribute name and its corresponding attribute values means constructing the key attribute knowledge graph according to the stop-word library, the synonym library, the hypernym-hyponym relation library and the key attribute library of each key attribute name and its corresponding attribute values.
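As a rough illustration of how these libraries might be held together, the sketch below bundles them in one container. The class name, field names and the dictionary-based layout are all assumptions; the patent does not prescribe a storage format for the knowledge graph.

```python
from dataclasses import dataclass, field

@dataclass
class KeyAttributeKnowledgeGraph:
    """Hypothetical container for the libraries that make up the graph."""
    key_attributes: dict = field(default_factory=dict)  # key name -> set of values
    replacements: dict = field(default_factory=dict)    # composite value -> replacement values
    synonyms: dict = field(default_factory=dict)        # variant term -> canonical term
    hypernyms: dict = field(default_factory=dict)       # hyponym -> hypernym
    stop_words: set = field(default_factory=set)        # attribute names to discard

    def canonical(self, term):
        """Map a term to its canonical form via the synonym library."""
        return self.synonyms.get(term, term)

    def is_key_attribute(self, name):
        """True if the name, after synonym normalization, is a key attribute name."""
        return self.canonical(name) in self.key_attributes

kg = KeyAttributeKnowledgeGraph(
    key_attributes={"brand": {"A", "B"}},
    synonyms={"trademark": "brand"},
    stop_words={"remark"},
)
```

With this layout, `kg.is_key_attribute("trademark")` resolves the synonym first and then checks the key attribute library.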
After the key attribute knowledge graph is constructed, the key attribute of the data can be extracted based on the key attribute knowledge graph.
Generally, the description information of the data includes both structured attribute information and unstructured attribute information. For example, the commodity description information of an e-commerce platform includes structured attribute information of the commodity, such as "Brand: A", and also includes unstructured attribute information of the commodity, such as "pure cotton material, multi-color matching".
When data key attribute extraction is performed, specifically, the following steps may be performed:
extracting structured attribute information from the description information of the data, deleting invalid attribute information from the structured attribute information based on the stop-word library to obtain first attribute information, and then extracting first key attributes from the first attribute information based on the key attribute library;
extracting unstructured attribute information from the description information of the data, and matching the unstructured attribute information based on the key attribute library to obtain second key attributes, wherein the first key attributes and the second key attributes comprise key attribute names and corresponding attribute values;
and extracting the key attributes of the data from the first key attributes and the second key attributes based on the composite attribute values and their replacement value sets in the key attribute library, the synonym library and the hypernym-hyponym relation library.
For the structured attribute information, after extracting the attribute name and the attribute value from the structured attribute information, firstly, the attribute name corresponding to the stop word can be deleted from the attribute name included in the structured attribute information according to the stop word library corresponding to the attribute name, so that the invalid attribute information is deleted to obtain first attribute information, and then, the first key attribute can be extracted from the first attribute information based on the key attribute name and the attribute value corresponding to the key attribute name in the key attribute library and the synonym library of the key attribute name.
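The structured branch can be sketched as follows. The function and parameter names are hypothetical, and one simplifying assumption is made: every value found under a key attribute name is kept, even if it is not yet listed in the key attribute library.

```python
def extract_first_key_attributes(structured, stop_words, name_synonyms, key_attributes):
    """Filter structured attributes and keep those that map to key attribute names.

    structured:     attribute name -> attribute value (from the description)
    stop_words:     attribute names to discard as invalid information
    name_synonyms:  variant attribute name -> key attribute name
    key_attributes: key attribute name -> set of known values
    """
    first = {}
    for name, value in structured.items():
        if name in stop_words:
            continue  # invalid attribute information
        canonical = name_synonyms.get(name, name)
        if canonical in key_attributes:
            first.setdefault(canonical, set()).add(value)
    return first

structured = {"trademark": "A", "remark": "ships fast", "weight": "2 kg"}
first_key_attributes = extract_first_key_attributes(
    structured,
    stop_words={"remark"},
    name_synonyms={"trademark": "brand"},
    key_attributes={"brand": {"A", "B"}},
)
```

Here "remark" is dropped as a stop word, "weight" is dropped because it is not a key attribute name, and "trademark" is normalized to "brand".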
For the unstructured attribute information, the second key attribute can be obtained by matching from the unstructured attribute information only according to the key attribute names in the key attribute library and the attribute values corresponding to the key attribute names and the synonym library of the key attribute names.
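For the unstructured branch, a minimal sketch is a case-insensitive substring scan of the text against the known attribute values. Plain substring matching is an assumption chosen for brevity; it ignores synonyms and can produce false positives for very short values, which a real matcher would have to handle.

```python
def match_unstructured(text, key_attributes):
    """Find known attribute values of key attributes inside free text.

    key_attributes: key attribute name -> set of known attribute values
    """
    lowered = text.lower()
    found = {}
    for name, values in key_attributes.items():
        hits = {value for value in values if value.lower() in lowered}
        if hits:
            found[name] = hits
    return found

key_attributes = {
    "material": {"pure cotton", "polyester"},
    "color": {"multi-color", "black"},
}
second_key_attributes = match_unstructured(
    "Pure cotton material, multi-color matching", key_attributes
)
```

No grammatical analysis is involved: the text is treated as a bag of phrases, which is the case the method is aimed at.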
Then, for the first key attributes and the second key attributes, whether a composite attribute value that needs splitting exists is determined according to the composite attribute values and their replacement value sets in the key attribute library, and the attribute value is replaced accordingly; the attribute values are normalized according to the synonym library corresponding to the attribute values, so that they are brought to a comparable word granularity; and the hypernyms and hyponyms present among the attribute values are merged according to the hypernym-hyponym relation library, deleting the hypernyms and keeping the hyponyms. In this way, the key attribute names and their attribute values can be obtained from the description information of the data.
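The three post-processing operations in this paragraph (composite value replacement, synonym normalization, and removal of broader terms in favor of narrower ones) can be sketched in one function. The dictionary layouts and the name `refine_values` are assumptions; the hypernym-hyponym (superior-inferior) relation library is modeled as a mapping from the narrower term to the broader term.

```python
def refine_values(values, replacements, value_synonyms, hypernym_of):
    """Post-process the extracted attribute values of one key attribute.

    replacements:   composite value -> list of replacement values
    value_synonyms: variant value -> canonical value
    hypernym_of:    narrower term (hyponym) -> broader term (hypernym)
    """
    # 1. Replace composite values with their replacement value sets.
    expanded = set()
    for value in values:
        expanded.update(replacements.get(value, [value]))
    # 2. Normalize values through the synonym library.
    normalized = {value_synonyms.get(v, v) for v in expanded}
    # 3. Delete a broader term when one of its narrower terms is present.
    broader_present = {hypernym_of[v] for v in normalized if v in hypernym_of}
    return normalized - broader_present

material = refine_values(
    {"cotton", "pure cotton", "100% cotton"},
    replacements={},
    value_synonyms={"100% cotton": "pure cotton"},
    hypernym_of={"pure cotton": "cotton"},
)
machine_type = refine_values(
    {"roller pulsator washing machine"},
    replacements={
        "roller pulsator washing machine": [
            "roller washing machine",
            "pulsator washing machine",
        ]
    },
    value_synonyms={},
    hypernym_of={},
)
```

In the first call the three material values collapse to the single narrower term "pure cotton"; in the second, the composite value is replaced by its two simple values.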
After the data key attributes are extracted based on the key attribute knowledge graph, the effectiveness and uniqueness of the extracted data key attributes can be analyzed so as to optimize the key attribute knowledge graph.
According to another embodiment of the invention, if the description information of the data is picture data, text recognition is performed on the picture data to obtain text description information, and the key attributes of the data are then extracted from the text description information.
The following describes a specific implementation process of the key attribute extraction method of the present invention, taking the extraction of the key attribute information of a product from the product description information of an e-commerce platform as an example.
According to the embodiment of the invention, the extraction of key commodity attributes is made markedly more effective through two main steps, and the attribute values are brought to the same comparable granularity as far as possible. First, a commodity key attribute knowledge graph is constructed, which makes explicit the key attribute names that matter most for a given kind of commodity, together with their attribute value ranges, hypernyms and hyponyms, synonyms, stop words, compound word replacement rules, and mutually exclusive words. Then, based on the constructed commodity knowledge graph, the attribute values corresponding to the key attribute names of a commodity are extracted more efficiently from its structured and unstructured information, which provides data support for subsequent commodity matching.
FIG. 2 is a diagram illustrating the process of building a key attribute knowledge-graph according to one embodiment of the present invention. As shown in fig. 2, in the embodiment of the present invention, the construction of the key attribute knowledge-graph mainly includes the following steps:
1. collecting the structured information (attribute name + attribute value) of all commodities under a given leaf category on a big data platform;
2. extracting all attribute names and corresponding attribute values from the structured information;
3. selecting key attribute names: first, synonymous attribute names are normalized by a similarity algorithm applied to their attribute value sets; second, the normalized attribute names are ranked by importance according to the proportion in which their attribute values appear in the titles and attributes, and a specified number of attribute names that can effectively distinguish commodities are selected as key attribute names;
4. merging and de-duplicating attribute values according to the normalization result of step 3 to obtain the attribute values corresponding to each key attribute name;
5. when a key attribute name obtained in step 4 has too few attribute values, introducing the attribute values corresponding to that attribute name in an external knowledge base to expand its attribute values;
6. judging, from the key attribute names and corresponding attribute values collected in the preceding steps, whether any composite attribute value needs to be split, and constructing splitting rules;
7. splitting the attribute values corresponding to the key attribute names of the commodity at a finer granularity according to the splitting rules;
8. constructing, through an external knowledge base and based on the key attribute names and attribute values obtained in step 7, a stop word library, a synonym library, and a hypernym-hyponym relation library for each key attribute name and attribute value, as well as a key attribute library, wherein the key attribute library comprises the key attribute names and corresponding attribute values, and the attribute values comprise both the attribute values before splitting and the attribute values after splitting. In addition, a mutual exclusion library of attribute values can be constructed;
9. constructing the key attribute knowledge graph based on all the data from step 8.
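The normalization of synonymous attribute names in steps 3 and 4 can be illustrated with a short sketch. The source only speaks of "a similarity algorithm of an attribute value set"; Jaccard similarity, the greedy single-pass grouping, and the threshold of 0.5 used here are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two value sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_synonymous_names(values_by_name, threshold=0.5):
    """Greedily group attribute names whose value sets are similar enough.

    Each group keeps its member names and the merged, de-duplicated value set.
    """
    groups = []
    for name, values in values_by_name.items():
        for group in groups:
            if jaccard(values, group["values"]) >= threshold:
                group["names"].append(name)
                group["values"] |= values  # merge and de-duplicate values
                break
        else:
            groups.append({"names": [name], "values": set(values)})
    return groups


values_by_name = {
    "Color": {"red", "blue", "black"},
    "Colour": {"red", "blue", "white"},
    "Material": {"cotton", "linen"},
}
groups = group_synonymous_names(values_by_name)
for g in groups:
    # the first name in each group serves as the normalized attribute name
    print(g["names"][0], sorted(g["values"]))
```

Here "Color" and "Colour" share two of four distinct values (similarity 0.5), so they fall into one group, while "Material" forms its own group.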
FIG. 3 is a diagram illustrating a data key attribute extraction process according to an embodiment of the present invention. As shown in fig. 3, attribute value information corresponding to a key attribute name in a commodity is extracted based on the key attribute knowledge graph of the commodity constructed in fig. 2. The method can be divided into two parts according to the type of the description information of the commodity: 1) structured attribute information; 2) unstructured attribute information.
1. for 1) structured attribute information, first removing, based on the stop word library, invalid attribute values under the corresponding attribute names to obtain first attribute information; then extracting a first key attribute from the first attribute information based on the synonym library and the key attribute library, wherein the first key attribute comprises a first key attribute name and a corresponding attribute value;
2. for 2) unstructured attribute information, matching against the unstructured attribute information based on the synonym library and the key attribute library to obtain a second key attribute, wherein the second key attribute comprises a second key attribute name and a corresponding attribute value;
3. splitting the composite attribute values among the attribute values obtained in steps 1 and 2 at a finer granularity based on the composite attribute values and their replacement value sets in the key attribute library;
4. merging the hypernyms and hyponyms that co-occur among the key attribute values based on the hypernym-hyponym relation library, keeping the hyponyms and deleting the hypernyms;
5. normalizing the attribute values under the key attribute names based on the synonym library so that they reach the same comparable word granularity;
6. analyzing the effectiveness and uniqueness of the extracted data key attributes so as to optimize the key attribute knowledge graph.
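Steps 3 to 5 above can be sketched as a small post-processing function. The three lookup tables are hypothetical stand-ins for the replacement value sets, the hypernym-hyponym relation library, and the synonym library, and only direct hypernyms are handled in this simplified version.

```python
def postprocess_values(values, replacements, hypernym_of, synonyms):
    """Apply splitting, hypernym removal, and synonym normalization in list order."""
    # Step 3: replace each composite value with its finer-grained parts
    split = set()
    for value in values:
        split.update(replacements.get(value, [value]))
    # Step 4: delete a hypernym when one of its hyponyms is also present
    hypernyms_present = {hypernym_of[v] for v in split if v in hypernym_of}
    merged = split - hypernyms_present
    # Step 5: map every remaining value to its canonical synonym
    return {synonyms.get(v, v) for v in merged}


values = {"cotton-linen blend", "fabric", "pure cotton"}
replacements = {"cotton-linen blend": ["cotton", "linen"]}   # composite -> replacement set
hypernym_of = {"cotton": "fabric", "linen": "fabric"}        # hyponym -> hypernym
synonyms = {"pure cotton": "cotton"}                         # variant -> canonical form

result = postprocess_values(values, replacements, hypernym_of, synonyms)
print(sorted(result))
```

The generic value "fabric" is dropped because the more specific "cotton" and "linen" are present, and "pure cotton" collapses into "cotton".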
According to the embodiment of the invention, simulation experiments on commodity matching across several categories show that the method can extract the key attributes of commodities, with the accuracy (non-vacancy rate) rising from 69% to 92% on average. Because the technical scheme of the invention mainly addresses extraction from text information, when attribute information of a commodity is contained in a picture, the key attribute names and attribute values missing from the text information need to be supplemented by an optical character recognition (OCR) text recognition method.
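The source does not define how the non-vacancy rate is computed. One plausible reading, shown here purely as an assumption, is the share of (commodity, key attribute name) slots for which at least one attribute value was extracted; the toy data below is illustrative and unrelated to the reported figures.

```python
def non_vacancy_rate(extracted, key_names):
    """Fraction of (commodity, key attribute name) slots that hold a value.

    This definition is an assumption; the source only names the metric.
    """
    total = len(extracted) * len(key_names)
    if total == 0:
        return 0.0
    filled = sum(1 for attrs in extracted for name in key_names if attrs.get(name))
    return filled / total


key_names = ["Color", "Material"]
extracted = [
    {"Color": {"red"}, "Material": {"cotton"}},  # both slots filled
    {"Color": {"blue"}},                         # Material missing
    {"Material": set()},                         # nothing usable extracted
]
rate = non_vacancy_rate(extracted, key_names)
print(f"{rate:.0%}")
```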
Fig. 4 is a schematic diagram of main modules of an apparatus for extracting data key attributes according to an embodiment of the present invention. As shown in fig. 4, the apparatus 400 for extracting data key attributes according to the embodiment of the present invention mainly includes an information extraction module 401, an attribute processing module 402, a graph construction module 403, and an attribute extraction module 404.
An information extraction module 401, configured to extract structured attribute information from description information of data, where the structured attribute information includes an attribute name and an attribute value corresponding to the attribute name;
an attribute processing module 402, configured to perform normalization processing on attribute names by performing similarity calculation on attribute values, and determine key attribute names from the attribute names after the normalization processing;
the graph construction module 403 is configured to construct a key attribute knowledge graph according to each key attribute name and the attribute value corresponding to the key attribute name;
and an attribute extraction module 404, configured to perform data key attribute extraction based on the key attribute knowledge graph.
According to an embodiment of the present invention, the attribute processing module 402 may be further configured to:
performing similarity calculation on the attribute values, and determining a group of attribute names whose attribute value similarity meets a set threshold as a synonymous attribute name set;
for each synonymous attribute name set, selecting one attribute name from the synonymous attribute name set as an attribute name after normalization processing, and combining and de-duplicating an attribute value corresponding to each attribute name in the synonymous attribute name set to obtain an attribute value corresponding to the attribute name after normalization processing;
and determining the key attribute name according to the importance of the attribute value corresponding to each attribute name after normalization processing to the data.
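The last operation above, determining key attribute names by importance, might be sketched as follows. Importance is approximated here as the share of commodity titles that mention at least one value of the attribute name; this scoring rule, the function name, and the `top_k` parameter are assumptions for illustration.

```python
def rank_key_names(values_by_name, titles, top_k=2):
    """Rank normalized attribute names by how often their values occur in titles."""
    def coverage(values):
        # share of titles containing at least one value of this attribute name
        hits = sum(any(v in title.lower() for v in values) for title in titles)
        return hits / len(titles) if titles else 0.0

    ranked = sorted(values_by_name, key=lambda n: coverage(values_by_name[n]), reverse=True)
    return ranked[:top_k]


titles = [
    "red cotton shirt",
    "blue linen shirt",
    "black cotton dress",
    "gift-wrapped white shirt",
]
values_by_name = {
    "Color": {"red", "blue", "black", "white"},
    "Material": {"cotton", "linen"},
    "Packaging": {"gift-wrapped"},
}
key_names = rank_key_names(values_by_name, titles)
print(key_names)
```

In the toy data "Color" covers every title and "Material" covers three of four, so both outrank "Packaging".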
According to another embodiment of the present invention, the apparatus 400 for extracting data key attributes may further include an attribute value expansion module (not shown in the figure) for:
before a key attribute knowledge graph is constructed according to each key attribute name and the corresponding attribute value thereof, for each key attribute name, if the number of the attribute values corresponding to the key attribute name is less than a set number threshold value, attribute value expansion is carried out according to the attribute values corresponding to the key attribute name in an external knowledge base.
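A minimal sketch of this threshold-based expansion is given below. The external knowledge base is modeled as a plain dictionary and `min_count` plays the role of the set number threshold; both are hypothetical.

```python
def expand_values(key_values, external_kb, min_count=3):
    """Expand value sets that fall below the threshold using an external knowledge base."""
    expanded = {}
    for name, values in key_values.items():
        if len(values) < min_count:
            # too few values: add those the external knowledge base lists for this name
            expanded[name] = set(values) | external_kb.get(name, set())
        else:
            expanded[name] = set(values)
    return expanded


key_values = {"Color": {"red", "blue", "black"}, "Material": {"cotton"}}
external_kb = {"Color": {"green"}, "Material": {"cotton", "linen", "silk"}}

expanded = expand_values(key_values, external_kb)
print({name: sorted(values) for name, values in expanded.items()})
```

"Color" already meets the threshold and is left unchanged, while "Material" is expanded.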
According to another embodiment of the present invention, the apparatus 400 for extracting data key attributes may further include an attribute value splitting module (not shown in the figure) configured to:
before a key attribute knowledge graph is constructed according to each key attribute name and an attribute value corresponding to the key attribute name, for each key attribute name, if the attribute value corresponding to the key attribute name is a composite attribute value, splitting the composite attribute value, and constructing a replacement value set of the composite attribute value.
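One simple way to build replacement value sets is to split composite values on common separators. The separator list below is an assumption; the source leaves the splitting rules unspecified.

```python
import re

# Assumed separators: slash, comma, plus sign, ampersand, and the word "and"
SEPARATORS = re.compile(r"\s*(?:/|,|\+|&|\band\b)\s*")


def build_replacement_sets(values):
    """Map each composite value to the list of finer-grained values it splits into."""
    replacements = {}
    for value in values:
        parts = [part for part in SEPARATORS.split(value) if part]
        if len(parts) > 1:  # only composite values get a replacement set
            replacements[value] = parts
    return replacements


replacement_sets = build_replacement_sets(["red/white", "cotton and linen", "blue"])
print(replacement_sets)
```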
According to still another embodiment of the present invention, the apparatus 400 for extracting data key attributes may further include an attribute extension module (not shown in the figure) for:
based on an external knowledge base, respectively acquiring a stop word library, a synonym library, and a hypernym-hyponym relation library for each key attribute name and its corresponding attribute values, as well as a key attribute library, wherein the key attribute library comprises the key attribute names, their corresponding attribute values, and the replacement value sets of composite attribute values.
According to an embodiment of the present invention, the graph construction module 403 may be further configured to:
construct a key attribute knowledge graph according to the stop word library, the synonym library, and the hypernym-hyponym relation library of each key attribute name and its corresponding attribute values, together with the key attribute library.
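The source does not prescribe a storage format for the key attribute knowledge graph. As one possible representation, the libraries can be held in a single container and exposed as (subject, relation, object) triples; the class name, field names, and relation labels below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class KeyAttributeGraph:
    """Hypothetical container bundling the libraries that make up the graph."""
    key_library: dict = field(default_factory=dict)    # key attribute name -> set of values
    replacements: dict = field(default_factory=dict)   # composite value -> list of parts
    stop_words: set = field(default_factory=set)       # attribute names or values to ignore
    synonyms: dict = field(default_factory=dict)       # variant -> canonical form
    hypernym_of: dict = field(default_factory=dict)    # hyponym -> hypernym

    def triples(self):
        """Yield the graph as (subject, relation, object) edges."""
        for name, values in self.key_library.items():
            for value in sorted(values):
                yield (name, "has_value", value)
        for composite, parts in self.replacements.items():
            for part in parts:
                yield (composite, "splits_into", part)
        for variant, canonical in self.synonyms.items():
            yield (variant, "synonym_of", canonical)
        for hyponym, hypernym in self.hypernym_of.items():
            yield (hyponym, "hyponym_of", hypernym)


graph = KeyAttributeGraph(
    key_library={"Material": {"cotton", "cotton-linen blend"}},
    replacements={"cotton-linen blend": ["cotton", "linen"]},
    synonyms={"pure cotton": "cotton"},
    hypernym_of={"cotton": "fabric"},
)
edges = list(graph.triples())
for edge in edges:
    print(edge)
```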
According to another embodiment of the present invention, the attribute extraction module 404 may be further configured to:
extracting structured attribute information from description information of data, deleting invalid attribute information from the structured attribute information based on the stop word library to obtain first attribute information, and extracting first key attributes from the first attribute information based on the synonym library and the key attribute library;
extracting unstructured attribute information from description information of data, and matching the unstructured attribute information based on the synonym library and the key attribute library to obtain a second key attribute, wherein the first key attribute and the second key attribute comprise key attribute names and corresponding attribute values;
and extracting data key attributes from the first key attributes and the second key attributes based on the composite attribute values and their replacement value sets in the key attribute library, the synonym library, and the hypernym-hyponym relation library.
According to another embodiment of the present invention, the apparatus 400 for extracting data key attributes may further include a graph optimization module (not shown in the figure) for:
after extracting the key attributes of the data based on the key attribute knowledge graph, carrying out effectiveness and uniqueness analysis on the extracted key attributes of the data so as to optimize the key attribute knowledge graph.
According to another embodiment of the present invention, if the description information of the data is picture data, text recognition is performed on the picture data to obtain text description information, and the data key attributes are then extracted from the text description information.
According to the technical scheme of the embodiment of the invention, structured attribute information is extracted from the description information of the data, wherein the structured attribute information comprises attribute names and the attribute values corresponding to the attribute names; the attribute names are normalized by performing similarity calculation on the attribute values, and key attribute names are determined from the normalized attribute names; a key attribute knowledge graph is constructed according to each key attribute name and its corresponding attribute values; and data key attributes are extracted based on the key attribute knowledge graph. Because the key attribute knowledge graph is built by processing structured attribute information, key attributes can be extracted efficiently from unstructured information without relying on grammatical structures or labeled data, which greatly improves the efficiency and accuracy of data key attribute extraction.
Fig. 5 shows an exemplary system architecture 500 to which the data key attribute extraction method or the data key attribute extraction apparatus of the embodiments of the present invention can be applied.
As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 serves to provide a medium for communication links between the terminal devices 501, 502, 503 and the server 505. Network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 501, 502, 503 to interact with a server 505 over a network 504 to receive or send messages or the like. The terminal devices 501, 502, 503 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 501, 502, 503 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 505 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 501, 502, 503. The background management server may analyze and otherwise process received data such as a product information query request, and feed back a processing result (for example only, target push information or product information) to the terminal device.
It should be noted that the method for extracting the data key attribute provided by the embodiment of the present invention is generally executed by the server 505, and accordingly, the device for extracting the data key attribute is generally disposed in the server 505.
It should be understood that the number of terminal devices, networks, and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for implementing a terminal device or server of an embodiment of the invention is shown. The terminal device or the server shown in fig. 6 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed into the storage section 608 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present invention may be implemented by software or by hardware. The described units or modules may also be provided in a processor, which may be described as: a processor including an information extraction module, an attribute processing module, a graph construction module, and an attribute extraction module. The names of these units or modules do not, in some cases, constitute a limitation on the units or modules themselves; for example, the information extraction module may also be described as a "module for extracting structured attribute information from description information of data".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform: extracting structured attribute information from description information of data, wherein the structured attribute information comprises attribute names and attribute values corresponding to the attribute names; performing normalization processing on the attribute names by performing similarity calculation on the attribute values, and determining key attribute names from the attribute names after the normalization processing; constructing a key attribute knowledge graph according to each key attribute name and the corresponding attribute value thereof; and extracting the key attributes of the data based on the key attribute knowledge graph.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for extracting key attributes of data is characterized by comprising the following steps:
extracting structured attribute information from description information of data, wherein the structured attribute information comprises attribute names and attribute values corresponding to the attribute names;
performing normalization processing on the attribute names by performing similarity calculation on the attribute values, and determining key attribute names from the attribute names after the normalization processing;
constructing a key attribute knowledge graph according to each key attribute name and the corresponding attribute value thereof;
and extracting the key attributes of the data based on the key attribute knowledge graph.
2. The method of claim 1, wherein normalizing the attribute names by performing similarity calculation on the attribute values, and determining key attribute names from the normalized attribute names, comprises:
performing similarity calculation on the attribute values, and determining a group of attribute names whose attribute value similarity meets a set threshold as a synonymous attribute name set;
for each synonymous attribute name set, selecting one attribute name from the synonymous attribute name set as an attribute name after normalization processing, and combining and de-duplicating an attribute value corresponding to each attribute name in the synonymous attribute name set to obtain an attribute value corresponding to the attribute name after normalization processing;
and determining the key attribute name according to the importance of the attribute value corresponding to each attribute name after normalization processing to the data.
3. The method of claim 1, wherein before constructing the key attribute knowledge-graph based on each key attribute name and its corresponding attribute value, further comprising:
and for each key attribute name, if the number of the attribute values corresponding to the key attribute name is less than a set number threshold, performing attribute value expansion according to the attribute values corresponding to the key attribute name in an external knowledge base.
4. The method of claim 1, wherein before constructing the key attribute knowledge-graph based on each key attribute name and its corresponding attribute value, further comprising:
for each key attribute name, if the attribute value corresponding to the key attribute name is a composite attribute value, splitting the composite attribute value, and constructing a replacement value set of the composite attribute value.
5. The method of claim 1, wherein before constructing the key attribute knowledge-graph based on each key attribute name and its corresponding attribute value, further comprising:
based on an external knowledge base, respectively acquiring a stop word library, a synonym library, and a hypernym-hyponym relation library for each key attribute name and its corresponding attribute values, as well as a key attribute library, wherein the key attribute library comprises the key attribute names, their corresponding attribute values, and the replacement value sets of composite attribute values.
6. The method of claim 5, wherein constructing a key attribute knowledge-graph from each key attribute name and its corresponding attribute value comprises:
constructing a key attribute knowledge graph according to the stop word library, the synonym library, and the hypernym-hyponym relation library of each key attribute name and its corresponding attribute values, together with the key attribute library.
7. The method of claim 6, wherein performing data key attribute extraction based on the key attribute knowledge-graph comprises:
extracting structured attribute information from description information of data, deleting invalid attribute information from the structured attribute information based on the stop word library to obtain first attribute information, and extracting first key attributes from the first attribute information based on the synonym library and the key attribute library;
extracting unstructured attribute information from description information of data, and matching the unstructured attribute information based on the synonym library and the key attribute library to obtain a second key attribute, wherein the first key attribute and the second key attribute comprise key attribute names and corresponding attribute values;
and extracting data key attributes from the first key attributes and the second key attributes based on the composite attribute values and their replacement value sets in the key attribute library, the synonym library, and the hypernym-hyponym relation library.
8. The method of claim 1, after performing data key attribute extraction based on the key attribute knowledge-graph, further comprising:
and analyzing the effectiveness and uniqueness of the extracted key attributes of the data so as to optimize the key attribute knowledge graph.
9. The method according to claim 1, wherein if the description information of the data is picture data, text recognition is performed on the picture data to obtain text description information, and then data key attributes are extracted from the text description information.
10. An apparatus for extracting key attributes of data, comprising:
the information extraction module is used for extracting structured attribute information from the description information of the data, wherein the structured attribute information comprises an attribute name and an attribute value corresponding to the attribute name;
the attribute processing module is used for carrying out normalization processing on the attribute names by carrying out similarity calculation on the attribute values and determining key attribute names from the attribute names after the normalization processing;
the graph construction module is used for constructing a key attribute knowledge graph according to each key attribute name and the corresponding attribute value thereof;
and the attribute extraction module is used for extracting the key attributes of the data based on the key attribute knowledge graph.
11. An electronic device for extracting key attributes of data, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
12. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-9.
CN202010312666.6A 2020-04-20 2020-04-20 Method and device for extracting key attributes of data Pending CN113535968A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010312666.6A CN113535968A (en) 2020-04-20 2020-04-20 Method and device for extracting key attributes of data

Publications (1)

Publication Number Publication Date
CN113535968A true CN113535968A (en) 2021-10-22

Family

ID=78123591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010312666.6A Pending CN113535968A (en) 2020-04-20 2020-04-20 Method and device for extracting key attributes of data

Country Status (1)

Country Link
CN (1) CN113535968A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090204588A1 (en) * 2008-02-08 2009-08-13 Fujitsu Limited Method and apparatus for determining key attribute items
CN102495892A (en) * 2011-12-09 2012-06-13 北京大学 Webpage information extraction method
KR20130109601A (en) * 2012-03-28 2013-10-08 (주)탑쿼드란트코리아 Decision method of ontology instance similarity and ontology system using the method
CN105574098A (en) * 2015-12-11 2016-05-11 百度在线网络技术(北京)有限公司 Knowledge graph generation method and device and entity comparing method and device
CN105956052A (en) * 2016-04-27 2016-09-21 青岛海尔软件有限公司 Building method of knowledge map based on vertical field
CN108268581A (en) * 2017-07-14 2018-07-10 广东神马搜索科技有限公司 The construction method and device of knowledge mapping
CN109189942A (en) * 2018-09-12 2019-01-11 山东大学 A kind of construction method and device of patent data knowledge mapping
CN110727741A (en) * 2019-09-29 2020-01-24 全球能源互联网研究院有限公司 Knowledge graph construction method and system of power system
US20200089800A1 (en) * 2018-09-13 2020-03-19 Sap Se Normalization of unstructured catalog data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
丰景春; 张昕; 胡正人: "水利水电工程索赔属性库构建方法的研究", 水力发电学报, no. 01, 25 February 2013 (2013-02-25), pages 301-308 *
侯博议; 陈群; 杨婧颖; 李战怀: "无监督的中文商品属性结构化方法", 软件学报, no. 02, 15 February 2017 (2017-02-15), pages 81-96 *
赵龙文; 黄跃萍: "基于属性值和上下文的开放数据相同属性识别", 情报理论与实践, no. 11, 30 November 2017 (2017-11-30), pages 134-138 *

Similar Documents

Publication Publication Date Title
CN113590645B (en) Searching method, searching device, electronic equipment and storage medium
CN107679119B (en) Method and device for generating brand derivative words
US20210374195A1 (en) Information processing method, electronic device and storage medium
CN110020312B (en) Method and device for extracting webpage text
CN107609192A (en) The supplement searching method and device of a kind of search engine
CN113660541A (en) News video abstract generation method and device
CN114579104A (en) Data analysis scene generation method, device, equipment and storage medium
CN111753029A (en) Entity relationship extraction method and device
CN114064925A (en) Knowledge graph construction method, data query method, device, equipment and medium
CN113722600A (en) Data query method, device, equipment and product applied to big data
CN112989235A (en) Knowledge base-based internal link construction method, device, equipment and storage medium
CN117171296A (en) Information acquisition method and device and electronic equipment
CN116597443A (en) Material tag processing method and device, electronic equipment and medium
CN112989190B (en) Commodity mounting method and device, electronic equipment and storage medium
CN112887426B (en) Information stream pushing method and device, electronic equipment and storage medium
CN114461748A (en) Label extraction method, device, storage medium and electronic equipment
CN113535968A (en) Method and device for extracting key attributes of data
CN114491232A (en) Information query method and device, electronic equipment and storage medium
CN114218431A (en) Video searching method and device, electronic equipment and storage medium
CN112528644A (en) Entity mounting method, device, equipment and storage medium
CN113268987B (en) Entity name recognition method and device, electronic equipment and storage medium
CN110851438A (en) Database index optimization suggestion and verification method and device
CN113377922B (en) Method, device, electronic equipment and medium for matching information
US10579696B2 (en) Save session storage space by identifying similar contents and computing difference
CN117216398A (en) Enterprise recommendation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination