CN113360724A - Method and device for determining property value of article - Google Patents

Info

Publication number
CN113360724A
Authority
CN
China
Prior art keywords
sample
item
attribute
title
attribute value
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110608525.3A
Other languages
Chinese (zh)
Inventor
王奕磊
王刚
佘志东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention discloses a method and a device for determining an item attribute value, and relates to the technical field of computers. One embodiment of the method comprises: acquiring the item category, the item attribute name and the item title of a target item; and inputting the item category, the item attribute name and the item title into an attribute extraction model to obtain an item attribute value of the target item, wherein the item attribute value corresponds to the item attribute name. The attribute extraction model comprises an input layer, a coding layer and an output layer: the input parameter of the input layer is a character string spliced from the item category, the item attribute name and the item title; the coding layer encodes the character string into a plurality of word vectors; and the output layer determines, based on the word vectors, the labeling information of the item attribute name in the item title. This embodiment enables accurate determination of item attribute values.

Description

Method and device for determining property value of article
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for determining an article attribute value.
Background
On an e-commerce platform, the accuracy of item attribute values directly or indirectly affects specific business scenarios such as substituting out-of-stock items and recommending similar items. The quality of existing item attribute data is often poor, which mainly shows up as erroneous attribute values, non-standard attribute values, missing attribute values and the like. However, the prior art provides no method for accurately determining the attribute value of an item.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for determining an article attribute value, which can accurately determine the article attribute value.
In a first aspect, an embodiment of the present invention provides a method for determining an item attribute value, including:
acquiring the item category, the item attribute name and the item title of a target item;
inputting the item category, the item attribute name and the item title into an attribute extraction model to obtain an item attribute value of the target item, wherein the item attribute value corresponds to the item attribute name; the attribute extraction model comprises an input layer, a coding layer and an output layer, wherein the input parameter of the input layer is a character string spliced from the item category, the item attribute name and the item title, the coding layer is used for encoding the character string into a plurality of word vectors, and the output layer determines the labeling information of the item attribute name in the item title based on the word vectors.
Optionally, before determining the item attribute value of the target item from the item title, the method further includes:
obtaining sample information for a plurality of samples, the sample information comprising: sample category, sample title, sample attribute name and its corresponding sample attribute value;
generating sample data according to the sample information;
and training a machine learning model by using a plurality of sample data to generate an attribute extraction model.
Optionally, the generating sample data according to the plurality of sample information includes:
acquiring sample information of a current sample;
according to the sample attribute value of the current sample, carrying out labeling processing on the sample title of the current sample;
generating sample data of the current sample, wherein the sample data comprises: sample category, sample title and sample attribute name after labeling processing.
Optionally, the labeling, according to the sample attribute value of the current sample, the sample title of the current sample, includes:
performing title preprocessing on a sample title of the current sample;
performing attribute preprocessing on the sample attribute value of the current sample;
matching the processed sample attribute value in the processed sample title to obtain the labeling information corresponding to the sample attribute name of the current sample, wherein the labeling information comprises: a start position and an end position.
Optionally, the labeling, according to the sample attribute value of the current sample, the sample title of the current sample, includes:
according to the sample attribute value of the current sample, carrying out labeling processing on the sample title of the current sample according to a preset labeling strategy, wherein the preset labeling strategy comprises at least one of the following strategies: a specification attribute hard matching strategy, a maximum substring matching strategy, a regular expression strategy and a soft matching strategy.
Optionally, the generating sample data according to the plurality of sample information includes:
constructing positive sample data and negative sample data according to the sample information, wherein the positive sample data are sample data in which the item title contains the attribute value corresponding to the item attribute name, and the negative sample data are sample data in which the item title does not contain the attribute value corresponding to the item attribute name;
the training of the machine learning model by using a plurality of sample data to generate the attribute extraction model comprises:
and training a machine learning model by using the plurality of positive sample data and the plurality of negative sample data to generate the attribute extraction model.
Optionally, the machine learning model is a model constructed by using an MRC (Machine Reading Comprehension) model and a pointer network model.
In a second aspect, an embodiment of the present invention provides an apparatus for determining an article attribute value, including:
the information acquisition module is used for acquiring the item category, the item attribute name and the item title of the target item;
an attribute value determination module, configured to input the item category, the item attribute name and the item title into an attribute extraction model to obtain an item attribute value of the target item, where the item attribute value corresponds to the item attribute name; the attribute extraction model comprises an input layer, a coding layer and an output layer, wherein the input parameter of the input layer is a character string spliced from the item category, the item attribute name and the item title, the coding layer is used for encoding the character string into a plurality of word vectors, and the output layer determines the labeling information of the item attribute name in the item title based on the word vectors.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the method of any of the embodiments described above.
In a fourth aspect, an embodiment of the present invention provides a computer-readable medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method of any one of the above embodiments.
One embodiment of the above invention has the following advantages or benefits: item titles contain a large number of reliable and rapidly updated attributes, so attribute values can be determined more accurately from the item title. The category represents the class to which an item belongs; classifying different items into a plurality of categories distinguishes and systematizes them clearly, and taking the item category into account when determining the attribute value further ensures the correctness of the obtained attribute value. In addition, the attribute extraction model provided by the embodiment of the invention, which comprises an input layer, a coding layer and an output layer, also achieves a good attribute recognition effect.
Further effects of the above optional implementations will be described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram illustrating a flow of a method for determining an item attribute value according to an embodiment of the present invention;
FIG. 2 is a block diagram of a framework for determining values of attributes of an item according to one embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a flow of a method for generating an attribute extraction model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an attribute extraction model provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for determining an article attribute value according to an embodiment of the present invention;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The extraction of item attribute values can be realized by the following two techniques. One is NER (Named Entity Recognition) sequence labeling: each sequence position is labeled with a tag (item attribute name) using the BIO scheme and decoded using an MLP (Multi-Layer Perceptron) or CRF (Conditional Random Field) model. The other is MRC (Machine Reading Comprehension)-QA (Question Answering) with pointer labeling: a query (question) is constructed to indicate the type of entity (item attribute name) to be extracted, which also introduces prior semantic knowledge, and then the start position and the end position of each entity fragment (item attribute value) are labeled.
The extraction of item attribute values faces two problems that hinder practical deployment. The first is scale: attribute entity types are numerous, each first-level category has a large number of attribute names, and existing schemes cannot be expanded to the thousands of attributes required in an actual scenario. Existing schemes build a separate model for a single attribute or a single third-level category, and are not suitable for a large-scale attribute extraction system. The second is generalization: existing techniques are not suitable for extracting new attributes and new attribute values. New products are updated and iterated quickly, so the model needs to be able to discover new attribute types and extract new attribute values.
The two techniques above for extracting item attribute values have the following shortcomings: new attributes are difficult to add; the number of entities has a certain upper limit; the CRF matrix parameters grow greatly, which makes training difficult; the coding capability of the model is weak; and identical attribute names or attribute values belonging to different categories of items are sometimes misjudged.
To take both scale and extensibility into account, the embodiment of the present invention provides an attribute extraction scheme that is highly extensible to new attributes and suitable for various item categories. FIG. 1 is a schematic diagram of a flow of a method for determining an item attribute value according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
Step 101: acquiring the item category, the item attribute name and the item title of the target item.
The item category may be used to characterize the class to which the item belongs. When managing item data, an e-commerce platform generally needs to classify the items, that is, to select suitable basic characteristics of the items as classification marks and divide the items successively into a number of narrower subsets with more consistent characteristics (namely, categories), such as major categories, medium categories, minor categories and fine categories, down to varieties and sub-varieties, so that all items are clearly distinguished and systematized.
An item attribute generally refers to a specific characteristic of the item itself. The item attribute name is the name of the item attribute, and the item attribute value is the specific value of the item attribute. For example, the article attribute name is fabric, and the article attribute value corresponding to the article attribute name is pure cotton.
Step 102: inputting the item category, the item attribute name and the item title into an attribute extraction model to obtain an item attribute value of the target item, wherein the item attribute value corresponds to the item attribute name.
The attribute extraction model comprises an input layer, a coding layer and an output layer, wherein input parameters of the input layer are character strings spliced by article categories, article attribute names and article titles, the coding layer is used for coding the character strings into a plurality of word vectors, and the output layer determines the labeling information of the article attribute names in the article titles on the basis of the word vectors.
The attribute extraction model can be obtained by training any of various machine learning models, such as a decision tree, logistic regression, a Bayesian model, a neural network, a random forest or a support vector machine, in which case the attribute value corresponding to the current attribute name is determined from the plurality of word segments by the classification model. The machine learning model may also be an NER model, an MLP model, an MRC model or the like, which determines the attribute value corresponding to the current attribute name from the plurality of word segments based on semantic analysis.
Specific characters can be inserted among the item categories, the item attribute names and the item titles so as to splice into character strings. The coding layer of the model firstly carries out word segmentation processing on the input character string object and then vectorizes each word segmentation to obtain a plurality of word vectors. The output layer determines labeling information of the item attribute name in the item title based on the plurality of word vectors, wherein the labeling information comprises: a start position and an end position. The starting position represents the starting position of the attribute value matched by the item attribute name in the item title, and the ending position represents the ending position of the attribute value matched by the item attribute name in the item title.
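The splicing step described above can be sketched as follows. This is a minimal illustration only: the `[CLS]` and `[SEP]` markers are an assumption borrowed from BERT-style models, the function names `build_input` and `extract_value` are hypothetical, and positions are counted per character with an exclusive end, whereas a real model would work on tokens.

```python
def build_input(category, attribute_name, title, cls="[CLS]", sep="[SEP]"):
    """Splice category, attribute name and title into one input string.

    Returns the spliced string and the offset at which the title begins,
    so that predicted positions can be mapped back onto the title.
    The marker strings are assumptions, not taken from the patent text.
    """
    prefix = f"{cls}{category}{sep}{attribute_name}{sep}"
    return prefix + title + sep, len(prefix)


def extract_value(title, start, end):
    """Read the attribute value from the title given labeling information.

    `start` is inclusive and `end` is exclusive (an assumed convention).
    """
    return title[start:end]


spliced, title_offset = build_input("T-shirts", "fabric", "pure cotton tee")
print(spliced)
print(extract_value("pure cotton tee", 0, 11))
```

Keeping the title offset alongside the spliced string is one simple way to translate positions predicted over the whole sequence back into positions inside the title.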
It should be noted that multiple category levels, such as first-level, second-level and third-level categories, may be set in the system. The item category referred to in the embodiments of the present invention may be a category of any level. The lower (finer) the level of the item category input to the attribute extraction model, the higher the accuracy of the model, but also the more complex the model and the longer the training time.
In the embodiment of the invention, a large number of reliable and rapidly-updatable attributes exist in the title of the article, and the attribute value can be more accurately determined from the title of the article. In addition, the category can be used for representing the category to which the article belongs, different articles can be clearly distinguished and systematized by classifying the different articles into a plurality of categories, and when the attribute value is determined, the accuracy of the obtained attribute value can be further ensured by considering the article category.
In an embodiment of the present invention, before determining the item attribute value of the target item from the item title, the method further includes: obtaining sample information for a plurality of samples, the sample information comprising: sample category, sample title, sample attribute name and its corresponding sample attribute value; generating sample data according to the sample information; and training the machine learning model by using a plurality of sample data to generate an attribute extraction model.
An attribute extraction model is trained with a plurality of samples and is then used to obtain item attribute values. FIG. 2 is a block diagram of a framework for determining values of attributes of an item according to an embodiment of the present invention. As shown in FIG. 2, A in the sample represents the attribute name (and further information), X represents the item title, and Y represents the position of the attribute value. Given the inputs A and X, the model outputs a prediction of Y, where Y consists of S and E, S representing the start position of the attribute value in the item title and E representing the end position of the attribute value in the item title.
Specifically, word segmentation is performed on the item title, each word segment is then vectorized, and the item attribute value is extracted from the vectorized word segments by using a machine learning model. The machine learning model may be any of various models, such as a decision tree, logistic regression, a Bayesian model, a neural network, a random forest or a support vector machine, in which case the attribute value corresponding to the current attribute name is determined from the plurality of word segments by the classification model. The machine learning model may also be an NER model, an MLP model, an MRC model or the like, which determines the attribute value corresponding to the current attribute name from the plurality of word segments based on semantic analysis.
FIG. 3 is a schematic diagram of a flow of an attribute extraction model generation method according to an embodiment of the present invention. As shown in FIG. 3, the method includes:
Step 301: obtaining sample information of a current sample, wherein the sample information comprises: sample category, sample title, sample attribute name and its corresponding sample attribute value.
The information such as the sample category, the sample title, the sample attribute name and the corresponding sample attribute value of the current sample can be directly extracted from the system, and also can be crawled from a website of an e-commerce platform by utilizing a crawler program.
Step 302: performing labeling processing on the sample title of the current sample according to the sample attribute value of the current sample.
The labeling processing marks the sample attribute value in the sample title of the current sample. If the sample title does not contain the sample attribute value, either no labeling processing is performed on the sample title or the sample is labeled as a negative sample.
Step 303: generating sample data of a current sample, wherein the sample data comprises: sample category, sample title and sample attribute name after labeling processing.
Step 304: training the machine learning model by using a plurality of sample data to generate an attribute extraction model.
The machine learning model may be any of various models, such as a decision tree, logistic regression, a Bayesian model, a neural network, a random forest or a support vector machine, in which case the attribute value corresponding to the current attribute name is determined from the plurality of word segments by the classification model. The machine learning model may also be an NER model, an MLP model, an MRC model or the like, which determines the attribute value corresponding to the current attribute name from the plurality of word segments based on semantic analysis.
In the embodiment of the invention, the sample information is acquired, the sample data is constructed, and the machine learning model is then trained with the sample data to generate the attribute extraction model. It should be noted that although the attributes of items differ greatly between categories, different categories often share the same attribute names. Separate attribute extraction models can therefore be constructed for the same category or for similar categories so as to improve the accuracy of attribute determination.
In an embodiment of the present invention, labeling a sample title of a current sample according to a sample attribute value of the current sample includes: performing title preprocessing on a sample title of a current sample; performing attribute preprocessing on a sample attribute value of a current sample; matching the processed sample attribute value in the processed sample title to obtain labeling information corresponding to the sample attribute name of the current sample, wherein the labeling information comprises: a start position and an end position.
Attribute preprocessing is performed on the sample attribute value of the current sample to improve the accuracy of labeling. For example, special preprocessing may be performed for certain specific attributes, such as model and color. The attribute value of a model must satisfy certain rules, for example it must not contain Chinese characters or spaces, so the attribute preprocessing may include removing Chinese characters and spaces from the model attribute value. As another example, a batch of seed words for common colors is collected and represented as vectors by a Sentence-BERT model; if the vector of the current color attribute value is close to the center of the color vectors, the value is determined to be a legal color attribute value, and otherwise it is determined to be a non-color word.
Title preprocessing is performed on the sample title of the current sample to improve the accuracy of labeling. The title preprocessing may include removing irrelevant characters, such as promotional activity information, from the item title.
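The two attribute preprocessing examples can be sketched as below. This is an illustration under stated assumptions: the function names, the CJK character range, the cosine-similarity measure and the threshold of 0.8 are all hypothetical choices, and the toy two-dimensional vectors merely stand in for embeddings that would come from a sentence-embedding model.

```python
import math
import re


def clean_model_value(value):
    """Remove Chinese characters and whitespace from a model attribute value."""
    return re.sub(r"[\u4e00-\u9fff\s]+", "", value)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def is_color_word(vector, seed_vectors, threshold=0.8):
    """Accept a value whose vector lies close to the centre of the seed vectors.

    The threshold is an assumed hyperparameter, not a value from the patent.
    """
    return cosine(vector, centroid(seed_vectors)) >= threshold


print(clean_model_value("华为 P40 Pro"))
seeds = [[1.0, 0.0], [0.8, 0.2]]  # toy stand-ins for colour seed embeddings
print(is_color_word([1.0, 0.1], seeds), is_color_word([0.0, 1.0], seeds))
```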
In an embodiment of the present invention, labeling a sample title of a current sample according to a sample attribute value of the current sample includes: according to the sample attribute value of the current sample, carrying out labeling processing on the sample title of the current sample according to a preset labeling strategy, wherein the preset labeling strategy comprises at least one of the following: a specification attribute hard matching strategy, a maximum substring matching strategy, a regular expression strategy and a soft matching strategy.
The training samples are labeled automatically using rule-based and distant-supervision labeling strategies. The specification attribute hard matching strategy takes the attribute value corresponding to an item attribute directly and searches for it in the item title; if the matched position has already been labeled, the attribute name that occurs more frequently is retained for that position.
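A minimal sketch of hard matching with the frequency-based conflict rule might look like this. The function name `hard_match`, the `name_frequency` dictionary and the tie-breaking rule (the earlier label wins on equal frequency) are assumptions made for illustration; positions are per character with an exclusive end.

```python
def hard_match(title, attributes, name_frequency):
    """Label attribute values found verbatim in the title.

    attributes:      dict mapping attribute name -> attribute value
    name_frequency:  dict mapping attribute name -> how often the name occurs
                     (hypothetical statistic used to resolve conflicts)
    Returns a dict mapping attribute name -> (start, end), end exclusive.
    """
    labels = {}  # attribute name -> (start, end)
    owner = {}   # character position -> attribute name that labeled it
    for name, value in attributes.items():
        start = title.find(value)
        if not value or start < 0:
            continue  # value is empty or does not occur in the title
        end = start + len(value)
        rivals = {owner[i] for i in range(start, end) if i in owner}
        # Keep the existing label if any rival name is at least as frequent.
        if any(name_frequency.get(r, 0) >= name_frequency.get(name, 0) for r in rivals):
            continue
        # Otherwise drop the less frequent rivals and take over the position.
        for rival in rivals:
            r_start, r_end = labels.pop(rival)
            for i in range(r_start, r_end):
                owner.pop(i, None)
        labels[name] = (start, end)
        for i in range(start, end):
            owner[i] = name
    return labels


title = "pure cotton shirt black"
attrs = {"fabric": "pure cotton", "material": "cotton", "color": "black"}
freq = {"fabric": 100, "material": 10, "color": 50}
print(hard_match(title, attrs, freq))
```

In the usage example, "material" overlaps with the already labeled "fabric" span and occurs less often, so it is skipped.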
The maximum substring matching strategy matches a plurality of candidate attribute values from the item title for an attribute name and determines the longest one as the attribute value corresponding to the attribute name. A label is applied if the length of the longest substring shared by the attribute value and the item title exceeds a certain proportion of the length of the attribute value. Merchants may repeatedly pile up similar (or identical) words in the item title, and requiring the match to exceed a certain proportion ensures the accuracy of the attribute values. The role of this strategy is to match attribute values to fragments in the item title as far as possible. For example, if the attribute name is a size attribute and the item title contains the character string "xxl", the attribute values that can be matched include "l", "xl" and "xxl"; based on maximum substring matching, the finally matched item attribute value is determined to be "xxl".
As another example, the attribute pair "cpu model-5g kirin 990" exists for the item title "honor 30pro 50x telephoto kirin 990 5g 40-million-pixel ultra-sensitive photography 3200w beauty selfie gaming phone all-network edition 8gb+128gb streamer magic mirror"; direct matching fails, but the attribute value "kirin 990" can be recalled through substring matching.
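The two aspects of maximum substring matching can be sketched as follows, using Python's standard `difflib` to find the longest common substring. The function names, the 0.6 ratio threshold and the shortened title string are illustrative assumptions; positions are per character with an exclusive end.

```python
from difflib import SequenceMatcher


def pick_longest_value(title, candidates):
    """Among candidate values that occur in the title, return the longest."""
    found = [c for c in candidates if c and c in title]
    return max(found, key=len) if found else None


def max_substring_label(title, value, min_ratio=0.6):
    """Label the longest substring shared by the attribute value and the title.

    The label is kept only if the shared substring covers at least
    `min_ratio` of the attribute value (an assumed threshold).
    Returns (start, end) in the title, end exclusive, or None.
    """
    if not value:
        return None
    match = SequenceMatcher(None, value, title, autojunk=False).find_longest_match(
        0, len(value), 0, len(title)
    )
    start, end = match.b, match.b + match.size
    # Trim surrounding whitespace so the label covers only visible characters.
    while start < end and title[start].isspace():
        start += 1
    while end > start and title[end - 1].isspace():
        end -= 1
    if (end - start) / len(value) < min_ratio:
        return None
    return start, end


title = "honor 30pro kirin 990 5g 8gb+128gb"  # shortened, illustrative title
span = max_substring_label(title, "5g kirin 990")
print(span, title[span[0]:span[1]])
print(pick_longest_value("cotton t-shirt xxl", ["l", "xl", "xxl"]))
```

Here the full value "5g kirin 990" does not occur verbatim in the title, but the shared fragment covers most of the value and is therefore labeled.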
The regular expression strategy constructs general regular expressions for attribute value extraction through template mining, and matches attribute values from the item title according to these regular expressions. Suitable attributes include screen size, clothing size and the like. With properly summarized rules, some numeric attributes can be extracted from the title accurately. For example, the character string "xxl" extracted from the title using a regular expression can be determined to be the attribute value corresponding to the clothing size. For the item title "ZTE Axon 30Pro 64-million-pixel dual main camera 120HZ screen 8GB+128GB obsidian black Snapdragon 888 55W fast charging A30Pro photography gaming 5G mobile phone", the running memory and storage capacity attributes of a mobile phone often appear together, so they can be accurately extracted and labeled by regular expressions and then generalized by the model.
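As a hedged illustration of the regular expression strategy, the sketch below extracts a memory-plus-storage pair and a clothing size. Both patterns and both function names are assumptions written for this example; the patent does not disclose its actual expressions.

```python
import re

# Assumed pattern: "<digits>GB + <digits>GB", tolerant of spaces and case.
MEMORY_STORAGE = re.compile(r"(\d+)\s*gb\s*\+\s*(\d+)\s*gb", re.IGNORECASE)
# Assumed pattern: a standalone size token such as s, m, l, xl, xxl, xxxl.
CLOTHING_SIZE = re.compile(r"(?<![a-z0-9])(x{0,3}[sl]|m)(?![a-z0-9])", re.IGNORECASE)


def extract_memory_storage(title):
    """Return running memory, storage capacity and the matched span, or None."""
    m = MEMORY_STORAGE.search(title)
    if m is None:
        return None
    return {
        "running memory": m.group(1) + "GB",
        "storage capacity": m.group(2) + "GB",
        "span": m.span(),  # (start, end) in the title, end exclusive
    }


def extract_size(title):
    """Return (size, (start, end)) for the first standalone size token, or None."""
    m = CLOTHING_SIZE.search(title)
    return (m.group(1).lower(), m.span(1)) if m else None


phone = "ZTE Axon 30Pro 120HZ screen 8GB +128GB obsidian black 5G mobile phone"
print(extract_memory_storage(phone))
print(extract_size("cotton t-shirt xxl black"))
```

The lookbehind and lookahead in the size pattern keep letters inside ordinary words, such as the "l" in "black", from being mistaken for a size.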
The soft matching strategy sets a default value for an attribute, or predicts a value for the target attribute name from other known attribute values of the item, and fills in the missing attribute value with the default or predicted value when the item title lacks the attribute value. Attribute values that are bound to only one attribute are collected, either by counting the attribute values of specification attributes or by manual evaluation, and are labeled in reverse to reduce missing labels. For example, in the item title "keklel Xiaomi 10 phone case Xiaomi 10 protective cover newly upgraded full-wrap lens liquid silicone protective case skin-feel drop-resistant ultra-thin soft case black", the "wrapping degree" attribute is missing, but because the attribute pair "wrapping degree-full wrap" occurs with high frequency in other SKUs (Stock Keeping Units), "full wrap" can be directly labeled as the default attribute value of the attribute name "wrapping degree", thereby improving the coverage of the labeling.
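The default-value part of soft matching could be sketched roughly as follows. The function names, the data layout (one attribute dictionary per SKU) and the `min_share` threshold are assumptions; the prediction of values from other attributes and the reverse labeling step are not covered here.

```python
from collections import Counter


def default_values(sku_attributes, min_share=0.6):
    """Derive a default value per attribute name from other SKUs.

    sku_attributes: list of dicts, each mapping attribute name -> value.
    A value becomes the default only if it accounts for at least
    `min_share` of the observed values of that attribute (assumed rule).
    """
    counts = {}
    for attrs in sku_attributes:
        for name, value in attrs.items():
            counts.setdefault(name, Counter())[value] += 1
    defaults = {}
    for name, counter in counts.items():
        value, n = counter.most_common(1)[0]
        if n / sum(counter.values()) >= min_share:
            defaults[name] = value
    return defaults


def fill_missing(attrs, defaults):
    """Fill attributes absent from `attrs` with defaults; known values win."""
    return {**defaults, **attrs}


skus = [
    {"wrapping degree": "full wrap"},
    {"wrapping degree": "full wrap"},
    {"wrapping degree": "half wrap"},
]
defaults = default_values(skus)
print(fill_missing({"color": "black"}, defaults))
```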
In one embodiment of the present invention, generating sample data according to a plurality of sample information includes: constructing positive sample data and negative sample data according to the sample information, wherein the positive sample data are sample data in which the item title contains the attribute value corresponding to the item attribute name, and the negative sample data are sample data in which the item title does not contain the attribute value corresponding to the item attribute name. Training a machine learning model by using a plurality of sample data to generate an attribute extraction model includes: training the machine learning model by using the plurality of positive sample data and the plurality of negative sample data to generate the attribute extraction model. Using both positive and negative sample data improves the adaptability of the attribute extraction model and further ensures the quality of attribute extraction.
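One simple way to construct positive and negative sample data is sketched below. The `Sample` class and `build_sample` function are hypothetical, exact substring matching stands in for the full set of labeling strategies, and spans use per-character positions with an exclusive end.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Sample:
    category: str
    attribute_name: str
    title: str
    span: Optional[Tuple[int, int]]  # (start, end) in the title, or None

    @property
    def is_positive(self) -> bool:
        """A sample is positive when the title contains the attribute value."""
        return self.span is not None


def build_sample(category, title, attribute_name, attribute_value):
    """Build a positive sample if the value occurs in the title, else a negative one."""
    start = title.find(attribute_value) if attribute_value else -1
    span = (start, start + len(attribute_value)) if start >= 0 else None
    return Sample(category, attribute_name, title, span)


pos = build_sample("T-shirts", "pure cotton tee black", "fabric", "pure cotton")
neg = build_sample("T-shirts", "pure cotton tee black", "sleeve length", "short sleeve")
print(pos.is_positive, pos.span)
print(neg.is_positive, neg.span)
```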
FIG. 4 is a schematic structural diagram of an attribute extraction model according to an embodiment of the present invention. As shown in FIG. 4, the machine learning model is a model constructed based on the MRC model and the pointer network model. MRC is a technique that uses algorithms to enable a computer to understand the semantics of a passage and answer related questions. Specifically, after a user question and a related text document are input, an MRC-based model can automatically extract a continuous text span from the passage according to the calculation result and output it as the answer to the user question. The pointer network model can also be replaced by a multi-head labeling model or a fragment arrangement model.
As shown in fig. 4, the input text sequence is a text spliced by the item category and the attribute name as a query question, and the item title as an answer context. The input sequence is encoded into a vector representation after passing through a pre-trained language model, such as the Roberta-wwm-ext model.
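The input layout described above can be sketched as follows. The exact placement of the `[CLS]` and `[SEP]` markers and the whitespace tokenization are assumptions made for illustration; the source names the 'SEP' and CLS markers but does not spell out the full layout, and a real system would use the tokenizer of the pre-trained language model.

```python
def build_model_input(category, attribute_name, title):
    """Splice category + attribute name (query) with the item title (context).

    Returns the token list plus the half-open range [title_start, title_end)
    that covers the title tokens; tokens[title_end] is the final [SEP].
    """
    query_tokens = category.split() + attribute_name.split()
    title_tokens = title.split()
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + title_tokens + ["[SEP]"]
    title_start = len(query_tokens) + 2  # skip [CLS], the query, and the first [SEP]
    title_end = title_start + len(title_tokens)
    return tokens, title_start, title_end
```

The returned title range is what a span decoder would need in order to restrict answer positions to the title.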
The embodiment of the invention carries out certain constraint on the pointer network to ensure that the generated answer interval is in a reasonable range. The start position must be in the item header and the end position must be between the predicted start position and the end of sequence marker 'SEP'. When the ending position is calculated, the vector of the token corresponding to the starting position is superposed on the vector of each possible ending position token. The penalty function for this task is defined as the average of the cross entropy penalty for the start and end positions.
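A minimal NumPy sketch of the constrained span decoding and the averaged cross-entropy loss follows. Everything beyond the two constraints and the loss definition is an assumption: the names `hidden` and `end_weights`, the use of a single linear scoring vector, greedy argmax decoding, and the choice to exclude the 'SEP' position itself from the candidate end positions are illustrative only.

```python
import numpy as np


def log_softmax(x):
    x = np.asarray(x, dtype=float)
    x = x - np.max(x)
    return x - np.log(np.sum(np.exp(x)))


def decode_span(start_logits, hidden, end_weights, title_start, sep_index):
    """Greedy (start, end) decoding under the two constraints in the text.

    hidden: (seq_len, dim) token vectors; end_weights: (dim,) hypothetical scorer.
    """
    start_logits = np.asarray(start_logits, dtype=float)
    # Constraint 1: the start position must lie inside the item title.
    masked_start = np.full_like(start_logits, -np.inf)
    masked_start[title_start:sep_index] = start_logits[title_start:sep_index]
    start = int(np.argmax(masked_start))

    # Add the start-token vector to every candidate end-token vector.
    end_logits = (hidden + hidden[start]) @ end_weights

    # Constraint 2: the end position lies between the start and the SEP marker
    # (SEP itself is excluded here, which is an assumption).
    masked_end = np.full_like(end_logits, -np.inf)
    masked_end[start:sep_index] = end_logits[start:sep_index]
    end = int(np.argmax(masked_end))
    return start, end


def span_loss(start_logits, end_logits, gold_start, gold_end):
    """Average of the cross-entropy losses for the start and end positions."""
    start_ce = -log_softmax(start_logits)[gold_start]
    end_ce = -log_softmax(end_logits)[gold_end]
    return (start_ce + end_ce) / 2.0
```

Masking with negative infinity before the argmax keeps positions outside the allowed range from ever being selected, however large their raw scores are.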
The model of the embodiment of the invention belongs to the field of multi-task learning, and the CLS vector is selected to be used for a No-Answer binary task to judge whether the title of the current article contains an attribute value corresponding to the attribute name.
The core idea of the scheme of the embodiment of the invention is to combine the prior input of the category text and the attribute name text, so that the expansion capability and the attack resistance of the model are obviously improved. To facilitate understanding of the aspects of the embodiments of the present invention and to illustrate the effects of the embodiments of the present invention, a description will be given below of a specific embodiment for determining an article attribute value. The scheme of the embodiment of the invention comprises three parts: data preparation, model construction, online prediction.
Part 1: data preparation. Item titles and original attribute lists of different categories are collected from the active item list, the item attribute list and the item extended attribute list of Jingdong Mall, and the items are divided into self-operated items and third-party items. Self-operated items are selected to train the model, because their attributes are considered to be highly complete and accurate.
The item title, the attribute names and the attribute values of each item are preprocessed to improve the accuracy of model extraction. Samples for model training are then constructed from the screened and preprocessed data.
The data set is split 90:10 into a training set and a validation set. Because some first-level categories contain so many items that model training would take too long, single-pass clustering is performed on the SKU titles by third-level category and brand, and a small number of SKUs from each cluster are selected as representative samples of that cluster. This greatly reduces the size of the training set (by more than 50% on average) while preserving the richness of the samples; experimental results verifying the value of this step are given later.
The samples of the manual evaluation set are selected from the top 10 best-selling self-operated items of each third-level category. The attribute values of these items are complete, the number of samples per category may be 3 to 10, and the samples are strictly labeled by hand. In addition, the difference between each attribute name-attribute value pair in the title and the existing specification attributes is labeled as one of three types: Same (identical to the specification attribute), New (missing from the specification attributes), and Amend (the attribute name or value needs to be corrected). Among the first-level categories, 9987 denotes the mobile phone/communication category, 1315 denotes the apparel and underwear category, and 11729 denotes the footwear category. According to the statistics, the average proportion of missing attributes across the three first-level categories is 36.23%, and the proportion of erroneous attribute values is 8.41%. These figures show, from the data side, the importance of the attribute extraction module.
The manual evaluation set samples are subsequently used to verify the output of the model, since they accurately reflect the data the model is intended to handle. The proportions of missing and erroneous attributes are also labeled to show the value of extracting attributes from item titles.
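The clustering step can be sketched as below. Single-pass clustering assigns each title, in order, to the most similar existing cluster if the similarity reaches a threshold, and otherwise opens a new cluster. The Jaccard similarity on whitespace tokens, the threshold value, the field names `category3`, `brand` and `title`, and the choice of the first members as representatives are all assumptions; the source does not specify them.

```python
from collections import defaultdict


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def single_pass_cluster(titles, threshold=0.5):
    """One pass over the titles; each cluster is compared via its first (seed) title."""
    clusters = []
    for title in titles:
        tokens = set(title.split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(tokens, cluster["seed"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(title)
        else:
            clusters.append({"seed": tokens, "members": [title]})
    return [c["members"] for c in clusters]


def select_representatives(skus, per_cluster=1, threshold=0.5):
    """Cluster titles within each (third-level category, brand) group and keep a few per cluster."""
    groups = defaultdict(list)
    for sku in skus:
        groups[(sku["category3"], sku["brand"])].append(sku["title"])
    representatives = []
    for titles in groups.values():
        for members in single_pass_cluster(titles, threshold):
            representatives.extend(members[:per_cluster])
    return representatives
```

Because near-duplicate titles collapse into one cluster, the training set shrinks while each distinct kind of title remains represented.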
Table 1. Labeling results of the manual evaluation set
[Table image: BDA0003095026840000121]
A second part: models were constructed and embodiments of the present invention used machine learning models as shown in FIG. 4 for training.
The technical indexes adopted by the embodiment of the invention are as follows: answer Acc is used for representing the accuracy of whether the title of the article contains information of a certain attribute name. F1 was used to characterize the standard evaluation index in the NER.
The service indexes adopted by the embodiment of the invention are as follows: and the New-Recall is used for representing the Recall rate of the attribute names newly added in the manual labeling set. Amend-Recall is used to characterize the Recall rate of the revised attributes in the manual annotation set.
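One way these four metrics could be computed is sketched below. The data layout (dictionaries keyed by `(sku, attribute_name)` with `None` meaning "no value in the title"), the exact-match criterion, and the `tags` mapping to Same/New/Amend are assumptions for illustration; the source gives only the one-line definitions above.

```python
def answer_accuracy(gold, pred):
    """Share of (sku, attribute) keys where 'has a value' is judged correctly."""
    hits = sum((gold[k] is None) == (pred.get(k) is None) for k in gold)
    return hits / len(gold)


def precision_recall_f1(gold, pred):
    """Exact-match precision, recall and F1 over extracted attribute values."""
    gold_pairs = {(k, v) for k, v in gold.items() if v is not None}
    pred_pairs = {(k, v) for k, v in pred.items() if v is not None}
    tp = len(gold_pairs & pred_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gold_pairs) if gold_pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def tagged_recall(gold, pred, tags, tag):
    """Recall restricted to keys labeled with `tag` ('New' or 'Amend')."""
    keys = [k for k in gold if tags.get(k) == tag]
    if not keys:
        return 0.0
    return sum(pred.get(k) == gold[k] for k in keys) / len(keys)
```

Here `tagged_recall(gold, pred, tags, "New")` plays the role of New-Recall and `tagged_recall(gold, pred, tags, "Amend")` the role of Amend-Recall.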
The model training process is as follows: the model is trained according to the first-class granularity, and when F1 scores of the verification set are continuously evaluated for multiple times and decline occurs, the model training is stopped. Each experiment was repeated 10 times and the final results averaged and are shown in table 2.
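The stopping rule can be sketched as a simple early-stopping loop. The callables `train_step` and `evaluate_f1`, the `patience` value of 3, and the `max_evals` cap are hypothetical; the source says only that training stops after the validation F1 declines over several consecutive evaluations.

```python
def train_with_early_stopping(train_step, evaluate_f1, max_evals=100, patience=3):
    """Train until validation F1 has declined `patience` evaluations in a row.

    train_step: callable that runs one training interval.
    evaluate_f1: callable that returns the current validation F1.
    Returns the best F1 seen and the full evaluation history.
    """
    best_f1 = float("-inf")
    history = []
    declines = 0
    for _ in range(max_evals):
        train_step()
        f1 = evaluate_f1()
        # Count consecutive declines relative to the previous evaluation.
        if history and f1 < history[-1]:
            declines += 1
        else:
            declines = 0
        history.append(f1)
        best_f1 = max(best_f1, f1)
        if declines >= patience:
            break
    return best_f1, history
```

In practice the checkpoint with the best validation F1 would typically be the one kept, and the run would be repeated (10 times in the experiments described) with the results averaged.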
Table 2. Model evaluation metrics
[Table images: BDA0003095026840000131, BDA0003095026840000141]
In Table 2, BERT-NER-CRF denotes the NER model. JAVE (JD Attribute Values Extraction) is the model obtained by training the machine learning model structure shown in fig. 4 according to the embodiment of the present invention. JAVE w/o cat_text is the JAVE model with the category text removed. JAVE w/o no-answer is the JAVE model with the module that predicts whether an answer (i.e., an attribute value) is present in the item title removed.
As can be seen from Table 2, the NER model has almost no predictive ability for unseen attribute names and is not effective for attribute correction. JAVE generalizes better and shows varying degrees of extensibility across the three first-level categories.
In Table 2, 11729-Cluster denotes the representative sample data set selected from the items under the 11729 category and used for model evaluation. The representative samples are determined as follows: the items under the 11729 category are clustered, and a small number of items are selected from each cluster as representative samples of the category. 11729-Full denotes the complete data set containing all items under the 11729 category, likewise used for model evaluation.
9987-Cluster denotes the representative sample data set selected from the items under the 9987 category and used for model evaluation. The representative samples are determined as follows: the items under the 9987 category are clustered, and a small number of items are selected from each cluster as representative samples of the category. 9987-Full denotes the complete data set containing all items under the 9987 category, likewise used for model evaluation.
Comparing the results in Table 2 of training with the full data sets (11729-Full and 9987-Full) and with the clustered representative sample data sets (11729-Cluster and 9987-Cluster), the representative sample data sets of the 11729 and 9987 categories perform better, possibly because labeling errors introduced by distant supervision occur less often in the representative sample data sets. In the embodiment of the invention, all subsequent experiments use the clustered representative sample data sets, which improves the accuracy of model identification and, because these data sets are smaller, also improves the efficiency of model training.
Ablation experiment: comparing JAVE with JAVE w/o cat_text, when the category text is removed the metrics for apparel and underwear (1315) and footwear (11729) drop markedly, while the change for mobile phone/communication (9987) is not significant. The likely reason is that the attribute names of the third-level categories under 1315 and 11729 overlap heavily, so that once the category text is removed the model has difficulty extracting the corresponding attribute values accurately, whereas the attribute names of the third-level categories under 9987 differ considerably and are less affected.
Comparing JAVE with JAVE w/o no-answer: a positive sample for the No-Answer module is a labeled attribute name and is marked 1, indicating that an attribute value exists. A negative sample is randomly selected from the attribute names that remain after removing those semantically similar to the labeled attribute names, and is marked 0, indicating that no attribute value exists. Because the construction of negative samples introduces some noise, the improvement brought by the module is small.
Part 3: online prediction. A key attribute list is obtained for the online items of the system, and the attribute names are spliced with the item titles at prediction time. The key attributes are obtained by merging two sources: one is the list of attribute names that appear with high frequency in the specification attributes and user comments, counted at third-level category granularity; the other is the list of attribute names that appear with high frequency in the manual labeling.
Model prediction: the online attribute extraction task is performed as an initial full prediction followed by weekly incremental prediction updates. The incremental update covers two cases, changed item titles and newly added items. The final results are shown in Table 3.
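The construction of positive and negative samples for the No-Answer task can be sketched as follows. The semantic similarity check is replaced here by a character-overlap ratio from the standard library, which is only a stand-in; the similarity threshold, the number of negatives, the fixed random seed, and the tuple layout are likewise assumptions.

```python
import random
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.6):
    """Stand-in for semantic similarity: character-overlap ratio (an assumption)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def build_no_answer_samples(category, title, labeled_names, candidate_names,
                            num_negatives=2, seed=0):
    """Return (category, attribute_name, title, label) tuples.

    Label 1: the attribute name was labeled in the title (a value exists).
    Label 0: a randomly chosen attribute name that is neither labeled nor
             similar to a labeled name (no value exists).
    """
    rng = random.Random(seed)
    positives = [(category, name, title, 1) for name in labeled_names]
    remaining = [
        name for name in candidate_names
        if name not in labeled_names
        and not any(is_similar(name, labeled) for labeled in labeled_names)
    ]
    chosen = rng.sample(remaining, min(num_negatives, len(remaining)))
    negatives = [(category, name, title, 0) for name in chosen]
    return positives + negatives
```

Filtering out near-synonyms of the labeled names before sampling reduces, but does not eliminate, the chance that a negative sample actually has a value in the title, which matches the noise noted above.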
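The selection of items for each prediction run can be sketched as below. The snapshot format (a dictionary from SKU identifier to title) and the function name are assumptions; the source states only that a full prediction is run first and that weekly increments cover changed titles and newly added items.

```python
def select_for_prediction(current_titles, previous_titles):
    """Return the SKU ids that need (re)prediction.

    current_titles / previous_titles: dict of sku_id -> item title.
    An empty previous snapshot triggers the initial full prediction.
    """
    if not previous_titles:
        return sorted(current_titles)
    # A missing previous entry (new item) or a differing title (changed item)
    # both make the comparison below true.
    return sorted(
        sku for sku, title in current_titles.items()
        if previous_titles.get(sku) != title
    )
```

Each selected SKU would then be paired with every key attribute name of its category to form the model inputs for that run.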
Table 3. Prediction results
[Table images: BDA0003095026840000151, BDA0003095026840000161]
In Table 3, JAVE-1315+cat denotes the JAVE model with 1315 category information added. JAVE-1315 denotes the JAVE model without 1315 category information. JAVE-1315+11729+cat denotes the JAVE model with 1315 and 11729 category information added. JAVE-1315+11729 denotes the JAVE model without 1315 and 11729 category information.
As can be seen from Table 3, in the first group of prediction results the JAVE model is a considerable improvement over the NER model in attribute supplementation and correction. In the second group, the model with category information accurately extracts the fabric attribute value: flannel is common for shirts, whereas pure cotton can appear in almost all clothing categories. In the third group, noise text is added to the item title by appending the apparel size attribute value xxl to the first item title. With the corpora of the two categories 1315 and 11729 mixed, the model with category text accurately identifies the attribute value, such as the size, that corresponds to the same attribute name in different first-level categories, whereas without the category prompt the model predicts the size value that appears frequently in apparel categories.
In summary, the embodiment of the present invention provides an attribute determination scheme that has strong attribute extensibility and is suitable for a wide range of commodity categories. When facing new attributes, the scheme draws on the generalization capability of the model, so that its extraction capability is greatly improved compared with an NER system, and attribute correction and supplementation are also greatly improved. Attribute values can be determined accurately even when different categories share the same attribute name, or when the title contains no value for the current attribute.
Fig. 5 is a schematic structural diagram of an apparatus for determining an article attribute value according to an embodiment of the present invention. As shown in fig. 5, the apparatus includes:
an information obtaining module 501, configured to obtain an item category, an item attribute name, and an item title of a target item;
an attribute value determination module 502, configured to input the item category, the item attribute name, and the item title into the attribute extraction model to obtain an item attribute value of the target item, where the item attribute value corresponds to the item attribute name; the attribute extraction model comprises an input layer, a coding layer and an output layer, wherein the input parameter of the input layer is a character string spliced from the item category, the item attribute name and the item title, the coding layer is used for encoding the character string into a plurality of word vectors, and the output layer determines the labeling information of the item attribute name in the item title based on the word vectors.
Optionally, the apparatus further comprises:
a model generating module 503, configured to obtain sample information of a plurality of samples, where the sample information includes: sample category, sample title, sample attribute name and its corresponding sample attribute value;
generating sample data according to the sample information;
and training a machine learning model by using a plurality of sample data to generate an attribute extraction model.
Optionally, the model generation module 503 is specifically configured to:
acquiring sample information of a current sample;
according to the sample attribute value of the current sample, carrying out labeling processing on the sample title of the current sample;
generating sample data of the current sample, wherein the sample data comprises: sample category, sample title and sample attribute name after labeling processing.
Optionally, the model generation module 503 is specifically configured to:
performing title preprocessing on a sample title of the current sample;
performing attribute preprocessing on the sample attribute value of the current sample;
matching the processed sample attribute value in the processed sample title to obtain the labeling information corresponding to the sample attribute name of the current sample, wherein the labeling information comprises: a start position and an end position.
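The labeling steps listed above (title preprocessing, attribute preprocessing, then matching to obtain a start and end position) can be sketched for the simplest exact-match case as follows. The normalization rules, the character-level offsets, and the inclusive end position are assumptions; the source does not define the preprocessing or the offset convention.

```python
import re


def preprocess(text):
    """Hypothetical normalization: trim, lowercase, collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())


def label_title(sample_title, sample_attribute_value):
    """Locate the processed attribute value in the processed title.

    Returns {'start': i, 'end': j} as character offsets into the processed
    title (end inclusive), or None when the value does not occur.
    """
    title = preprocess(sample_title)
    value = preprocess(sample_attribute_value)
    start = title.find(value) if value else -1
    if start < 0:
        return None
    return {"start": start, "end": start + len(value) - 1}
```

A result of `None` corresponds to a title that lacks the attribute value, which is the case the other labeling strategies or a negative sample would handle.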
Optionally, the model generation module 503 is specifically configured to:
according to the sample attribute value of the current sample, carrying out labeling processing on the sample title of the current sample according to a preset labeling strategy, wherein the preset labeling strategy comprises at least one of the following strategies: a specification attribute hard matching strategy, a maximum substring matching strategy, a regular expression strategy and a soft matching strategy.
Optionally, the model generation module 503 is specifically configured to:
constructing positive sample data and negative sample data according to the sample information, wherein positive sample data are sample data in which the item title contains the attribute value corresponding to the item attribute name, and negative sample data are sample data in which the item title does not contain the attribute value corresponding to the item attribute name;
and training the machine learning model by using a plurality of positive sample data and a plurality of negative sample data to generate the attribute extraction model.
Optionally, the machine learning model is a model constructed based on an MRC model and a pointer network model.
An embodiment of the present invention provides an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method of any of the embodiments described above.
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use with a terminal device implementing an embodiment of the invention is shown. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed in the storage section 608 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising an information acquisition module and an attribute value determination module. The names of these modules do not in some cases constitute a limitation on the modules themselves; for example, the information acquisition module may also be described as a "module that acquires an item category, an item attribute name, and an item title of a target item".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform:
acquiring the item category, the item attribute name and the item title of a target item;
inputting the item category, the item attribute name and the item title into the attribute extraction model to obtain an item attribute value of the target item, wherein the item attribute value corresponds to the item attribute name; the attribute extraction model comprises an input layer, a coding layer and an output layer, wherein the input parameter of the input layer is a character string spliced from the item category, the item attribute name and the item title, the coding layer is used for encoding the character string into a plurality of word vectors, and the output layer determines the labeling information of the item attribute name in the item title based on the word vectors.
According to the technical scheme of the embodiment of the invention, the item title contains a large number of reliable and rapidly updated attributes, so attribute values can be determined more accurately from the item title. In addition, the category indicates the class to which an item belongs; classifying items into categories allows different items to be clearly distinguished and organized, and taking the item category into account when determining the attribute value further ensures the accuracy of the obtained attribute value.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for determining an item attribute value, comprising:
acquiring the item category, the item attribute name and the item title of a target item;
inputting the item category, the item attribute name and the item title into an attribute extraction model to obtain an item attribute value of the target item, wherein the item attribute value corresponds to the item attribute name; the attribute extraction model comprises an input layer, a coding layer and an output layer, wherein the input parameter of the input layer is a character string spliced from the item category, the item attribute name and the item title, the coding layer is used for encoding the character string into a plurality of word vectors, and the output layer determines the labeling information of the item attribute name in the item title based on the word vectors.
2. The method according to claim 1, wherein, before the inputting into the attribute extraction model to obtain the item attribute value of the target item, the method further comprises:
obtaining sample information for a plurality of samples, the sample information comprising: sample category, sample title, sample attribute name and its corresponding sample attribute value;
generating sample data according to the sample information;
and training a machine learning model by using a plurality of sample data to generate the attribute extraction model.
3. The method of claim 2, wherein generating sample data from the plurality of sample information comprises:
acquiring sample information of a current sample;
according to the sample attribute value of the current sample, carrying out labeling processing on the sample title of the current sample;
generating sample data of the current sample, wherein the sample data comprises: sample category, sample title and sample attribute name after labeling processing.
4. The method according to claim 3, wherein the labeling of the sample title of the current sample according to the sample attribute value of the current sample comprises:
performing title preprocessing on a sample title of the current sample;
performing attribute preprocessing on the sample attribute value of the current sample;
matching the processed sample attribute value in the processed sample title to obtain the labeling information corresponding to the sample attribute name of the current sample, wherein the labeling information comprises: a start position and an end position.
5. The method according to claim 3, wherein the labeling of the sample title of the current sample according to the sample attribute value of the current sample comprises:
according to the sample attribute value of the current sample, carrying out labeling processing on the sample title of the current sample according to a preset labeling strategy, wherein the preset labeling strategy comprises at least one of the following strategies: a specification attribute hard matching strategy, a maximum substring matching strategy, a regular expression strategy and a soft matching strategy.
6. The method of claim 2, wherein generating sample data from the plurality of sample information comprises:
constructing positive sample data and negative sample data according to the sample information, wherein positive sample data are sample data in which the item title contains the attribute value corresponding to the item attribute name, and negative sample data are sample data in which the item title does not contain the attribute value corresponding to the item attribute name;
the training of the machine learning model by using a plurality of sample data to generate the attribute extraction model comprises:
and training the machine learning model by using a plurality of positive sample data and a plurality of negative sample data to generate the attribute extraction model.
7. The method of claim 1, wherein the machine learning model is a model constructed based on an MRC model and a pointer network model.
8. An apparatus for determining an attribute value of an article, comprising:
the information acquisition module is used for acquiring the item category, the item attribute name and the item title of the target item;
the attribute value determining module is used for inputting the item category, the item attribute name and the item title into an attribute extraction model to obtain an item attribute value of the target item, wherein the item attribute value corresponds to the item attribute name; the attribute extraction model comprises an input layer, a coding layer and an output layer, wherein the input parameter of the input layer is a character string spliced from the item category, the item attribute name and the item title, the coding layer is used for encoding the character string into a plurality of word vectors, and the output layer determines the labeling information of the item attribute name in the item title based on the word vectors.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202110608525.3A 2021-06-01 2021-06-01 Method and device for determining property value of article Pending CN113360724A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110608525.3A CN113360724A (en) 2021-06-01 2021-06-01 Method and device for determining property value of article

Publications (1)

Publication Number Publication Date
CN113360724A true CN113360724A (en) 2021-09-07

Family

ID=77530859

Country Status (1)

Country Link
CN (1) CN113360724A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination