CN113052661B - Method and device for acquiring attribute information, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113052661B
CN113052661B (Application CN202110400102.2A)
Authority
CN
China
Prior art keywords
target
text
vector
features
attribute
Prior art date
Legal status
Active
Application number
CN202110400102.2A
Other languages
Chinese (zh)
Other versions
CN113052661A (en)
Inventor
王玥
李浩然
祝天刚
Current Assignee
Jingdong Technology Holding Co Ltd
Original Assignee
Jingdong Technology Holding Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Technology Holding Co Ltd
Priority to CN202110400102.2A
Publication of CN113052661A
Application granted
Publication of CN113052661B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/06: Buying, selling or leasing transactions
    • G06Q30/0601: Electronic shopping [e-shopping]
    • G06Q30/0623: Item investigation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53: Querying
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks


Abstract

The application provides a method and a device for acquiring attribute information, an electronic device and a storage medium. The method includes: acquiring target text data of a target object and target image data associated with the target text data; extracting target semantic features of the target text data and target image features of the target image data; and predicting a target attribute of the target object by using the target image features and the target semantic features to obtain target attribute information of the target attribute. The method and the device solve the problem in the related art that attribute-value completion is inefficient because information extraction is inaccurate.

Description

Method and device for acquiring attribute information, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computers, and in particular, to a method and apparatus for acquiring attribute information, an electronic device, and a storage medium.
Background
In e-commerce scenarios, product attribute information is important for tasks such as customer service, product recommendation, and product retrieval. However, owing to issues such as incomplete information entry, the attribute information of some products is severely missing. For example, when a user asks customer service about the collar type of a garment, if the product attribute table contains no description of the collar type, it is difficult for the customer service agent to answer the question.
Currently, most approaches to completing product attribute values focus on filling them in from the textual description of the product. However, because of the way products are described and other factors, such attribute-value completion suffers from inaccurate information extraction, which makes the completion of attribute information inefficient.
Disclosure of Invention
The application provides a method and a device for acquiring attribute information, an electronic device and a storage medium, which at least solve the problem in the related art that attribute-value completion is inefficient because information extraction is inaccurate.
According to an aspect of the embodiments of the present application, there is provided a method for acquiring attribute information, including: acquiring target text data of a target object and target image data with an association relationship with the target text data; extracting target semantic features of the target text data and extracting target image features of the target image data; and predicting the target attribute of the target object by using the target image feature and the target semantic feature to obtain target attribute information of the target attribute.
Optionally, extracting the target semantic feature of the target text data includes: and encoding the target text data by using a target bi-directional language model to obtain the target semantic features output by the bi-directional language model, wherein the target bi-directional language model is a pre-trained language representation model for extracting the semantic representation of each text unit in the input text data.
Optionally, extracting the target image feature of the target image data includes: inputting the target image data into a target convolutional neural network, wherein the target convolutional neural network is a pre-trained residual network for extracting image features of an input image; and extracting the characteristics output by the previous convolution layer of the full-connection layer of the target convolution neural network to obtain the target image characteristics.
Optionally, predicting the target attribute of the target object using the target image feature and the target semantic feature, and obtaining the target attribute information of the target attribute includes: performing cross-modal attention fusion on the target image features and the target semantic features to obtain target fusion features; and inputting the target fusion characteristics into a target attribute prediction model to obtain the target attribute information output by the target attribute prediction model.
Optionally, the target semantic features comprise a first coding vector for each text unit in the target text data; and performing cross-modal attention fusion on the target image features and the target semantic features to obtain the target fusion features comprises: performing cross-modal attention coding on the target semantic features by using the target image features to obtain a second coding vector of each text unit; filtering the second coding vector of each text unit by using a cross-modal attention filter to obtain a third coding vector of each text unit, wherein the cross-modal attention filter is used for bitwise filtering out, from the second coding vector of a text unit, image information that is unrelated to the target image data; and splicing the first coding vector of each text unit with the third coding vector of each text unit to obtain a target coding vector of each text unit, wherein the target fusion features comprise the target coding vector of each text unit.
Optionally, cross-modal attention encoding the target semantic feature using the target image feature, obtaining the second encoding vector for each text unit includes: determining the second coding vector of each text unit according to a first attention vector, a second attention vector and a cross-modal mapping matrix, wherein the first attention vector is an attention vector of a text mode corresponding to the target semantic feature, the second attention vector is a cross-modal attention vector of a text and picture mode corresponding to the target semantic feature and the target image feature, and the cross-modal mapping matrix is used for mapping the second attention vector from a visual semantic space to a text semantic space.
Optionally, inputting the target fusion feature to a target attribute prediction model, and obtaining the target attribute information output by the target attribute prediction model includes: inputting the target coding vector of each text unit into the target attribute prediction model to obtain an attribute prediction result corresponding to each text unit; and determining the target attribute information of the target attribute according to the attribute prediction result corresponding to each text unit.
According to another aspect of the embodiments of the present application, there is also provided an apparatus for acquiring attribute information, including: the acquisition unit is used for acquiring target text data of a target object and target image data with an association relation with the target text data; an extraction unit for extracting target semantic features of the target text data and extracting target image features of the target image data; and the prediction unit is used for predicting the target attribute of the target object by using the target image characteristic and the target semantic characteristic to obtain target attribute information of the target attribute.
Optionally, the extracting unit includes: the first extraction module is used for encoding the target text data by using a target bi-directional language model to obtain the target semantic features output by the bi-directional language model, wherein the target bi-directional language model is a pre-trained language representation model used for extracting the semantic representation of each text unit in the input text data.
Optionally, the extracting unit includes: the first input module is used for inputting the target image data into a target convolutional neural network, wherein the target convolutional neural network is a pre-trained residual network used for extracting image features of an input image; and the extraction module is used for extracting the characteristics output by the previous convolution layer of the full-connection layer of the target convolution neural network to obtain the target image characteristics.
Optionally, the prediction unit includes: the fusion module is used for carrying out cross-modal attention fusion on the target image features and the target semantic features to obtain target fusion features; and the second input module is used for inputting the target fusion characteristic into a target attribute prediction model to obtain the target attribute information output by the target attribute prediction model.
Optionally, the target semantic feature comprises a first encoding vector for each text unit in the target text data; the fusion module comprises: the encoding submodule is used for performing cross-modal attention encoding on the target semantic features by using the target image features to obtain a second encoding vector of each text unit; a filtering sub-module, configured to filter the second coding vector of each text unit by using a cross-modal attention filter, so as to obtain a third coding vector of each text unit, where the cross-modal attention filter is configured to filter, by bit, image information in the second coding vector of a text unit unrelated to the target image data; and the splicing module is used for splicing the first coding vector of each text unit with the third coding vector of each text unit to obtain a target coding vector of each text unit, wherein the target fusion characteristic comprises the target coding vector of each text unit.
Optionally, the encoding submodule includes: a determining subunit, configured to determine the second encoding vector of each text unit according to a first attention vector, a second attention vector and a cross-modal mapping matrix, where the first attention vector is an attention vector of a text modality corresponding to the target semantic feature, the second attention vector is a cross-modal attention vector of a text and picture modality corresponding to the target semantic feature and the target image feature, and the cross-modal mapping matrix is configured to map the second attention vector from a visual semantic space to a text semantic space.
Optionally, the second input module includes: the input sub-module is used for inputting the target coding vector of each text unit into the target attribute prediction model to obtain an attribute prediction result corresponding to each text unit; and the determining submodule is used for determining the target attribute information of the target attribute according to the attribute prediction result corresponding to each text unit.
According to yet another aspect of the embodiments of the present application, there is also provided an electronic device including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete communication with each other through the communication bus; wherein the memory is used for storing a computer program; a processor for performing the method steps of any of the embodiments described above by running the computer program stored on the memory.
According to a further aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the method steps of any of the embodiments described above when run.
In the embodiments of the application, attribute information is extracted by combining text data with image data: target text data of a target object and target image data associated with the target text data are acquired; target semantic features of the target text data and target image features of the target image data are extracted; and the target attribute of the target object is predicted by using the target image features and the target semantic features to obtain target attribute information of the target attribute. By fusing multiple kinds of information, i.e., combining the text data with the image information when acquiring attribute information, the accuracy of information extraction and the efficiency of attribute-information completion are improved, thereby solving the problem in the related art that attribute-value completion is inefficient because information extraction is inaccurate.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a hardware environment of an alternative method for obtaining attribute information according to an embodiment of the present application;
FIG. 2 is a flow chart of an alternative method for obtaining attribute information according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative method for obtaining attribute information according to an embodiment of the present application;
FIG. 4 is a flow chart of another alternative method for obtaining attribute information according to an embodiment of the present application;
FIG. 5 is a block diagram of an alternative attribute information acquisition device according to an embodiment of the present application;
fig. 6 is a block diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will be made in detail and with reference to the accompanying drawings in the embodiments of the present application, it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to one aspect of the embodiments of the present application, a method for acquiring attribute information is provided. Alternatively, in the present embodiment, the above-described method of acquiring attribute information may be applied to a hardware environment constituted by the terminal 102 and the server 104 as shown in fig. 1. As shown in fig. 1, the server 104 is connected to the terminal 102 through a network, and may be used to provide services (such as game services, application services, etc.) to the terminal or clients installed on the terminal, and a database may be provided on the server or independent of the server, for providing data storage services to the server 104.
The network may include, but is not limited to, at least one of: wired network, wireless network. The wired network may include, but is not limited to, at least one of: a wide area network, a metropolitan area network, a local area network, and the wireless network may include, but is not limited to, at least one of: WIFI (Wireless Fidelity ), bluetooth. The terminal 102 may not be limited to a PC, a mobile phone, a tablet computer, etc.
The method for acquiring attribute information in the embodiment of the present application may be executed by the server 104, may be executed by the terminal 102, or may be executed by both the server 104 and the terminal 102. The method for obtaining the attribute information by the terminal 102 according to the embodiment of the present application may be performed by a client installed thereon.
Taking the example that the server 104 performs the method for acquiring attribute information in the present embodiment, fig. 2 is a schematic flow chart of an alternative method for acquiring attribute information according to the embodiment of the present application, as shown in fig. 2, the flow of the method may include the following steps:
step S202, target text data of a target object and target image data with an association relation with the target text data are acquired.
The method for acquiring attribute information in this embodiment may be applied to any scenario where attribute information needs to be acquired, for example an e-commerce scenario, or other scenarios. Taking an e-commerce scenario as an example, for the garment shown in fig. 3, the description page of the garment includes a piece of product description text (for example, "the garment is a golden stand-collar T-shirt") and a product picture, while the attribute list of the garment does not include attribute information for properties such as collar type and color, so the missing attribute information needs to be extracted to complete the list.
For a target object, the target object may be a product (e.g., clothing, tableware, etc.), or may be another type of object, and the server may obtain target text data and target image data of the target object from a local, terminal device, another server, or other device. The target image data has an association relationship with the target text data, that is, the target text data describes a target object contained in the target image data.
Step S204, extracting target semantic features of the target text data, and extracting target image features of the target image data.
The server can extract semantic features of the target text data to obtain target semantic features. The extraction manner of the semantic features of the target text data may be various, for example, a two-way language model or other models capable of extracting semantic features are used for extracting semantic features, which is not particularly limited in this embodiment.
When extracting the semantic features of the target text data, the encoding may be performed per text unit, so as to obtain a coding vector corresponding to each text unit. Each text unit may correspond to one coding vector, and the target semantic features include the coding vector of each text unit. The encoding of a text unit may be based on the text units preceding it, on the text units both preceding and following it (as with a bi-directional language model), or on the text units following it. The text units may be words, terms, or other types of text units.
The server may perform image feature extraction on the target image data to obtain the target image features. The image features of the target image data may be extracted in various ways; for example, a convolutional neural network or another model capable of extracting image features may be used, and the convolutional neural network may be a residual network such as ResNet-50 or ResNet-101, which is not limited in this embodiment.
Step S206, predicting the target attribute of the target object by using the target image feature and the target semantic feature to obtain target attribute information of the target attribute.
The server can fuse the target image features and the target semantic features, so that attribute information prediction of target attributes can be performed based on the fused features, and the predicted result is the target attribute information. There are various ways to fuse the target image feature and the target semantic feature, for example, one feature is used to assist another feature in predicting attribute information, and for example, the two features are used together to predict attribute information, which is not limited in this embodiment.
The number of target attributes may be one or more, and may be extracted from the target text data or may be preconfigured. When the number of the target attributes is plural, attribute information of different target attributes may be predicted respectively, for example, prediction of plural kinds of attribute information may be performed using different attribute prediction models, and for example, prediction of plural kinds of attribute information may be performed using the same attribute prediction model, and the attribute prediction model may predict plural kinds of attribute information simultaneously or predict only one piece of attribute information at the same time.
The obtained target attribute information may be used to complement missing attribute information, for example, to complement the attribute information of the target attribute in the attribute list of the target object. The operation of supplementing the attribute information may be performed by the server, may be performed by the terminal device, or may be performed by both.
It should be noted that, the method for acquiring attribute information in this embodiment may also be performed by a terminal device, for example, the terminal device may obtain the target text data and the target image data by selecting from a local area, acquiring from another terminal device or a server, and predicting attribute information of a target attribute using the obtained target text data and the target image data, which is similar to the foregoing, and will not be described herein.
Through the above steps S202 to S206, target text data of the target object and target image data associated with the target text data are acquired; target semantic features of the target text data and target image features of the target image data are extracted; and the target attribute of the target object is predicted by using the target image features and the target semantic features to obtain target attribute information of the target attribute. This solves the problem in the related art that attribute-value completion is inefficient because information extraction is inaccurate, improves the accuracy of information extraction, and improves the efficiency of attribute-information completion.
As an alternative embodiment, extracting the target semantic features of the target text data includes:
s11, encoding the target text data by using a target bi-directional language model to obtain target semantic features output by the bi-directional language model, wherein the target bi-directional language model is a pre-trained language representation model for extracting semantic representation of each text unit of the input text data.
The server may encode the target text data using the target bi-directional language model, i.e., the server may input the target text data into the target bi-directional language model. The target bi-directional language model is a pre-trained language representation model that can be used to extract a semantic representation (i.e., a text vector) of each text unit of input text data, resulting in target semantic features.
For example, for a product as shown in fig. 3, the terminal device may use the BERT (Bidirectional Encoder Representations from Transformers) model to encode the natural-language description text of the product (an example of the target text data described above) to obtain a semantic representation of the text h = (h_0, h_1, h_2, …, h_N), where N is the total length of the input text, i.e., the number of words contained in the natural-language description text.
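As an illustration only (not part of the patent text), the following minimal sketch shows how such per-text-unit semantic representations could be obtained with a pre-trained BERT model via the Hugging Face transformers library; the checkpoint name and the English example sentence are assumptions made for the sketch.

    import torch
    from transformers import BertTokenizer, BertModel

    # Minimal sketch: encode a product description into per-token vectors h_0..h_N.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
    bert = BertModel.from_pretrained("bert-base-uncased")
    bert.eval()

    def encode_text(description: str) -> torch.Tensor:
        inputs = tokenizer(description, return_tensors="pt")
        with torch.no_grad():
            outputs = bert(**inputs)
        # last_hidden_state has shape (1, N, hidden_size): one vector per text unit.
        return outputs.last_hidden_state.squeeze(0)

    h = encode_text("the garment is a golden stand-collar T-shirt")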
According to the embodiment, the semantic feature extraction of the text data is performed through the two-way language model, so that the accuracy of the semantic feature extraction can be improved.
As an alternative embodiment, extracting the target image features of the target image data comprises:
s21, inputting target image data into a target convolutional neural network, wherein the target convolutional neural network is a pre-trained residual network for extracting image features of an input image;
s22, extracting the characteristics output by the previous convolution layer of the full-connection layer of the target convolution neural network to obtain target image characteristics.
The server may extract image features of the image data using a target convolutional neural network, which may be any convolutional neural network capable of image feature extraction. Alternatively, the target convolutional neural network may be a pre-trained residual network that may be used to extract image features of the input image.
The target convolutional neural network may be a residual network containing multiple layers (e.g., convolutional layers, fully connected layers, etc.). The server may input the target image data into the target convolutional neural network and then extract the features output by its last convolutional layer (i.e., the convolutional layer immediately preceding the fully connected layer), thereby obtaining the target image features.
For example, the server may use the pre-trained convolutional neural network ResNet-101 to encode the product picture (an example of the target image data) and extract the features of the conv5 layer, i.e., v = (v_0, v_1, v_2, …, v_49).
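A minimal sketch of this step (assuming torchvision and a 224x224 input; not taken from the patent) would drop the average-pooling and fully connected layers of ResNet-101 so that the output of the last convolutional stage (conv5) is kept as the set of region vectors:

    import torch
    from torchvision import models

    resnet = models.resnet101(pretrained=True)
    # Keep everything up to (and including) the conv5 stage; drop avgpool and fc.
    conv5_extractor = torch.nn.Sequential(*list(resnet.children())[:-2])
    conv5_extractor.eval()

    def encode_image(image_tensor: torch.Tensor) -> torch.Tensor:
        # image_tensor: (1, 3, 224, 224), already normalised (assumption)
        with torch.no_grad():
            feature_map = conv5_extractor(image_tensor)      # (1, 2048, 7, 7)
        # Flatten the 7x7 spatial grid into a sequence of region vectors v_j.
        return feature_map.flatten(2).permute(0, 2, 1).squeeze(0)  # (49, 2048)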
By the embodiment, the accuracy of image feature extraction can be improved by using the residual network to extract the image features of the image.
As an alternative embodiment, predicting the target attribute of the target object using the target image feature and the target semantic feature, to obtain target attribute information of the target attribute includes:
s31, performing cross-modal attention fusion on the target image features and the target semantic features to obtain target fusion features;
s32, inputting the target fusion characteristics into the target attribute prediction model to obtain target attribute information output by the target attribute prediction model.
After the target image features and the target semantic features are obtained, the server can perform cross-modal attention fusion on the target image features and the target semantic features to obtain target fusion features. Because the cross-modal attention fusion of the target image features and the target semantic features is performed, the target object can be more comprehensively and accurately represented.
The server may input the target fusion feature to a pre-trained target attribute prediction model, and after the target fusion feature is input to the target attribute prediction model, the server may obtain target attribute information of the target attribute output by the target attribute prediction model.
The target attribute prediction model may predict attribute information for one or more target attributes. The types of attributes that each attribute prediction model can predict may be preconfigured, and each attribute may correspond to multiple pieces of candidate attribute information. From the input information, the fully connected layer of the attribute prediction model can output a probability for each piece of candidate attribute information, so that the attribute information of the attribute can be determined from these probabilities.
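Purely as an illustration of such pre-configured candidate sets and the probability output of the fully connected layer (the names and dimensions are assumptions, not the patent's implementation), one possible arrangement is one classifier head per target attribute over the fused feature:

    import torch
    import torch.nn as nn

    class MultiAttributeHeads(nn.Module):
        """Illustrative sketch: one fully connected head per pre-configured attribute."""
        def __init__(self, fused_dim: int, candidate_values: dict):
            super().__init__()
            # candidate_values, e.g. {"collar_type": ["stand collar", "round collar"],
            #                         "color": ["gold", "black"]}  (assumed example)
            self.candidate_values = candidate_values
            self.heads = nn.ModuleDict({
                attr: nn.Linear(fused_dim, len(values))
                for attr, values in candidate_values.items()
            })

        def forward(self, fused_feature: torch.Tensor) -> dict:
            # One probability per candidate attribute value, for every target attribute.
            return {attr: torch.softmax(head(fused_feature), dim=-1)
                    for attr, head in self.heads.items()}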
By means of the method and the device, cross-modal attention fusion is conducted on the target image features and the target semantic features, and accuracy of attribute information prediction can be improved.
As an alternative embodiment, the target semantic features may comprise a first coding vector for each text unit in the target text data, and the target fusion features may comprise a target coding vector for each text unit. Correspondingly, performing cross-modal attention fusion on the target image features and the target semantic features to obtain the target fusion features includes the following steps:
s41, performing cross-modal attention coding on target semantic features by using target image features to obtain second coding vectors of each text unit;
S42, filtering the second coding vector of each text unit by using a cross-modal attention filter to obtain a third coding vector of each text unit, wherein the cross-modal attention filter is used for filtering image information in the second coding vector of the text unit irrelevant to the target image data according to bits;
s43, splicing the first coding vector of each text unit with the third coding vector of each text unit to obtain a target coding vector of each text unit, wherein the target fusion characteristic comprises the target coding vector of each text unit.
In the present embodiment, the image data may be used to assist the text data in extracting the attribute information of the target attribute. After deriving the first coding vector of each text unit, the server may perform cross-modal attention encoding on the target semantic features using the target image features, e.g., perform a secondary encoding of the semantic representation of the text using a global cross-modal attention mechanism, yielding a second coding vector for each text unit (which may be denoted h_i′). The second coding vector is the coding vector after cross-modal attention encoding and contains multi-modal semantic information.
The text data may contain some text that is unrelated to the image information, and for such text the image information should not be used when calculating its secondary coding vector. To this end, the server may preconfigure a cross-modal attention filter, which can be used to filter out (e.g., by bitwise filtering of the input vector) the image information in the second coding vector of a text unit that is unrelated to the target image data.
After deriving the second coding vector of each text unit, the server may use a cross-modal attention filter g_i to filter the second coding vector of each text unit, obtaining the third coding vector of each text unit, e.g., h_i′ ⊙ g_i, where ⊙ denotes bitwise (element-wise) multiplication. The final text encoding (i.e., the target coding vector) concatenates the original encoding (i.e., the first coding vector h_i) with the multi-modal filtered encoding (i.e., the third coding vector), which can be expressed as h_i″ = [h_i ; h_i′ ⊙ g_i], where [A; B] denotes the concatenation of vectors A and B.
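A minimal sketch of this gate-and-concatenate step (a sketch under the assumption that the filter values lie in [0, 1]; not quoted from the patent) is:

    import torch

    def fuse_text_unit(h_i: torch.Tensor, h_i_prime: torch.Tensor, g_i: torch.Tensor) -> torch.Tensor:
        # h_i       : first coding vector (original text encoding)
        # h_i_prime : second coding vector (after cross-modal attention encoding)
        # g_i       : cross-modal attention filter for this text unit
        filtered = h_i_prime * g_i                    # bitwise multiplication h_i' (.) g_i
        return torch.cat([h_i, filtered], dim=-1)     # target coding vector [h_i ; h_i' (.) g_i]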
By this embodiment, performing cross-modal attention encoding on the target semantic features using a global cross-modal attention mechanism improves the ability of the text coding vectors to represent the target object.
as an alternative embodiment, cross-modal attention encoding the target semantic feature using the target image feature, obtaining the second encoding vector for each text unit includes:
S51, determining a second coding vector of each text unit according to a first attention vector, a second attention vector and a cross-modal mapping matrix, wherein the first attention vector is an attention vector of a text mode corresponding to a target semantic feature, the second attention vector is a cross-modal attention vector of a text and picture mode corresponding to the target semantic feature and the target image feature, and the cross-modal mapping matrix is used for mapping the second attention vector from a visual semantic space to a text semantic space.
When secondarily encoding the semantic representation of the text using the global cross-modal attention mechanism, the server may obtain, respectively: the attention vector of the text modality corresponding to the target semantic features (the single-modal attention vector), i.e., the first attention vector; the cross-modal attention vector of the text and picture modalities corresponding to the target semantic features and the target image features (the cross-modal attention vector), i.e., the second attention vector; and a cross-modal mapping matrix for mapping the second attention vector from the visual semantic space to the text semantic space.
Alternatively, the first attention vector may be calculated using formula (1), in which the attention vector of the text modality is obtained from a scaled dot-product scoring function for the text modality; that scoring function is given by equation (2), whose inputs are the query vector sequence of the text modality, the key vector sequence of the text modality, and d, the dimension of the input information.
Alternatively, the second attention vector may be calculated using equation (3), in which the cross-modal attention vector of the text and image modalities is obtained from a scaled dot-product scoring function for the text and image modalities; that scoring function is given by equation (4), whose inputs are the query vector sequence of the text and image modalities, the key vector sequence of the text and image modalities, and d, the dimension of the input information.
After deriving the first attention vector and the second attention vector, the server may calculate the second coding vector using equation (5), whose inputs are the value vector sequence of the text modality, the value vector sequence of the text and image modalities, and the cross-modal mapping matrix W_g, whose purpose is to map the picture encoding vector from the visual semantic space to the text semantic space.
Alternatively, the picture-information filter (i.e., the foregoing cross-modal attention filter) applied to obtain the third coding vector in the above manner may be as shown in equation (6), where W_1 and W_2 are parameter matrices.
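In the published text, equations (1) to (6) appear only as images, so their exact notation cannot be quoted here. As a rough reconstruction from the surrounding definitions (query, key and value sequences per modality, a scaled dot-product scoring function, a cross-modal mapping matrix, and a gated filter), they plausibly take the following form; the symbols and the softmax/sigmoid choices are assumptions, not quotations of the patent:

    % Assumed reconstruction of equations (1)-(6); notation is illustrative only.
    \begin{align*}
    a_i^{tt} &= \mathrm{softmax}\big(s^{tt}(Q^t, K^t)\big)            && (1) \\
    s^{tt}(Q^t, K^t) &= \frac{Q^t (K^t)^{\top}}{\sqrt{d}}             && (2) \\
    a_i^{tv} &= \mathrm{softmax}\big(s^{tv}(Q^t, K^v)\big)            && (3) \\
    s^{tv}(Q^t, K^v) &= \frac{Q^t (K^v)^{\top}}{\sqrt{d}}             && (4) \\
    h_i' &= a_i^{tt} V^t + W_g \, a_i^{tv} V^v                        && (5) \\
    g_i &= \sigma\big(W_1 h_i + W_2 h_i'\big)                         && (6)
    \end{align*}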
According to this embodiment, the cross-modal coding vector of each text unit is generated using the single-modal attention vector, the cross-modal attention vector and the cross-modal mapping matrix, so that the efficiency of cross-modal fusion of the semantic features and the image features can be improved.
As an optional embodiment, inputting the target fusion feature into the target attribute prediction model, and obtaining the target attribute information output by the target attribute prediction model includes:
s61, inputting the target coding vector of each text unit into a target attribute prediction model to obtain an attribute prediction result corresponding to each text unit;
s62, determining target attribute information of the target attribute according to the attribute prediction result corresponding to each text unit.
When predicting the attribute information of the target attribute, the server may input the target encoding vector of each text unit into the target attribute prediction model to obtain an attribute prediction result corresponding to each text unit. By integrating the attribute prediction results corresponding to each text unit, the server can determine target attribute information of the target attribute.
For example, the final text encoding vector (i.e., the target coding vector) is input into the attribute prediction model and the product attribute corresponding to the input text is calculated; the calculation formula is shown in formula (7), where W_2 is a parameter matrix and y_i is the probability distribution of the predicted attribute. Correspondingly, when training the attribute prediction model, the loss function used is Loss = CrossEntropy(y_i, z_i), where z_i is the true attribute.
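To make formula (7) and the training objective concrete, here is a minimal sketch; the single linear projection and the softmax are assumptions consistent with, but not quoted from, the description above:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttributePredictor(nn.Module):
        """Illustrative per-text-unit attribute prediction head."""
        def __init__(self, encoding_dim: int, num_candidate_values: int):
            super().__init__()
            self.proj = nn.Linear(encoding_dim, num_candidate_values)  # parameter matrix

        def forward(self, target_encoding: torch.Tensor) -> torch.Tensor:
            return self.proj(target_encoding)                 # unnormalised scores

        def predict(self, target_encoding: torch.Tensor) -> torch.Tensor:
            # y_i: probability distribution over candidate attribute values
            return F.softmax(self.forward(target_encoding), dim=-1)

    def attribute_loss(model: AttributePredictor, target_encoding: torch.Tensor,
                       true_attribute_idx: torch.Tensor) -> torch.Tensor:
        # Loss = CrossEntropy(y_i, z_i); cross_entropy takes logits and class indices.
        return F.cross_entropy(model(target_encoding), true_attribute_idx)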
By the embodiment, the attribute information of the target attribute of the target object is predicted by using the target coding vector of each text unit, so that the accuracy of the attribute information prediction can be improved.
The following explains the method for acquiring attribute information in the embodiment of the present application with reference to an alternative example. For a multi-modal product attribute information extraction task, in this example, with the aid of a product picture, attributes of a product and corresponding attribute values thereof are extracted from a product description text.
As shown in fig. 4, the flow of the acquisition method of attribute information in this alternative example may include the steps of:
step S402, obtaining a product description text and a product picture of a product.
Step S404, BERT is used to encode the product description text, obtaining the semantic representation of the text, namely the original encoding of the text.
Step S406, the product picture is encoded using the pre-trained convolutional neural network ResNet-101, and the features of the conv5 layer are extracted to obtain the encoding vector of the product picture.
Step S408, performing secondary coding on the semantic representation of the text by using a global cross-modal attention mechanism according to the coding vector of the picture, and performing multi-modal filtering on the coding vector after cross-modal attention coding to obtain a multi-modal filtered code.
Step S410, splicing the original codes and the codes subjected to multi-mode filtering to obtain final text coding vectors, inputting the final text coding vectors into an attribute prediction model, and calculating product attributes corresponding to the input text.
Taking fig. 3 as an example, by supplementing product attribute information from the product description text with the aid of the commodity picture, the product attribute information contained in the product description text can be output in the form of "attribute: attribute value" pairs, for example "collar type: stand collar" and "color: gold".
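The following sketch composes steps S402 to S410 end to end, reusing the hypothetical helpers from the earlier sketches (encode_text, encode_image, AttributePredictor); the cross-modal encoding and filtering modules are passed in as callables and are assumptions, not the patent's implementation:

    import torch

    def extract_attributes(description, image_tensor, cross_modal_encode, cross_modal_filter, predictor):
        h = encode_text(description)                   # S404: original per-token encodings h_i
        v = encode_image(image_tensor)                 # S406: conv5 region vectors v_j
        h_prime = cross_modal_encode(h, v)             # S408: secondary, cross-modal encoding h_i'
        g = cross_modal_filter(h, h_prime)             # S408: cross-modal attention filter g_i
        h_final = torch.cat([h, h_prime * g], dim=-1)  # S410: final text coding vectors h_i''
        return predictor.predict(h_final)              # per-text-unit attribute distributions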
According to this embodiment, supplementing product attribute information with the help of the picture can improve the efficiency of completing product attribute information; constructing the secondary text coding vector with a global cross-modal attention mechanism can improve the ability of the text coding vector to represent the target object; and using the cross-modal attention filter to filter out picture information that is irrelevant to the text can improve the accuracy of completing product attribute information.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM (Read-Only Memory)/RAM (Random Access Memory ), magnetic disk, optical disc), including instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method described in the embodiments of the present application.
According to another aspect of the embodiments of the present application, there is also provided an attribute information acquisition apparatus for implementing the above-described attribute information acquisition method. Fig. 5 is a block diagram of an alternative attribute information acquisition apparatus according to an embodiment of the present application, and as shown in fig. 5, the apparatus may include:
an acquiring unit 502, configured to acquire target text data of a target object and target image data having an association relationship with the target text data;
an extracting unit 504, connected to the acquiring unit 502, for extracting target semantic features of the target text data and extracting target image features of the target image data;
and the prediction unit 506 is connected to the extraction unit 504, and is configured to predict the target attribute of the target object by using the target image feature and the target semantic feature, so as to obtain target attribute information of the target attribute.
It should be noted that, the acquiring unit 502 in this embodiment may be used to perform the step S202 described above, the extracting unit 504 in this embodiment may be used to perform the step S204 described above, and the predicting unit 506 in this embodiment may be used to perform the step S206 described above.
Through the above modules, target text data of the target object and target image data associated with the target text data are acquired; target semantic features of the target text data and target image features of the target image data are extracted; and the target attribute of the target object is predicted by using the target image features and the target semantic features to obtain target attribute information of the target attribute. This solves the problem in the related art that attribute-value completion is inefficient because information extraction is inaccurate, improves the accuracy of information extraction, and improves the efficiency of attribute-information completion.
As an alternative embodiment, the extraction unit 504 includes:
the first extraction module is used for encoding the target text data by using a target bi-directional language model to obtain target semantic features output by the bi-directional language model, wherein the target bi-directional language model is a pre-trained language representation model used for extracting semantic representation of each text unit in the input text data.
As an alternative embodiment, the extraction unit 504 includes:
the first input module is used for inputting target image data into a target convolutional neural network, wherein the target convolutional neural network is a pre-trained residual network used for extracting image features of an input image;
and the extraction module is used for extracting the characteristics output by the previous convolution layer of the full-connection layer of the target convolution neural network to obtain the target image characteristics.
As an alternative embodiment, the prediction unit 506 includes:
the fusion module is used for carrying out cross-modal attention fusion on the target image features and the target semantic features to obtain target fusion features;
and the second input module is used for inputting the target fusion characteristics into the target attribute prediction model to obtain target attribute information output by the target attribute prediction model.
As an alternative embodiment, the target semantic feature comprises a first encoding vector for each text unit in the target text data; the fusion module comprises:
the coding sub-module is used for performing cross-modal attention coding on the target semantic features by using the target image features to obtain a second coding vector of each text unit;
a filtering sub-module, configured to filter the second coding vector of each text unit by using a cross-modal attention filter, to obtain a third coding vector of each text unit, where the cross-modal attention filter is configured to filter, by bits, image information in the second coding vector of the text unit that is unrelated to the target image data;
and the splicing module is used for splicing the first coding vector of each text unit with the third coding vector of each text unit to obtain a target coding vector of each text unit, wherein the target fusion characteristic comprises the target coding vector of each text unit.
As an alternative embodiment, the encoding submodule includes:
a determining subunit, configured to determine a second encoding vector of each text unit according to a first attention vector, a second attention vector and a cross-modal mapping matrix, where the first attention vector is an attention vector of a text modality corresponding to the target semantic feature, the second attention vector is a cross-modal attention vector of a text and picture modality corresponding to the target semantic feature and the target image feature, and the cross-modal mapping matrix is configured to map the second attention vector from the visual semantic space to the text semantic space.
As an alternative embodiment, the second input module includes:
the input sub-module is used for inputting the target coding vector of each text unit into the target attribute prediction model to obtain an attribute prediction result corresponding to each text unit;
and the determining submodule is used for determining target attribute information of the target attribute according to the attribute prediction result corresponding to each text unit.
It should be noted that the above modules are the same as examples and application scenarios implemented by the corresponding steps, but are not limited to what is disclosed in the above embodiments. It should be noted that the above modules may be implemented in software or in hardware as part of the apparatus shown in fig. 1, where the hardware environment includes a network environment.
According to still another aspect of the embodiments of the present application, there is further provided an electronic device for implementing the above-mentioned method for obtaining attribute information, where the electronic device may be a server, a terminal, or a combination thereof.
Fig. 6 is a block diagram of an alternative electronic device, according to an embodiment of the present application, including a processor 602, a communication interface 604, a memory 606, and a communication bus 608, as shown in fig. 6, wherein the processor 602, the communication interface 604, and the memory 606 communicate with each other via the communication bus 608, wherein,
A memory 606 for storing a computer program;
the processor 602, when executing the computer program stored on the memory 606, performs the following steps:
s1, acquiring target text data of a target object and target image data with an association relation with the target text data;
s2, extracting target semantic features of target text data and extracting target image features of target image data;
s3, predicting the target attribute of the target object by using the target image feature and the target semantic feature to obtain target attribute information of the target attribute.
Alternatively, in the present embodiment, the above-described communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one thick line is shown in fig. 6, but this does not mean that there is only one bus or one type of bus. The communication interface is used for communication between the electronic device and other devices.
The memory may include RAM or may include non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
As an example, the memory 606 may store, but is not limited to, the acquisition unit 502, the extraction unit 504, and the prediction unit 506 of the above-described apparatus for acquiring attribute information. Other module units of the above-described apparatus for acquiring attribute information may also be included, which are not described in detail in this example.
The processor may be a general purpose processor and may include, but is not limited to: CPU (Central Processing Unit ), NP (Network Processor, network processor), etc.; but also DSP (Digital Signal Processing, digital signal processor), ASIC (Application Specific Integrated Circuit ), FPGA (Field-Programmable Gate Array, field programmable gate array) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments, and this embodiment is not described herein.
It will be understood by those skilled in the art that the structure shown in fig. 6 is only schematic, and the device implementing the above-mentioned method for obtaining attribute information may be a terminal device, and the terminal device may be a smart phone (such as an Android mobile phone, an iOS mobile phone, etc.), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, etc. Fig. 6 is not limited to the structure of the electronic device described above. For example, the electronic device may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in FIG. 6, or have a different configuration than shown in FIG. 6.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program for instructing a terminal device to execute in association with hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, ROM, RAM, magnetic or optical disk, etc.
According to yet another aspect of embodiments of the present application, there is also provided a storage medium. Alternatively, in the present embodiment, the storage medium described above may be used to execute the program code of the acquisition method of attribute information of any of the above-described items in the embodiments of the present application.
Alternatively, in this embodiment, the storage medium may be located on at least one network device of the plurality of network devices in the network shown in the above embodiment.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of:
s1, acquiring target text data of a target object and target image data with an association relation with the target text data;
s2, extracting target semantic features of target text data and extracting target image features of target image data;
s3, predicting the target attribute of the target object by using the target image feature and the target semantic feature to obtain target attribute information of the target attribute.
Alternatively, specific examples in the present embodiment may refer to examples described in the above embodiments, which are not described in detail in the present embodiment.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: various media capable of storing program codes, such as a U disk, ROM, RAM, a mobile hard disk, a magnetic disk or an optical disk.
The foregoing embodiment numbers of the present application are merely for describing, and do not represent advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. Based on such an understanding, the part of the technical solution of the present application that in essence contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
In the above embodiments of the present application, the description of each embodiment has its own emphasis; for a part not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division of the units is merely a logical functional division, and there may be other divisions in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, units, or modules, and may be in electrical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make several improvements and modifications without departing from the principles of the present application, and such improvements and modifications shall also fall within the protection scope of the present application.

Claims (8)

1. A method for acquiring attribute information, characterized by comprising:
acquiring target text data of a target object and target image data having an association relationship with the target text data;
extracting target semantic features of the target text data and extracting target image features of the target image data;
predicting a target attribute of the target object by using the target image features and the target semantic features to obtain target attribute information of the target attribute;
wherein predicting the target attribute of the target object by using the target image features and the target semantic features to obtain the target attribute information of the target attribute comprises:
performing cross-modal attention fusion on the target image features and the target semantic features to obtain target fusion features;
inputting the target fusion features into a target attribute prediction model to obtain the target attribute information output by the target attribute prediction model;
wherein the target semantic features comprise a first coding vector of each text unit in the target text data;
and wherein performing cross-modal attention fusion on the target image features and the target semantic features to obtain the target fusion features comprises:
using the target image features to perform cross-modal attention coding on the target semantic features to obtain a second coding vector of each text unit;
filtering the second coding vector of each text unit by using a cross-modal attention filter to obtain a third coding vector of each text unit, wherein the cross-modal attention filter is configured to filter out, bit by bit, image information in the second coding vector of a text unit that is irrelevant to the target image data;
and splicing the first coding vector of each text unit with the third coding vector of each text unit to obtain a target coding vector of each text unit, wherein the target fusion features comprise the target coding vector of each text unit.
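For readers who prefer code, the following PyTorch sketch is one possible reading of the fusion steps of claim 1: cross-modal attention produces the second coding vector, an element-wise sigmoid gate plays the role of the cross-modal attention filter, and the first and third coding vectors are spliced. The sigmoid gate, the dot-product attention form, and all dimensions are assumptions, not the claimed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionFusion(nn.Module):
    """Sketch of the fusion in claim 1 under assumed shapes and gating form.

    first_vecs  : first coding vectors of the text units, (batch, T, d_t)
    image_feats : region features of the target image,    (batch, R, d_v)
    returns     : target coding vectors,                   (batch, T, 2 * d_t)
    """

    def __init__(self, d_t=768, d_v=2048):
        super().__init__()
        self.visual_to_text = nn.Linear(d_v, d_t)   # cross-modal mapping
        self.gate = nn.Linear(2 * d_t, d_t)         # element-wise attention filter

    def forward(self, first_vecs, image_feats):
        mapped = self.visual_to_text(image_feats)                       # (B, R, d_t)
        # Cross-modal attention: each text unit attends over image regions,
        # producing its second coding vector.
        scores = torch.matmul(first_vecs, mapped.transpose(1, 2))       # (B, T, R)
        attn = F.softmax(scores / mapped.size(-1) ** 0.5, dim=-1)
        second_vecs = torch.matmul(attn, mapped)                        # (B, T, d_t)
        # Cross-modal attention filter: a per-dimension ("bit-wise") gate that
        # suppresses image information unrelated to this text unit.
        gate = torch.sigmoid(self.gate(torch.cat([first_vecs, second_vecs], dim=-1)))
        third_vecs = gate * second_vecs                                 # (B, T, d_t)
        # Target coding vector: splice the first and third coding vectors.
        return torch.cat([first_vecs, third_vecs], dim=-1)              # (B, T, 2*d_t)

fusion = CrossModalAttentionFusion()
out = fusion(torch.randn(2, 12, 768), torch.randn(2, 49, 2048))
print(out.shape)  # torch.Size([2, 12, 1536])
```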
2. The method of claim 1, wherein extracting the target semantic features of the target text data comprises:
encoding the target text data by using a target bi-directional language model to obtain the target semantic features output by the target bi-directional language model, wherein the target bi-directional language model is a pre-trained language representation model for extracting a semantic representation of each text unit of input text data.
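As an illustrative, non-prescriptive instantiation of claim 2, the sketch below uses the Hugging Face transformers library with the pre-trained bert-base-chinese checkpoint as the bi-directional language representation model; the model choice and the example product title are assumptions.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed model choice: any pre-trained bi-directional language representation
# model would fit the claim; bert-base-chinese is used only as an example.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

text = "红色纯棉连衣裙"  # hypothetical target text data (a product title)
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One vector per text unit (token): the first coding vectors that make up the
# target semantic features.
first_coding_vectors = outputs.last_hidden_state  # shape (1, seq_len, 768)
print(first_coding_vectors.shape)
```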
3. The method of claim 1, wherein extracting the target image features of the target image data comprises:
inputting the target image data into a target convolutional neural network, wherein the target convolutional neural network is a pre-trained residual network for extracting image features of an input image;
and extracting the features output by the convolution layer preceding the fully-connected layer of the target convolutional neural network to obtain the target image features.
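One way claim 3 might be realized is sketched below with a torchvision ResNet-50, taking the feature maps of the final convolutional stage before the fully-connected layer; the backbone choice, the preprocessing, and the file name product.jpg are assumptions.

```python
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

# Assumed backbone: a pre-trained ResNet-50 (the claim only requires a
# pre-trained residual network).
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.eval()

# Keep everything up to the last convolutional stage, dropping the global
# pooling and the fully-connected layer, so the output is the feature map of
# the convolution layer preceding the fully-connected layer.
feature_extractor = nn.Sequential(*list(resnet.children())[:-2])

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("product.jpg").convert("RGB")  # hypothetical target image data
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    target_image_features = feature_extractor(batch)  # shape (1, 2048, 7, 7)
print(target_image_features.shape)
```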
4. The method of claim 1, wherein performing cross-modal attention coding on the target semantic features by using the target image features to obtain the second coding vector of each text unit comprises:
determining the second coding vector of each text unit according to a first attention vector, a second attention vector, and a cross-modal mapping matrix, wherein the first attention vector is an attention vector of the text modality corresponding to the target semantic features, the second attention vector is a cross-modal attention vector of the text and image modalities corresponding to the target semantic features and the target image features, and the cross-modal mapping matrix is used for mapping the picture coding vector of the target image data from a visual semantic space to a text semantic space.
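Claim 4 fixes three ingredients (a text-modality attention vector, a text-image cross-modal attention vector, and a cross-modal mapping matrix) but not how they are combined; the sketch below assumes scaled dot-product attention for both and a simple sum of the two attended results, purely to make the data flow concrete.

```python
import torch
import torch.nn.functional as F

def second_coding_vectors(text_feats, image_feats, W_map):
    """One assumed instantiation of claim 4 (the combination rule is a guess).

    text_feats  : (T, d_t) first coding vectors of the text units
    image_feats : (R, d_v) target image features (one row per image region)
    W_map       : (d_v, d_t) cross-modal mapping matrix, visual -> text space
    """
    mapped = image_feats @ W_map                                           # (R, d_t)
    d = text_feats.size(-1)
    # First attention vector(s): attention within the text modality.
    text_attn = F.softmax(text_feats @ text_feats.T / d ** 0.5, dim=-1)    # (T, T)
    # Second attention vector(s): cross-modal attention of text over image regions.
    cross_attn = F.softmax(text_feats @ mapped.T / d ** 0.5, dim=-1)       # (T, R)
    # Assumed combination: sum of the two attended representations.
    return text_attn @ text_feats + cross_attn @ mapped                    # (T, d_t)

vecs = second_coding_vectors(torch.randn(12, 768), torch.randn(49, 2048),
                             torch.randn(2048, 768))
print(vecs.shape)  # torch.Size([12, 768])
```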
5. The method of claim 1, wherein inputting the target fusion feature into a target attribute prediction model to obtain the target attribute information output by the target attribute prediction model comprises:
inputting the target coding vector of each text unit into the target attribute prediction model to obtain an attribute prediction result corresponding to each text unit;
and determining the target attribute information of the target attribute according to the attribute prediction result corresponding to each text unit.
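As a hedged illustration of claim 5, the sketch below attaches a per-text-unit classifier to the target coding vectors and aggregates the per-unit predictions into an attribute value using an assumed BIO tagging scheme; the label set, the classifier, and the aggregation rule are all assumptions.

```python
import torch
import torch.nn as nn

# Assumed label scheme: per-text-unit BIO tags over the attribute value; the
# claim only requires a per-unit prediction followed by an aggregation step.
LABELS = ["O", "B-ATTR", "I-ATTR"]

class AttributePredictionHead(nn.Module):
    def __init__(self, d_in=1536, num_labels=len(LABELS)):
        super().__init__()
        self.classifier = nn.Linear(d_in, num_labels)

    def forward(self, target_coding_vectors):          # (T, d_in)
        return self.classifier(target_coding_vectors)  # (T, num_labels) logits

def decode_attribute(tokens, logits):
    """Aggregate per-text-unit predictions into the target attribute information."""
    tags = [LABELS[i] for i in logits.argmax(dim=-1).tolist()]
    value = [tok for tok, tag in zip(tokens, tags) if tag != "O"]
    return "".join(value)

head = AttributePredictionHead()
tokens = ["红", "色", "连", "衣", "裙"]  # hypothetical text units
logits = head(torch.randn(len(tokens), 1536))
print(decode_attribute(tokens, logits))
```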
6. An apparatus for acquiring attribute information, comprising:
an acquisition unit, used for acquiring target text data of a target object and target image data having an association relationship with the target text data;
an extraction unit, used for extracting target semantic features of the target text data and extracting target image features of the target image data;
a prediction unit, used for predicting a target attribute of the target object by using the target image features and the target semantic features to obtain target attribute information of the target attribute;
the prediction unit includes:
a fusion module, used for performing cross-modal attention fusion on the target image features and the target semantic features to obtain target fusion features;
a second input module, used for inputting the target fusion features into a target attribute prediction model to obtain the target attribute information output by the target attribute prediction model;
wherein the target semantic features comprise a first coding vector of each text unit in the target text data, and the fusion module comprises:
a coding sub-module, used for performing cross-modal attention coding on the target semantic features by using the target image features to obtain a second coding vector of each text unit;
a filtering sub-module, used for filtering the second coding vector of each text unit by using a cross-modal attention filter to obtain a third coding vector of each text unit, wherein the cross-modal attention filter is configured to filter out, bit by bit, image information in the second coding vector of a text unit that is irrelevant to the target image data;
and a splicing sub-module, used for splicing the first coding vector of each text unit with the third coding vector of each text unit to obtain a target coding vector of each text unit, wherein the target fusion features comprise the target coding vector of each text unit.
7. An electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other via the communication bus, characterized in that:
the memory is used for storing a computer program;
the processor is configured to perform the method steps of any one of claims 1 to 5 by running the computer program stored on the memory.
8. A computer-readable storage medium, characterized in that a computer program is stored in the storage medium, wherein the computer program is arranged to perform the method steps of any one of claims 1 to 5 when run.
CN202110400102.2A 2021-04-14 2021-04-14 Method and device for acquiring attribute information, electronic equipment and storage medium Active CN113052661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110400102.2A CN113052661B (en) 2021-04-14 2021-04-14 Method and device for acquiring attribute information, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113052661A CN113052661A (en) 2021-06-29
CN113052661B true CN113052661B (en) 2024-04-09

Family

ID=76519357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110400102.2A Active CN113052661B (en) 2021-04-14 2021-04-14 Method and device for acquiring attribute information, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113052661B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784372A (en) * 2016-08-24 2018-03-09 阿里巴巴集团控股有限公司 Forecasting Methodology, the device and system of destination object attribute
CN109598387A (en) * 2018-12-14 2019-04-09 华东师范大学 Forecasting of Stock Prices method and system based on two-way cross-module state attention network model
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN112435462A (en) * 2020-10-16 2021-03-02 同盾控股有限公司 Method, system, electronic device and storage medium for short-time traffic flow prediction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070103876A1 (en) * 2005-11-10 2007-05-10 Utstarcom, Inc. Miniaturized form factor wireless communications card for generic mobile information devices

Also Published As

Publication number Publication date
CN113052661A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN109857844B (en) Intent recognition method and device based on ordering dialogue text and electronic equipment
CN109871736B (en) Method and device for generating natural language description information
CN113095346A (en) Data labeling method and data labeling device
CN108984555B (en) User state mining and information recommendation method, device and equipment
CN110489574B (en) Multimedia information recommendation method and device and related equipment
CN111460290B (en) Information recommendation method, device, equipment and storage medium
CN113255908A (en) Method, neural network model and device for service prediction based on event sequence
CN111241298A (en) Information processing method, apparatus and computer readable storage medium
CN110990523A (en) Legal document determining method and system
CN113157941B (en) Service characteristic data processing method, service characteristic data processing device, text generating method, text generating device and electronic equipment
CN113052661B (en) Method and device for acquiring attribute information, electronic equipment and storage medium
CN110827078A (en) Information recommendation method, device, equipment and storage medium
CN115374298B (en) Index-based virtual image data processing method and device
CN117251622A (en) Method, device, computer equipment and storage medium for recommending objects
CN112464087B (en) Recommendation probability output method and device, storage medium and electronic equipment
CN115457598A (en) Acupuncture point identification method and acupuncture point identification network
CN115272370A (en) Image segmentation method and device, storage medium and electronic equipment
CN114782720A (en) Method, device, electronic device, medium, and program product for determining matching of document
CN112434527B (en) Keyword determination method and device, electronic equipment and storage medium
CN111263351B (en) Service processing method, service processing device, electronic device and storage medium
CN111445282B (en) Service processing method, device and equipment based on user behaviors
CN114398907A (en) Dynamic topic recommendation method and device, storage medium and electronic equipment
CN113254788A (en) Big data based recommendation method and system and readable storage medium
CN113469197A (en) Image-text matching method, device, equipment and storage medium
CN116703687B (en) Image generation model processing, image generation method, image generation device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176
Applicant after: Jingdong Technology Holding Co.,Ltd.
Address before: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176
Applicant before: Jingdong Digital Technology Holding Co.,Ltd.
GR01 Patent grant