CN114548323A - Commodity classification method, equipment and computer storage medium

Commodity classification method, equipment and computer storage medium

Info

Publication number
CN114548323A
Authority
CN
China
Prior art keywords: image, text, feature, information, commodity
Prior art date
Legal status: Pending
Application number
CN202210406335.8A
Other languages
Chinese (zh)
Inventor
章宦记
孙可嘉
李彤
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202210406335.8A
Publication of CN114548323A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a commodity classification method, commodity classification equipment and a computer storage medium. The method comprises the following steps: acquiring commodity information to be processed, wherein the commodity information comprises image information and text information; performing feature extraction on the commodity information by using an image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information; performing feature extraction on the image information by using an image feature extraction model to obtain image features corresponding to the commodity information; performing feature extraction on the text information by using a text feature extraction model to obtain text features corresponding to the commodity information; and processing the image-text fusion features, the image features and the text features by using a deep neural network model to obtain category information corresponding to the commodity information. According to the technical scheme provided by the embodiment, different features of the commodity information are extracted through a plurality of different feature extraction models, and these features are processed by the deep neural network model, so that the commodity category information can be stably obtained.

Description

Commodity classification method, equipment and computer storage medium
Technical Field
The present invention relates to the field of commodity classification, and in particular, to a method and apparatus for commodity classification and a computer storage medium.
Background
In e-commerce application scenarios, mounting commodities to the related categories has always been a pain point and a difficulty to be explored. At present, key information such as the commodity title is generally matched against related categories for mounting and classification. However, this places a high requirement on the expression of the commodity title: the accuracy of the title expression directly determines the accuracy of commodity mounting and classification, and if the accuracy and professionalism of the commodity expression are poor, subsequent identification of the commodity category and related category identification operations are affected, for example: identifying the same style, judging similar articles, and the like.
Disclosure of Invention
The embodiment of the invention provides a commodity classification method, equipment and a computer storage medium, which can extract different features of commodity information by combining an image-text fusion feature model, an image feature extraction model and a text feature extraction model, and then analyze and process the different features by using a deep neural network model to obtain the category information of the commodity information, thereby effectively improving the accuracy and reliability of identifying the commodity category.
In a first aspect, an embodiment of the present invention provides a method for classifying a commodity, including:
acquiring commodity information to be processed, wherein the commodity information comprises image information and text information;
feature extraction is carried out on the commodity information by utilizing a picture-text fusion feature model, and picture-text fusion features corresponding to the commodity information are obtained;
performing feature extraction on the image information by using an image feature extraction model to obtain image features corresponding to the commodity information;
performing feature extraction on the text information by using a text feature extraction model to obtain text features corresponding to the commodity information;
and processing the image-text fusion characteristics, the image characteristics and the text characteristics by using a deep neural network model to obtain the category information corresponding to the commodity information.
In a second aspect, an embodiment of the present invention provides a product sorting apparatus, including:
the system comprises a first acquisition module, a second acquisition module and a processing module, wherein the first acquisition module is used for acquiring commodity information to be processed, and the commodity information comprises image information and text information;
the first extraction module is used for extracting the characteristics of the commodity information by using a picture-text fusion characteristic model to obtain picture-text fusion characteristics corresponding to the commodity information;
the first extraction module is further used for extracting the features of the image information by using an image feature extraction model to obtain the image features corresponding to the commodity information;
the first extraction module is further used for extracting features of the text information by using a text feature extraction model to obtain text features corresponding to the commodity information;
and the first processing module is used for processing the image-text fusion characteristics, the image characteristics and the text characteristics by using a deep neural network model to obtain the category information corresponding to the commodity information.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the processor, implement the method of classifying items in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer storage medium for storing a computer program, where the computer program is used to make a computer execute a method for classifying commodities in the first aspect.
In a fifth aspect, an embodiment of the present invention provides a computer program product, including: a computer-readable storage medium storing computer instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the method of classifying items as described above in relation to the first aspect.
According to the technical scheme provided by the embodiment, the commodity information to be processed is acquired; then, a feature extraction operation is performed on the image information and the text information in the commodity information by using the image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information; similarly, a feature extraction operation is performed on the image information in the commodity information by using the image feature extraction model to obtain image features; and feature extraction is performed on the text information by using the text feature extraction model to obtain text features corresponding to the commodity information. After the image-text fusion features, the image features and the text features are obtained, they can be processed by using the deep neural network model, so that the category information corresponding to the commodity information can be stably obtained. The commodity classification operation is thus effectively realized by fusing features of different dimensions obtained from multiple models, which effectively improves the accuracy and reliability of identifying the commodity category and further improves the practicability of the method.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic diagram illustrating a commodity classification method according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of a commodity classification method according to an embodiment of the present invention;
Fig. 3 is a schematic flow chart illustrating feature extraction performed on the commodity information by using an image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information according to an embodiment of the present invention;
Fig. 4 is a schematic diagram illustrating the principle by which image-text fusion features corresponding to the commodity information are obtained by extracting features of the commodity information with an image-text fusion feature model according to an embodiment of the present invention;
Fig. 5 is a schematic flow chart of another commodity classification method according to an embodiment of the present invention;
Fig. 6 is a schematic flow chart of another commodity classification method according to an embodiment of the present invention;
Fig. 7 is a schematic flow chart of another commodity classification method according to an embodiment of the present invention;
Fig. 8 is a schematic block diagram of a commodity classification method according to an embodiment of the present invention;
Fig. 9 is an architecture diagram of the Bert model according to an embodiment of the present invention;
Fig. 10 is an architecture diagram of the Resnet image model according to an embodiment of the present invention;
Fig. 11 is an architecture diagram of the deep neural network model (DNN) according to an embodiment of the present invention;
Fig. 12 is a schematic structural diagram of a commodity classification device according to an embodiment of the present invention;
Fig. 13 is a schematic structural diagram of an electronic device corresponding to the commodity classification device provided in the embodiment shown in fig. 12.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two, but does not exclude the presence of at least one.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects, meaning that three relationships may exist, e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "upon determining" or "in response to determining" or "upon detecting (a stated condition or event)" or "in response to detecting (a stated condition or event)", depending on the context.
It is also noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a good or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such good or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional like elements in the good or system in which the element is included.
In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
Definition of terms:
m6: multi-modification to Multi-modification Multitask Mega-transformer, a self-developed very large scale Chinese pre-training model.
And (5) Bert: bidirectional Encoder replication from transformations, a pre-trained language characterization model.
Resnet: residual Network, a deep learning neural Network.
DNN: deep Neural Networks, which can be understood as Neural Networks with many hidden layers.
In order to facilitate understanding of specific implementation processes and implementation effects of the commodity classification method, the commodity classification device, and the computer storage medium in this embodiment, the following briefly describes related technologies:
In e-commerce application scenarios, mounting commodities to the related categories has always been a pain point and a difficulty to be explored. At present, key information such as the commodity title is generally matched against related categories for mounting and classification. However, this places a high requirement on the expression of the commodity title: the accuracy of the title expression directly determines the accuracy of commodity mounting and classification, and if the accuracy and professionalism of the commodity expression are poor, subsequent identification of the commodity category and related category identification operations are affected, for example: identifying the same style, judging similar articles, and the like.
In addition, since commodities generally have image information in addition to titles, and commodity images contain rich information, with the rapid development of artificial intelligence technology it is more meaningful to express a commodity through both its images and its text and then mount the commodity to the corresponding category. Based on this, the related art provides a commodity classification method, which specifically includes: acquiring a commodity text and a commodity image, obtaining commodity classification result 1 based on the commodity text, obtaining commodity classification result 2 based on the commodity image, and then fusing commodity classification result 1 and commodity classification result 2 to obtain the commodity category. However, the above implementation identifies the commodity category by processing the commodity's image and text separately and fusing only the final results, and the algorithm models adopted are relatively simple, so the accuracy and reliability of commodity category identification cannot be ensured.
In order to solve the above technical problem, the present embodiment provides a commodity classification method, device, and computer storage medium. Referring to fig. 1, the execution subject of the commodity classification method may be a commodity classification device, and the commodity classification device may be communicatively connected with a request end to implement the commodity classification operation.
The request end may be any computing device with certain data transmission capability; specifically, the request end may be a mobile phone, a personal computer (PC), a tablet computer, a preset application program, and the like. In addition, the basic structure of the request end may include: at least one processor. The number of processors depends on the configuration and type of the request end. The request end may also include a memory, which may be volatile, such as RAM, or non-volatile, such as Read-Only Memory (ROM) or flash memory, or may include both types. The memory typically stores an Operating System (OS) and one or more application programs, and may also store program data and the like. Besides the processor and the memory, the request end also includes some basic components, such as a network card chip, an IO bus, a display component, and some peripheral devices. Optionally, the peripheral devices may include, for example, a keyboard, a mouse, a stylus, a printer, and the like. Other peripheral devices are well known in the art and will not be described in detail herein.
The commodity classification device is a device that can provide the commodity classification service in a network virtual environment, and generally refers to a device that performs information planning and commodity classification operations using the network. In physical implementation, the commodity classification device may be any device capable of providing a computing service, responding to a service request, and performing processing, for example: a cluster server, a regular server, a cloud host, a virtual center, and the like. The commodity classification device mainly comprises a processor, a hard disk, a memory, a system bus, and the like, similar to a general computer architecture.
In the above embodiment, the request end may be connected to the commodity classification device through a network, and the network connection may be wireless or wired. If the request end is communicatively connected to the commodity classification device through a mobile network, the network format of the mobile network may be any one of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMax, 5G, and 6G.
In the embodiment of the application, the request end may obtain information of a commodity to be processed, specifically, the information of the commodity to be processed may be obtained based on an execution operation input by a user, the information of the commodity may include image information and text information, the image information may include six views, a partial detail view, an enlargement effect view, and the like of the commodity, and the text information may include title information, description information, and the like of the commodity. After the request end acquires the commodity information to be processed, the commodity information to be processed can be sent to the commodity classification device, so that the commodity classification device can acquire the commodity information to be processed and analyze the commodity information to be processed.
The commodity classification device is used for acquiring information of commodities to be processed; then, feature extraction is carried out on the commodity information by utilizing the image-text fusion feature model, specifically, feature extraction operation can be carried out on image information and text information in the commodity information by utilizing the image-text fusion feature model, and image-text fusion features corresponding to the commodity information are obtained; similarly, the image feature extraction model can be used for carrying out feature extraction operation on the image information in the commodity information, so that image features can be obtained; and performing feature extraction on the text information by using a text feature extraction model to obtain text features corresponding to the commodity information. After the image-text fusion characteristics, the image characteristics and the text characteristics are obtained, the image-text fusion characteristics, the image characteristics and the text characteristics can be processed by using the deep neural network model, so that the category information corresponding to the commodity information can be stably obtained.
According to the technical scheme provided by the embodiment, the commodity information to be processed is acquired; then, a feature extraction operation is performed on the image information and the text information in the commodity information by using the image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information; similarly, a feature extraction operation can be performed on the image information in the commodity information by using the image feature extraction model to obtain image features; and feature extraction is performed on the text information by using the text feature extraction model to obtain text features corresponding to the commodity information. After the image-text fusion features, the image features and the text features are obtained, they are processed by using the deep neural network model, so that the category information corresponding to the commodity information can be stably obtained. The commodity classification operation is carried out by fusing the different features obtained from multiple models, which effectively improves the accuracy and reliability of commodity classification prediction and further improves the practicability of the method.
Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The features of the embodiments and examples described below may be combined with each other without conflict between the embodiments.
Fig. 2 is a schematic flow chart of a method for classifying commodities according to an embodiment of the present invention. Referring to fig. 2, this embodiment provides a method for classifying commodities, where the main execution subject of the method is a commodity classification device; it can be understood that the commodity classification device can be implemented as software or a combination of software and hardware. Specifically, the method for classifying commodities can include:
step S201: and acquiring commodity information to be processed, wherein the commodity information comprises image information and text information.
Step S202: and performing feature extraction on the commodity information by using the image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information.
Step S203: and performing feature extraction on the image information by using the image feature extraction model to obtain image features corresponding to the commodity information.
Step S204: and performing feature extraction on the text information by using a text feature extraction model to obtain text features corresponding to the commodity information.
Step S205: and processing the image-text fusion characteristics, the image characteristics and the text characteristics by using the deep neural network model to obtain the category information corresponding to the commodity information.
The following is a detailed description of specific implementation processes and implementation effects of the above steps:
step S201: and acquiring commodity information to be processed, wherein the commodity information comprises image information and text information.
The commodity information to be processed may refer to commodity information that needs to identify a commodity category, the commodity information may be issued commodity information or unreleased commodity information, the commodity information may include image information and text information, the image information may include six views, a detailed display diagram, an enlarged display diagram, and the like of a commodity, and the text information may include title information, description information, and the like of the commodity.
In addition, a specific implementation manner of obtaining the to-be-processed commodity information is not limited in this embodiment, and in some examples, the to-be-processed commodity information may be stored in a preset area, and the to-be-processed commodity information may be obtained by accessing the preset area. In other examples, the to-be-processed commodity information may be generated based on an execution operation input by a user, and at this time, acquiring the to-be-processed commodity information may include: the method comprises the steps of obtaining a display interface which is configured in advance in a commodity classification device and used for carrying out interactive operation with a user, obtaining execution operation input by the user in the display interface, and generating to-be-processed commodity information based on the execution operation. In still other examples, the information about the goods to be processed may be sent by the third device to the goods sorting device, and in this case, the obtaining the information about the goods to be processed may include: and acquiring third equipment in communication connection with the commodity classification device, and actively or passively acquiring information of the commodities to be processed through the third equipment.
Of course, the obtaining manner of the to-be-processed commodity information is not limited to the implementation manner illustrated above, and those skilled in the art may also obtain the to-be-processed commodity information in other manners as long as it is ensured that the commodity classification device can stably obtain the to-be-processed commodity information, which is not described herein again.
Step S202: and performing feature extraction on the commodity information by using the image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information.
In addition, in order to accurately realize the commodity classification operation, an image-text fusion feature model for performing the feature extraction operation on the commodity information is configured in advance. The image-text fusion feature model can be used for extracting image-text fusion features of the commodity information, and the obtained image-text fusion features result from performing feature extraction and fusion operations on the image information and the text information in the commodity information. Specifically, after the commodity information is acquired, the commodity information (image information and text information) may be input to the image-text fusion feature model, so that the image-text fusion features output by the image-text fusion feature model may be acquired.
In specific implementation, the image-text fusion feature model may include a self-developed super-large-scale Chinese pre-training model (Multi-Modality to Multi-Modality Multitask Mega-transformer, abbreviated as M6). Of course, the image-text fusion feature model may also be another model capable of extracting the image-text fusion features of commodity information, for example: the image-text fusion feature model may also be a fusion model obtained by performing fusion training on an image feature extraction model and a text feature extraction model, or the like.
Step S203: and performing feature extraction on the image information by using the image feature extraction model to obtain image features corresponding to the commodity information.
For the commodity information, since the commodity information includes the image information and the text information, in order to improve the accurate classification operation of the commodity information, the feature extraction operation can be performed on the image information in the commodity information by using a pre-configured image feature extraction model, so that the image feature corresponding to the commodity information can be obtained. Specifically, after the commodity information is acquired, the image information in the commodity information may be input to the image feature extraction model, so that the image feature output by the image feature extraction model may be obtained.
In a specific implementation, the image feature extraction model may include a deep Residual Network model (Resnet for short). Of course, the image feature extraction model may be another model capable of extracting image features from image information, for example: a model obtained by training a deep learning VGG model, a residual module, a pyramid pooling module, and the like.
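By way of illustration only, the image branch could look like the minimal sketch below, assuming a torchvision ResNet-50 backbone with its classification head removed so that the pooled 2048-dimensional feature is exposed; the patent does not specify the Resnet variant, input size, or feature dimension actually used.

    import torch
    from torchvision import models

    # Assumed backbone: ResNet-50 with the final classifier replaced by an
    # identity, so the network returns the pooled 2048-dimensional feature.
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()
    backbone.eval()

    @torch.no_grad()
    def extract_image_feature(images: torch.Tensor) -> torch.Tensor:
        # images: (N, 3, 224, 224), normalized as the backbone expects
        return backbone(images)  # (N, 2048)

    print(extract_image_feature(torch.randn(1, 3, 224, 224)).shape)  # [1, 2048]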
Step S204: and performing feature extraction on the text information by using a text feature extraction model to obtain text features corresponding to the commodity information.
In order to further improve the accurate classification operation of the commodity information, the feature extraction operation may be performed on the text information in the commodity information by using a pre-configured text feature extraction model, so that the text feature corresponding to the commodity information may be obtained. Specifically, after the commodity information is acquired, the text information in the commodity information may be input to the text feature extraction model, so that the text feature output by the text feature extraction model may be acquired.
In a specific implementation, the text feature extraction model may include a pre-trained language Representation model (Bert for short), and of course, the text feature extraction model may be other models capable of extracting text features of text information.
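As an illustrative sketch, a Bert-style text branch could expose the [CLS] embedding as the commodity's sentence-level text feature; the checkpoint name below (bert-base-chinese) is an assumption chosen to match the Chinese e-commerce setting, not something the patent specifies.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
    bert = AutoModel.from_pretrained("bert-base-chinese")
    bert.eval()

    @torch.no_grad()
    def extract_text_feature(title: str) -> torch.Tensor:
        inputs = tokenizer(title, return_tensors="pt", truncation=True, max_length=64)
        # Use the [CLS] token's final hidden state as the sentence-level feature.
        return bert(**inputs).last_hidden_state[:, 0]  # (1, 768)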
Step S205: and processing the image-text fusion characteristics, the image characteristics and the text characteristics by using the deep neural network model to obtain the category information corresponding to the commodity information.
After the image-text fusion feature, the image feature and the text feature are obtained, in order to accurately classify the commodity information, the image-text fusion feature, the image feature and the text feature can be processed by using the deep neural network model, so that the category information corresponding to the commodity information output by the deep neural network model can be obtained. In some examples, the processing the image-text fusion feature, the image feature and the text feature by using the deep neural network model to obtain the category information corresponding to the commodity information may include: splicing the image-text fusion characteristics, the image characteristics and the text characteristics to obtain spliced characteristics; and processing the spliced features by using the deep neural network model to obtain the category information corresponding to the commodity information.
Specifically, after the deep neural network model obtains the image-text fusion feature, the image feature and the text feature, in order to comprehensively realize the identification operation of the commodity class based on the image-text fusion feature, the image feature and the text feature, the image-text fusion feature, the image feature and the text feature may be spliced first, so that the spliced feature may be obtained, and it should be noted that an execution subject of the splicing process may be the deep neural network model or a commodity classification device. After the spliced features are obtained, the spliced features can be processed by using the deep neural network model, so that the purpose of obtaining the category information corresponding to the commodity information based on the image-text fusion features, the image features and the text features is effectively achieved, and the accuracy and the reliability of obtaining the category information are improved.
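The splicing-then-classification step can be pictured with the short PyTorch sketch below; the 1024-dimensional inputs, the hidden layer sizes and the number of categories are all assumptions for illustration, as the patent does not fix the topology of the deep neural network model.

    import torch
    import torch.nn as nn

    class FusionClassifier(nn.Module):
        def __init__(self, feat_dim: int = 1024, num_classes: int = 500):
            super().__init__()
            self.dnn = nn.Sequential(
                nn.Linear(3 * feat_dim, 1024), nn.ReLU(), nn.Dropout(0.1),
                nn.Linear(1024, 512), nn.ReLU(),
                nn.Linear(512, num_classes),
            )

        def forward(self, fused, image, text):
            spliced = torch.cat([fused, image, text], dim=-1)  # the spliced feature
            return self.dnn(spliced)  # category logits

    model = FusionClassifier()
    logits = model(torch.randn(2, 1024), torch.randn(2, 1024), torch.randn(2, 1024))
    print(logits.shape)  # torch.Size([2, 500])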
In addition, because the image-text fusion feature, the image feature and the text feature are respectively obtained based on different feature extraction models, at this time, feature dimensions corresponding to the obtained image-text fusion feature, the image feature and the text feature may be the same or different. When the feature dimensions corresponding to the obtained image-text fusion feature, the image feature and the text feature are different, in order to perform relatively accurate article identification operation on the commodity information, the image-text fusion feature, the image feature and the text feature need to be aligned. At this time, before processing the image-text fusion feature, the image feature, and the text feature by using the deep neural network model to obtain the category information corresponding to the commodity information, the method in this embodiment may further include: identifying whether the dimensionality of the image-text fusion characteristic, the dimensionality of the image characteristic and the dimensionality of the text characteristic are consistent or not; and when the dimensionality of the image-text fusion feature, the dimensionality of the image feature and the dimensionality of the text feature are not consistent, adjusting the dimensionality of the image-text fusion feature, the dimensionality of the image feature and the dimensionality of the text feature to be consistent.
Specifically, after the image-text fusion feature, the image feature and the text feature are obtained, the dimension of the image-text fusion feature, the dimension of the image feature and the dimension of the text feature may be obtained first, and then the dimension of the image-text fusion feature, the dimension of the image feature and the dimension of the text feature are analyzed and compared to identify whether the dimension of the image-text fusion feature, the dimension of the image feature and the dimension of the text feature are consistent. When the dimensionality of the image-text fusion feature and the dimensionality of the image feature are the same as the dimensionality of the text feature, the image-text fusion feature dimensionality and the image feature dimensionality can be determined to be consistent with the dimensionality of the text feature; when the dimensionality of the image-text fusion feature and the dimensionality of the image feature are different from the dimensionality of the text feature, the fact that the dimensionality of the image-text fusion feature and the dimensionality of the image feature are inconsistent with the dimensionality of the text feature can be determined.
When the dimensionality of the image-text fusion feature, the dimensionality of the image feature and the dimensionality of the text feature are not consistent, in order to improve the accuracy and reliability of commodity class identification, the dimensionality of the image-text fusion feature, the dimensionality of the image feature and the dimensionality of the text feature can be adjusted to be consistent. For example, when the image-text fusion feature is a 1024-dimensional feature, the image feature is a 512-dimensional feature, and the text feature is a 768-dimensional feature, since the image-text fusion feature is 1024-dimensional and is the highest dimension, the dimension of the image feature and the dimension of the text feature can be adjusted to 1024-dimensional, thereby effectively realizing that the dimension of the image-text fusion feature, the dimension of the image feature, and the dimension of the text feature are adjusted to be consistent.
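One plausible way to carry out this alignment, matching the 1024/512/768 example above, is to project the lower-dimensional features up to the highest dimension with learned linear layers; the patent does not state how the adjustment is performed, so the sketch below is only one option (zero-padding to a common size would work as well).

    import torch
    import torch.nn as nn

    proj_image = nn.Linear(512, 1024)   # image feature: 512 -> 1024
    proj_text = nn.Linear(768, 1024)    # text feature: 768 -> 1024

    fused = torch.randn(1, 1024)        # already the highest dimension
    image = proj_image(torch.randn(1, 512))
    text = proj_text(torch.randn(1, 768))
    assert fused.shape == image.shape == text.shape  # all (1, 1024)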
According to the commodity classification method provided by the embodiment, to-be-processed commodity information is acquired; then, feature extraction operation is carried out on the image information and the text information in the commodity information by using the image-text fusion feature model, and image-text fusion features corresponding to the commodity information are obtained; similarly, the image feature extraction model can be used for carrying out feature extraction operation on the image information in the commodity information to obtain image features; and performing feature extraction on the text information by using a text feature extraction model to obtain text features corresponding to the commodity information. After the image-text fusion characteristics, the image characteristics and the text characteristics are obtained, the image-text fusion characteristics, the image characteristics and the text characteristics are processed by using the deep neural network model, so that the commodity classification information corresponding to the commodity information can be stably obtained, the different characteristics obtained by fusing various models are effectively fused to carry out commodity classification operation, the accuracy and the reliability of identifying the commodity categories are effectively improved, and the practicability of the method is further improved.
Fig. 3 is a schematic flow chart illustrating a process of extracting features of commodity information by using an image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information according to an embodiment of the present invention; fig. 4 is a schematic diagram illustrating the principle of extracting features of commodity information by using an image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information according to an embodiment of the present invention. On the basis of the foregoing embodiment, with continued reference to fig. 3 to 4, this embodiment provides an implementation manner for performing feature extraction on commodity information by using an image-text fusion feature model. Specifically, in this embodiment, performing feature extraction on the commodity information by using the image-text fusion feature model to obtain the image-text fusion features corresponding to the commodity information may include:
step S301: and performing segmentation processing on the image information to obtain a plurality of sub-images corresponding to the image information.
In order to improve the accuracy and reliability of feature extraction on the product information, the image information included in the product information may be subjected to a segmentation operation, so that a plurality of sub-images corresponding to the image information may be obtained. In some examples, the slicing process may include: the method includes the steps of obtaining a segmentation parameter for performing segmentation processing on image information, specifically, the segmentation parameter may include the number of sub-images after segmentation or the image size of the sub-images, and then obtaining a plurality of sub-images corresponding to the image information based on the segmentation parameter.
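A toy version of this segmentation, assuming the segmentation parameter is a grid size (a 2 x 2 grid yields the four sub-images used in the electric cooker example later on), might look as follows.

    import torch

    def split_into_subimages(image: torch.Tensor, grid: int = 2) -> list:
        # image: (C, H, W); returns grid*grid equally sized sub-images,
        # ordered left-to-right, top-to-bottom.
        c, h, w = image.shape
        sh, sw = h // grid, w // grid
        return [image[:, i * sh:(i + 1) * sh, j * sw:(j + 1) * sw]
                for i in range(grid) for j in range(grid)]

    subs = split_into_subimages(torch.randn(3, 224, 224))
    print(len(subs), subs[0].shape)  # 4 torch.Size([3, 112, 112])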
Step S302: the image position corresponding to each of the plurality of sub-images is determined.
After acquiring the plurality of sub-images, in order to accurately process the plurality of sub-images, image positions corresponding to the plurality of sub-images may be determined. In some examples, determining the image position for each of the plurality of sub-images may include: acquiring a preset segmentation order (left-to-right order, top-to-bottom order, and the like) of a plurality of sub-images; image positions corresponding to the sub-images are determined based on a preset slicing sequence. In other examples, after obtaining the plurality of sub-images, respective image locations of the plurality of sub-images may be determined based on image subjects to which the plurality of sub-images correspond, and the image locations may be determined based on positional information of the sub-images relative to the image subjects.
Step S303: and performing word segmentation processing on the text information to obtain a plurality of word segmentation sequences corresponding to the text information.
In order to improve the accuracy and reliability of feature extraction on the product information, the text information included in the product information may be subjected to word segmentation processing operation, so that a plurality of word segmentation sequences corresponding to the text information may be obtained. In some examples, performing word segmentation processing on the text information to obtain a plurality of word segmentation sequences corresponding to the text information may include: semantic information corresponding to the text information is acquired, and word segmentation processing is performed on the text information based on the semantic information, so that a plurality of word segmentation sequences corresponding to the text information can be acquired.
Step S304: and determining word cutting positions corresponding to the word cutting sequences respectively.
After the multiple word segmentation sequences are obtained, the multiple word segmentation sequences may be analyzed to determine word segmentation positions corresponding to the multiple word segmentation sequences. In some examples, determining the respective word segmentation position for each of the plurality of word segmentation sequences may include: the character sequences of the word cutting sequences in the text information are obtained, and the word cutting positions of the word cutting sequences are determined based on the character sequences of the word cutting sequences in the text information, so that the accuracy and reliability of determining the word cutting positions are effectively guaranteed. In other examples, determining the word segmentation position corresponding to each of the plurality of word segmentation sequences may include: obtaining word segmentation semantics corresponding to the word segmentation sequences respectively; and determining word cutting positions corresponding to the plurality of word cutting sequences respectively based on the word cutting semantics corresponding to all the word cutting sequences.
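The character-order variant described above can be sketched as follows; the whitespace split stands in for a real Chinese word segmenter, which the patent does not name.

    def cut_words_with_positions(text: str) -> list:
        # Record each word segment together with its character position in
        # the original text, i.e. the "word cutting position".
        positions, offset = [], 0
        for word in text.split():
            idx = text.index(word, offset)
            positions.append((word, idx))
            offset = idx + len(word)
        return positions

    print(cut_words_with_positions("electric rice cooker 4L household"))
    # [('electric', 0), ('rice', 9), ('cooker', 14), ('4L', 21), ('household', 24)]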
Step S305: and performing feature extraction on the plurality of sub-images, the image positions, the plurality of word cutting sequences and the word cutting positions by using the image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information.
After the plurality of sub-images, the image positions, the plurality of word cutting sequences and the word cutting positions are obtained, feature extraction operation can be performed on the plurality of sub-images, the image positions, the plurality of word cutting sequences and the word cutting positions by using the image-text fusion feature model, so that image-text fusion features corresponding to commodity information can be obtained.
In other examples, in this embodiment, performing feature extraction on the plurality of sub-images, the image positions, the plurality of word cutting sequences, and the word cutting positions by using the image-text fusion feature model to obtain the image-text fusion features corresponding to the commodity information may include: masking part of the word cutting sequences to obtain partial word cutting sequences; and performing feature extraction on the plurality of sub-images, the image positions, the plurality of word cutting sequences, the partial word cutting sequences, and the word cutting positions by using the image-text fusion feature model to obtain the image-text fusion features corresponding to the commodity information.
Specifically, in order to further improve the accuracy and reliability of recognition of the image-text fusion features and the generalization of the image-text fusion feature model, after the word segmentation sequences are obtained, the parts of the word segmentation sequences can be masked, so that partial word segmentation sequences can be obtained, and it can be understood that the semantics corresponding to the partial word segmentation sequences are part of the semantics corresponding to all the word segmentation sequences.
In order to ensure the accuracy and reliability of extracting the image-text fusion characteristics, the image-text fusion characteristic model can be used for extracting the characteristics of a plurality of sub-images, image positions, a plurality of word cutting sequences, partial word cutting sequences and word cutting positions, so that the image-text fusion characteristics corresponding to commodity information can be obtained, and the accuracy and reliability of identifying the image-text fusion characteristics can be improved.
For example, referring to fig. 4, taking an M6 model as an example of the image-text fusion feature model, when the image information in the commodity information is an electric cooker image, after the electric cooker image is acquired, the electric cooker image may be divided into 4 sub-images, and the image positions corresponding to the 4 sub-images may be acquired; after the text information in the commodity information is acquired, word segmentation processing may be performed on the text information, so that a plurality of word cutting sequences and the corresponding word cutting positions may be obtained, for example: the plurality of word cutting sequences may include a word cutting sequence A, a word cutting sequence B, and a word cutting sequence C. In addition, after obtaining the plurality of word cutting sequences, the plurality of word cutting sequences may be subjected to a masking process, so that a masked partial word cutting sequence may be obtained; for example, the partial word cutting sequence may include the word cutting sequence A.
After the plurality of sub-images, the plurality of word cutting sequences, the partial word cutting sequences, the image positions and the word cutting positions are obtained, the plurality of sub-images, the plurality of word cutting sequences, the partial word cutting sequences, the image positions and the word cutting positions can be input into the image-text fusion feature model, and specifically, feature extraction operation is carried out on the plurality of sub-images, the plurality of word cutting sequences, the partial word cutting sequences, the image positions and the word cutting positions by utilizing a coding module, a multi-head attention network layer, a normalization network layer and a feedforward network layer in the image-text fusion feature model, so that image-text fusion features corresponding to commodity information can be obtained, and the accuracy and reliability of determining the image-text fusion features are effectively guaranteed.
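A heavily simplified stand-in for this fusion step is sketched below: flattened sub-images and word tokens (some replaced by a mask embedding) share one transformer encoder, with a single position embedding covering both image positions and word cutting positions. Every size, the pooling choice, and the masking scheme are illustrative assumptions; the real M6 model is far larger and more elaborate.

    import torch
    import torch.nn as nn

    class FusionSketch(nn.Module):
        def __init__(self, d=256, vocab=21128, n_patches=4, n_tokens=32):
            super().__init__()
            self.patch_embed = nn.Linear(3 * 112 * 112, d)   # flatten each sub-image
            self.token_embed = nn.Embedding(vocab, d)
            self.pos_embed = nn.Embedding(n_patches + n_tokens, d)
            self.mask_token = nn.Parameter(torch.zeros(1, 1, d))
            layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)

        def forward(self, sub_images, token_ids, mask):
            # sub_images: (N, 4, 3, 112, 112); token_ids: (N, T); mask: (N, T) bool
            img = self.patch_embed(sub_images.flatten(2))    # (N, 4, d)
            txt = self.token_embed(token_ids)                # (N, T, d)
            txt = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(txt), txt)
            seq = torch.cat([img, txt], dim=1)
            seq = seq + self.pos_embed(torch.arange(seq.size(1)))  # positions
            return self.encoder(seq).mean(dim=1)             # pooled fusion feature

    model = FusionSketch()
    out = model(torch.randn(1, 4, 3, 112, 112),
                torch.randint(0, 21128, (1, 32)),
                torch.rand(1, 32) < 0.15)
    print(out.shape)  # torch.Size([1, 256])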
In the embodiment, the image information is segmented to obtain a plurality of sub-images corresponding to the image information, image positions corresponding to the sub-images are determined, then word segmentation processing is performed on the text information to obtain a plurality of word segmentation sequences corresponding to the text information, word segmentation positions corresponding to the word segmentation sequences are determined, and feature extraction is performed on the sub-images, the image positions, the word segmentation sequences and the word segmentation positions by using the image-text fusion feature model, so that the stability and reliability of obtaining image-text fusion features corresponding to the commodity information are effectively ensured, and the accuracy of commodity class identification is further improved.
Fig. 5 is a schematic flow chart of another method for classifying commodities according to an embodiment of the present invention; on the basis of the above embodiment, referring to fig. 5, before processing the image-text fusion feature, the image feature, and the text feature by using the deep neural network model to obtain the category information corresponding to the commodity information, the method in this embodiment may further include:
step S501: and determining a feature central point corresponding to the commodity information according to the image-text fusion feature, the image feature and the text feature.
The image-text fusion feature, the image feature and the text feature are features which are obtained based on different feature extraction models and are used for identifying commodity information, feature extraction emphasis points corresponding to the image-text fusion feature, the image feature and the text feature are different, and in order to balance emphasis points corresponding to the image-text fusion feature, the image feature and the text feature, after the image-text fusion feature, the image feature and the text feature are obtained, the image-text fusion feature, the image feature and the text feature can be analyzed to determine a feature central point corresponding to the commodity information.
Specifically, the embodiment does not limit the specific implementation manner of determining the feature center point corresponding to the commodity information according to the image-text fusion feature, the image feature and the text feature, in some examples, a machine learning model for determining the feature center point of the commodity information is trained in advance, and after the image-text fusion feature, the image feature and the text feature are obtained, the image-text fusion feature, the image feature and the text feature can be input to the machine learning model, so that the feature center point output by the machine learning model can be obtained.
In other examples, determining the feature center point corresponding to the commodity information according to the image-text fusion feature, the image feature and the text feature may include: determining the feature average value of the image-text fusion feature, the image feature and the text feature as the feature center point corresponding to the commodity information. For example, when the image-text fusion feature fea1 is (x1, y1, z1), the image feature fea2 is (x2, y2, z2), and the text feature fea3 is (x3, y3, z3), the feature center point is center = (fea1 + fea2 + fea3)/3 = ((x1 + x2 + x3)/3, (y1 + y2 + y3)/3, (z1 + z2 + z3)/3). For example, if fea1 is (3, 1, 5), fea2 is (1, 1, 3), and fea3 is (2, 1, 1), taking the average value of fea1, fea2 and fea3 gives the feature center point (2, 1, 3). The accuracy and reliability of determining the feature center point are thus effectively ensured.
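The worked example can be checked with a couple of lines; the element-wise mean is all that is involved.

    import torch

    fea1 = torch.tensor([3., 1., 5.])  # image-text fusion feature
    fea2 = torch.tensor([1., 1., 3.])  # image feature
    fea3 = torch.tensor([2., 1., 1.])  # text feature
    center = (fea1 + fea2 + fea3) / 3
    print(center)  # tensor([2., 1., 3.])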
Step S502: and determining target image-text characteristics, target image characteristics and target text characteristics corresponding to the commodity information based on the image-text fusion characteristics, the image characteristics, the text characteristics and the characteristic central points.
After the image-text fusion feature, the image feature, the text feature and the feature center point are obtained, the image-text fusion feature, the image feature, the text feature and the feature center point may be analyzed to determine a target image-text feature, a target image feature and a target text feature corresponding to the commodity information, where the target image-text feature may be obtained based on analyzing the image-text fusion feature and the feature center point, the target image feature may be obtained based on analyzing the image feature and the feature center point, and the target text feature may be obtained based on analyzing the text feature and the feature center point.
In some examples, determining the target teletext feature, the target image feature, and the target text feature corresponding to the merchandise information based on the teletext feature, the image feature, the text feature, and the feature center point may include: acquiring a first distance between the image-text fusion feature and the feature central point, a second distance between the image feature and the feature central point, and a third distance between the text feature and the feature central point; respectively determining a first probability, a second probability and a third probability corresponding to the image-text fusion feature, the image feature and the text feature based on the first distance, the second distance and the third distance; and determining a target image-text characteristic, a target image characteristic and a target text characteristic corresponding to the commodity information based on the first probability, the second probability, the third probability, the image-text fusion characteristic, the image characteristic and the text characteristic.
Specifically, after the image-text fusion feature, the image feature, the text feature and the feature center point are obtained, a first distance between the image-text fusion feature and the feature center point, a second distance between the image feature and the feature center point, and a third distance between the text feature and the feature center point may be respectively obtained, and after the first distance, the second distance and the third distance are obtained, a first probability, a second probability and a third probability corresponding to the image-text fusion feature, the image feature and the text feature may be respectively determined. In a specific implementation, the first probability is related to the first distance, the second probability is related to the second distance, and the third probability is related to the third distance, in some examples, the first probability is negatively related to the first distance, i.e., the larger the first distance, the lower the first probability, the smaller the first distance, and the higher the first probability; similarly, the second probability is negatively correlated with the second distance, i.e., the larger the second distance, the lower the second probability, and the smaller the second distance, the higher the second probability; the third probability is inversely related to the third distance, i.e. the larger the third distance, the lower the third probability, the smaller the third distance, the higher the third probability.
After the first probability, the second probability, the third probability, the image-text fusion feature, the image feature and the text feature are obtained, they may be analyzed to determine the target image-text feature, the target image feature and the target text feature corresponding to the commodity information. In some examples, a machine learning model for determining these target features is trained in advance; the first probability, the second probability, the third probability, the image-text fusion feature, the image feature and the text feature are input into the machine learning model, which outputs the target image-text feature, the target image feature and the target text feature corresponding to the commodity information.
In other examples, determining the target image-text feature, the target image feature and the target text feature corresponding to the commodity information based on the first probability, the second probability, the third probability, the image-text fusion feature, the image feature and the text feature may include: determining the product of the first probability and the image-text fusion feature as the target image-text feature; determining the product of the second probability and the image feature as the target image feature; and determining the product of the third probability and the text feature as the target text feature.
For example, when the first probability is P1 and the image-text fusion feature is fea1, P1 × fea1 may be determined as the target image-text feature; when the second probability is P2 and the image feature is fea2, P2 × fea2 may be determined as the target image feature; and when the third probability is P3 and the text feature is fea3, P3 × fea3 may be determined as the target text feature, so that the accuracy and reliability of determining the target image-text feature, the target image feature and the target text feature are effectively ensured.
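As a minimal sketch of this weighting step (the function name is illustrative, and cosine similarity to the feature central point is assumed as the distance-derived score, consistent with the worked example later in this document; the embodiment only requires each probability to be negatively correlated with its distance):

import numpy as np

def weight_features(fusion_fea, image_fea, text_fea):
    # Feature central point: the average of the three features, as described earlier.
    center = (fusion_fea + image_fea + text_fea) / 3.0

    def score(fea):
        # Cosine similarity rises as the angular distance to the center falls,
        # so it is negatively correlated with distance, as the text requires.
        return float(fea @ center / (np.linalg.norm(fea) * np.linalg.norm(center)))

    scores = np.array([score(fusion_fea), score(image_fea), score(text_fea)])
    probs = scores / scores.sum()  # first, second and third probabilities

    # Each target feature is the product of its probability and the feature.
    return probs[0] * fusion_fea, probs[1] * image_fea, probs[2] * text_fea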
In this embodiment, the feature central point corresponding to the commodity information is determined from the image-text fusion feature, the image feature and the text feature, and the target image-text feature, the target image feature and the target text feature corresponding to the commodity information are then determined based on the image-text fusion feature, the image feature, the text feature and the feature central point. This effectively ensures the accuracy and reliability of determining the target features, facilitates the commodity classification and identification operation performed by the deep learning network model based on the target image-text feature, the target image feature and the target text feature, and improves the accuracy of commodity category identification.
Fig. 6 is a schematic flow chart illustrating another method for classifying commodities according to an embodiment of the present invention; on the basis of the foregoing embodiment, referring to fig. 6, the present embodiment provides an implementation manner for configuring a deep neural network model, and specifically, before processing image-text fusion features, image features, and text features by using the deep neural network model to obtain category information corresponding to commodity information, the method in the present embodiment may further include:
step S601: at least two network layers for defining a deep neural network model are obtained.
Before the image-text fusion feature, the image feature and the text feature are processed by the deep neural network model to obtain the category information corresponding to the commodity information, a user may configure the deep neural network model according to requirements. At this time, at least two network layers for defining the deep neural network model may be acquired; in general, the deep neural network model may include two or three network layers.
Step S602: and determining the number of hidden nodes included in each network layer, wherein the number of hidden nodes included in the next network layer is less than the number of hidden nodes included in the previous network layer.
After obtaining at least two network layers for defining the deep neural network model, the number of hidden nodes included in each network layer may be determined, and in some examples, the number of hidden nodes included in each network layer may be a preset default number, where the number of hidden nodes included in a subsequent network layer is less than the number of hidden nodes included in a previous network layer, for example: the number of hidden nodes of the first network layer is 8192, and the number of hidden nodes of the second network layer is 4096.
Step S603: and determining a deep neural network model for processing the image-text fusion feature, the image feature and the text feature based on the number of hidden nodes included in each network layer.
After the number of hidden nodes included in each network layer is obtained, the deep neural network model for processing the image-text fusion feature, the image feature and the text feature is determined based on these numbers, so that the deep neural network model can perform the feature extraction operation on the image-text fusion feature, the image feature and the text feature and output the commodity category.
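As a minimal sketch of steps S601 to S603, assuming PyTorch and the default hidden-node counts named above (the number of categories is an illustrative assumption):

import torch.nn as nn

def build_dnn(in_dim=3840, hidden=(8192, 4096), num_classes=1000):
    # Each subsequent layer has fewer hidden nodes than the previous one.
    layers, prev = [], in_dim
    for n in hidden:
        layers += [nn.Linear(prev, n), nn.ReLU()]
        prev = n
    layers.append(nn.Linear(prev, num_classes))  # commodity category output
    return nn.Sequential(*layers)

dnn = build_dnn()  # 3840 -> 8192 -> 4096 -> 1000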
In this embodiment, the number of hidden nodes included in each network layer is determined by obtaining at least two network layers for defining a deep neural network model, and then the deep neural network model for processing the image-text fusion feature, the image feature and the text feature is determined based on the number of hidden nodes included in each network layer, so that the deep neural network model is effectively configured, and then the deep neural network model is conveniently used for identifying the commodity class, thereby further improving the practicability of the method.
Fig. 7 is a schematic flowchart of another method for classifying commodities according to an embodiment of the present invention. On the basis of the foregoing embodiment, referring to fig. 7, this embodiment provides an implementation manner for defining the nodes in the deep neural network model, so as to improve the accuracy and reliability of the feature extraction operations performed by the deep neural network model. After the at least two network layers for defining the deep neural network model are acquired, the method in this embodiment may further include:
Step S701: acquiring the connection probability between the nodes of the network layer and the nodes in the adjacent network layer.
Step S702: and determining target nodes used for connecting with the nodes in the network layer in the adjacent network layer based on the connection probability, wherein the target nodes at least comprise part of nodes in the adjacent network layer.
Each network layer included in the deep neural network model may include a plurality of nodes for implementing the feature extraction operation, and each node in a network layer may be communicatively connected with all nodes in the adjacent network layer. In order to improve the quality and efficiency of the feature extraction operation, each node in the network layer may instead be communicatively connected with only part of the nodes in the adjacent network layer. To this end, a connection probability between the nodes of the network layer and the nodes in the adjacent network layer may be acquired; the connection probability is used to define the number of nodes in the adjacent layer that are connected to a node in the upper layer. The connection probability may be a default value configured in advance, or may be configured based on an execution operation input by a user; of course, the user may configure different connection probabilities according to different requirements.
In addition, the connection probabilities corresponding to nodes in different network layers may be different. For example, the connection probability P1 corresponding to the nodes in the first network layer is 0.7, where P1 limits the number of nodes in the next layer that are communicatively connected with the current node to 70% of the number of nodes in the next network layer; the connection probability P2 corresponding to the nodes in the second network layer is 0.3, where P2 limits the number of nodes in the next layer that are communicatively connected with the current node to 30% of the number of nodes in the next network layer.
Since the connection probability is used to define the number of target nodes in the adjacent network layer for connecting with nodes in the network layer, after the connection probability is obtained, the target nodes in the adjacent network layer for connecting with nodes in the network layer may be determined based on the connection probability, and the target nodes include at least a part of nodes in the adjacent network layer.
In some examples, because the connection probability defines the number of target nodes in the adjacent network layer for connecting with the nodes in the network layer, after the connection probability is obtained, that number of target nodes can be determined directly based on the connection probability, and the target nodes themselves can then be selected based on this number, so that the accuracy and reliability of determining the target nodes are effectively ensured.
In other examples, determining a target node in a neighboring network layer for connecting to a node in the network layer based on the connection probability may include: acquiring a plurality of node sets in adjacent network layers, wherein each node set comprises at least two nodes; and determining target nodes used for being connected with the nodes in the network layer in the adjacent network layer based on the connection probability and the plurality of node sets, wherein any two target nodes have different node sets.
Specifically, the plurality of nodes included in a network layer of the deep neural network model may be divided into a plurality of node sets, and each node set may include two or more nodes. In order to improve the flexibility and reliability of determining the target nodes in the adjacent network layer for connecting with the nodes in the network layer, the plurality of node sets in the adjacent network layer may first be acquired, and each node set may be provided with a corresponding set identifier. The target nodes may then be determined based on the connection probability and the plurality of node sets; specifically, target nodes may be selected within each node set based on the connection probability, so that any two determined target nodes correspond to different node sets, which ensures the uniformity of target node determination.
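A schematic sketch of this set-based selection, assuming node sets of two nodes each and at most one target node drawn per set (the partitioning, the rounding and the random choice are all illustrative assumptions):

import numpy as np

def select_target_nodes(adjacent_nodes, connection_prob, set_size=2, seed=None):
    rng = np.random.default_rng(seed)
    n_targets = int(round(connection_prob * len(adjacent_nodes)))
    # Partition the adjacent layer into node sets; each set gets a set identifier.
    node_sets = [adjacent_nodes[i:i + set_size]
                 for i in range(0, len(adjacent_nodes), set_size)]
    targets = []
    for node_set in node_sets:
        if len(targets) >= n_targets:
            break
        # Draw at most one node per set, so any two targets come from different sets.
        targets.append(rng.choice(node_set))
    return targets

targets = select_target_nodes(list(range(16)), connection_prob=0.3)  # 5 of 16 nodes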
In this embodiment, the connection probability between the nodes of the network layer and the nodes in the adjacent network layer is acquired, and the target nodes in the adjacent network layer for connecting with the nodes in the network layer are then determined based on the connection probability, so that the configuration operation of the internal framework of the deep neural network model is effectively realized, the deep neural network model can accurately identify the commodity category, different requirements of different users are met, and the flexibility and reliability of the method are further improved.
In a specific application, taking M6 as the image-text fusion feature model, the Bert text model as the text feature extraction model, and the Resnet image model as the image feature extraction model, this application embodiment provides a commodity classification method. The technical scheme is derived from practical application scenes in the e-commerce field, such as searching for and comparing prices of commodities of the same type, and addresses the insufficient commodity recall in same-style identification caused by commodities being hung under wrong or disordered categories in the prior art, so it has practical application value. In this technical scheme, the image-text fusion feature, the text feature and the image feature of the commodity information are extracted by using an M6 model, a Bert text model and a Resnet image model respectively, and the three features are fused by using a DNN model, so that the identification effect of commodity classification is improved and the category to which the commodity information belongs can finally be determined. Specifically, referring to fig. 8, the commodity classification method includes the following steps:
Step 1: acquiring commodity image-text data to be processed, wherein the commodity image-text data comprises image information and text information.
Step 2: inputting the commodity image-text data into the M6 model, which processes the commodity image-text data to obtain the image-text fusion feature M6 output by the M6 model, where the image-text fusion feature M6 is obtained by fusing an image feature and a text feature.
Before the M6 model is used to process the commodity image-text data, a model training operation may be performed to obtain the M6 model. Specifically, a commodity category label set may first be acquired, the text information and image information corresponding to the commodity category label set are input into the pre-trained model M6, and the M6 model is then trained by fine-tuning (Finetune), so that a trained M6 model capable of extracting image-text fusion features is obtained.
It should be noted that the M6 model can not only implement the feature extraction operation on the commodity image-text data, but can also implement a preliminary category identification operation on the commodity image-text data to obtain preliminary category information; in this embodiment, however, no further data processing needs to be performed on the basis of the preliminary category information obtained by the M6 model.
Specifically, the processing of the commodity image-text data by the M6 model may include the following. The image information is cut to obtain a plurality of image blocks, and the image positions corresponding to the image blocks are determined. The text information is segmented into words to obtain a plurality of text sequences and is input in two parts: one part is the coding information corresponding to the segmented text sequences, and the other part is partial covering (masking) information corresponding to the segmented text sequences. The text positions corresponding to the text sequences are then determined, and the image blocks, image positions, text sequences and text positions are processed by the M6 model. Specifically, a coding module in the M6 model performs the coding operation on the image blocks, image positions, text sequences and text positions, and the coded information is then sent together into a multi-layer Transformer structure for the model training operation; the output part may be the category labels corresponding to the commodity data. In this embodiment, the output layer of the M6 model outputs the feature Fea1, whose total dimension may be 1024 dimensions, 512 dimensions or another dimension, so that the image-text fusion feature M6 is obtained. It should be noted that masking part of the text can effectively improve the robustness of the M6 model, which is beneficial to improving its recognition effect.
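A schematic sketch of this input preparation on a numpy image array (the patch size, mask rate and mask token id are illustrative assumptions; the actual M6 tokenizer and encoder are not reproduced here):

import numpy as np

def prepare_inputs(image, token_ids, patch=32, mask_rate=0.15, seed=None):
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    blocks, image_positions = [], []
    for i in range(0, h - h % patch, patch):
        for j in range(0, w - w % patch, patch):
            blocks.append(image[i:i + patch, j:j + patch])    # image block
            image_positions.append((i // patch, j // patch))  # its position
    text_positions = list(range(len(token_ids)))
    masked = [0 if rng.random() < mask_rate else t for t in token_ids]  # 0 = [MASK], assumed
    # Both the coding information (token_ids) and the partial covering
    # information (masked) are fed to the model, together with the positions.
    return blocks, image_positions, token_ids, masked, text_positions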
Step 3: inputting the commodity image-text data into the Bert model, which processes the text information in the commodity image-text data to obtain the text Bert feature output by the Bert model.
Before the feature extraction operation is performed with the Bert model, a training operation of the Bert model may be performed. Specifically, the commodity category label set may first be acquired, the commodity text is input into the pre-trained Bert model, and the model is trained by fine-tuning (Finetune), so that a trained Bert model capable of extracting text features is obtained.
Specifically, as shown in fig. 9, in the related structure of the Bert model, the text information in the commodity data is fully segmented into words to obtain coding information, which is input into a multi-layer Transformer structure for model training. The output part is the category labels corresponding to the Bert model; in the present invention, the output of the output layer of the Bert model is the feature Fea2, whose dimension may be 768 or 384.
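A sketch of extracting a 768-dimensional text feature Fea2 with a pretrained Bert model via the transformers library (the checkpoint name is an assumption, and the category fine-tuning described above is omitted):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def extract_text_feature(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.pooler_output.squeeze(0)  # Fea2, shape (768,)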
Step 4: inputting the commodity image-text data into the Resnet image model, which processes the image information in the commodity image-text data to obtain the image Resnet feature output by the Resnet image model.
Before the Resnet image model is used to process the image information in the commodity image-text data, a training operation of the Resnet image model may be performed. Specifically, the commodity category label set may first be acquired, the commodity image data is input into the pre-trained Resnet model, and the model is trained by fine-tuning (Finetune), so that a trained Resnet image model capable of extracting image features is obtained.
Specifically, as shown in fig. 10, in the related structure of the Resnet image model, the image information in the commodity data is fully cut to obtain coded information, which is input into N convolutional layers, so that the feature Fea3 output after the convolutional layers is obtained; its dimension may be 2048, 1024, 512 or the like.
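A sketch of extracting a 2048-dimensional image feature Fea3 from the penultimate layer of a pretrained Resnet50, assuming a recent torchvision (the preprocessing values are the standard ImageNet ones; the category fine-tuning described above is again omitted):

import torch
from PIL import Image
from torchvision import models, transforms

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose the 2048-d feature before the classifier
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_image_feature(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(img).squeeze(0)  # Fea3, shape (2048,)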
In other examples, when the feature extraction operation is performed, feature extraction models may be quickly added or removed: additional feature extraction models may be attached at any time, based on the design requirements of the user, to extract further feature information from the commodity image-text data, so that any number of pieces of feature information can be obtained; likewise, a feature extraction model used for obtaining feature information of the commodity image-text data may be removed at any time according to the design requirements of the user, so that different design requirements of different users can be met.
Step 5: using the deep neural network model DNN to splice the features Fea1, Fea2 and Fea3 respectively extracted in Step 2, Step 3 and Step 4 by the M6 model, the Bert model and the Resnet model, so as to obtain the spliced feature [Fea1, Fea2, Fea3] with a total feature length of 3840 dimensions (1024 + 768 + 2048); the category information of the commodity image-text data can then be determined based on the spliced feature.
The deep neural network model DNN may be a 2-layer DNN network, and the numbers of hidden nodes in the two network layers may be 8192 and 4096. In general, the number of hidden nodes in a subsequent network layer is less than the number of hidden nodes in the previous network layer, typically one half or one quarter of it, and the number of hidden nodes in each network layer is preferably a power of 2. In addition, the activation function of the network layers and of the commodity category output can be the classic Relu function.
In addition, in the process of splicing and fusing the features Fea1, Fea2 and Fea3 with the DNN network, the DNN may adopt a Dropout operation, i.e., the probabilities P2 and P1 in fig. 11, whose values define what proportion of nodes are selected for connection. In general, P1 may be 70% and P2 may be 30%; alternatively, P1 may be 30% and P2 may be 70%. Finally, retraining is performed according to the category labeling data of the commodities, and the model parameters W of the DNN network model are solved with a gradient descent method, so that the generalization of the DNN network model can be improved and the DNN network model can learn better.
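A minimal sketch of Step 5 in PyTorch (the connection probabilities P1 and P2 above are read here as keep probabilities, so nn.Dropout receives 1 − P; the number of categories is an illustrative assumption):

import torch
import torch.nn as nn

class FusionDNN(nn.Module):
    def __init__(self, num_classes=1000, p1=0.7, p2=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3840, 8192), nn.ReLU(), nn.Dropout(1 - p1),  # keep 70% of nodes
            nn.Linear(8192, 4096), nn.ReLU(), nn.Dropout(1 - p2),  # keep 30% of nodes
            nn.Linear(4096, num_classes),  # commodity category logits
        )

    def forward(self, fea1, fea2, fea3):
        # Splice [Fea1, Fea2, Fea3]: 1024 + 768 + 2048 = 3840 dimensions.
        return self.net(torch.cat([fea1, fea2, fea3], dim=-1))

model = FusionDNN()  # parameters W are then fit by gradient descent on category labels
logits = model(torch.randn(4, 1024), torch.randn(4, 768), torch.randn(4, 2048))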
In some other examples, in the process of processing the spliced feature with the deep neural network model DNN to determine the category information of the commodity image-text data, other technologies may further be fused into the commodity category identification operation; for example, the DNN network model may be combined with attention mechanisms and Transformer structures, so that the category information of the commodity image-text data can be obtained.
In still other examples, taking M6_Fea1 = (3, 1, 5), Bert_Fea2 = (1, 1, 3) and Resnet_Fea3 = (2, 1, 1) as examples, after obtaining M6_Fea1, Bert_Fea2 and Resnet_Fea3, the feature center point may be determined: Center = (M6_Fea1 + Bert_Fea2 + Resnet_Fea3)/3 = (2, 1, 3). The vector norms are then calculated:
|M6_Fea1| = sqrt(3^2 + 1^2 + 5^2) = sqrt(35);
|Bert_Fea2| = sqrt(1^2 + 1^2 + 3^2) = sqrt(11);
|Resnet_Fea3| = sqrt(2^2 + 1^2 + 1^2) = sqrt(6);
|Center| = sqrt(2^2 + 1^2 + 3^2) = sqrt(14).
The cosine similarity of each feature to the center point is then computed:
M6_sim_score = M6_Fea1·Center/(|M6_Fea1|·|Center|) = (3*2 + 1*1 + 5*3)/(sqrt(35)*sqrt(14)) = 0.99;
Bert_sim_score = Bert_Fea2·Center/(|Bert_Fea2|·|Center|) = (1*2 + 1*1 + 3*3)/(sqrt(11)*sqrt(14)) = 0.967;
Resnet_sim_score = Resnet_Fea3·Center/(|Resnet_Fea3|·|Center|) = (2*2 + 1*1 + 1*3)/(sqrt(6)*sqrt(14)) = 0.87;
and the normalized weights are:
M6_sim_weight = 0.99/(0.99 + 0.967 + 0.87) = 0.35;
Bert_sim_weight = 0.967/(0.99 + 0.967 + 0.87) = 0.34;
Resnet_sim_weight = 0.87/(0.99 + 0.967 + 0.87) = 0.31.
The fusion feature that can then be obtained for input into the DNN network model is: M6_sim_weight * M6_Fea1 + Bert_sim_weight * Bert_Fea2 + Resnet_sim_weight * Resnet_Fea3; the DNN model then processes this fusion feature, so that the commodity category information can be obtained.
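The arithmetic of this example can be checked with a few lines of numpy (the cosine-similarity reading of the scores matches the formulas above):

import numpy as np

fea = {"M6": np.array([3., 1., 5.]),
       "Bert": np.array([1., 1., 3.]),
       "Resnet": np.array([2., 1., 1.])}
center = sum(fea.values()) / 3          # feature center point -> [2. 1. 3.]

scores = {k: v @ center / (np.linalg.norm(v) * np.linalg.norm(center))
          for k, v in fea.items()}      # ~0.994, ~0.967, ~0.873
total = sum(scores.values())
weights = {k: s / total for k, s in scores.items()}  # ~0.35, ~0.34, ~0.31

fusion = sum(weights[k] * fea[k] for k in fea)       # input to the DNN model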
According to the technical scheme provided by this application embodiment, in view of the richness of commodity image-text data, the commodity image-text features are extracted through multi-modal fusion technologies such as the M6 model, while the text features and image features are additionally extracted and fused separately: the classic Bert model from the natural language processing field and the classic Resnet model from the image field are combined to extract the Bert text feature and the Resnet image feature of the commodity, respectively. Fusion at the feature level is thus effectively realized, which avoids the special processing otherwise needed when the classification results corresponding to the image feature and the text feature are inconsistent. The Bert model extracts text features with a commodity expression effect superior to that of the TextCNN model. After the features are spliced, the DNN network model further improves the accuracy of commodity classification, so that the method is significantly improved compared with a single M6, Bert or Resnet model, is more robust in commodity feature expression, improves the quality and effect of commodity identification, further improves the practicability of the method, and is favorable for market popularization and application.
Fig. 12 is a schematic structural diagram of a commodity classification device according to an embodiment of the present invention. Referring to fig. 12, this embodiment provides a commodity classification device that can perform the commodity classification method shown in fig. 2, and the device may include:
the first acquisition module 11 is configured to acquire commodity information to be processed, where the commodity information includes image information and text information;
the first extraction module 12 is used for performing feature extraction on the commodity information by using the image-text fusion feature model to obtain the image-text fusion feature corresponding to the commodity information;
the first extraction module 12 is further configured to perform feature extraction on the image information by using the image feature extraction model to obtain image features corresponding to the commodity information;
the first extraction module 12 is further configured to perform feature extraction on the text information by using the text feature extraction model to obtain text features corresponding to the commodity information;
and the first processing module 13 is configured to process the image-text fusion features, the image features, and the text features by using the deep neural network model, and obtain category information corresponding to the commodity information.
In some examples, when the first extraction module 12 performs feature extraction on the commodity information by using the image-text fusion feature model to obtain the image-text fusion feature corresponding to the commodity information, the first extraction module 12 is configured to perform: segmenting the image information to obtain a plurality of sub-images corresponding to the image information; determining image positions corresponding to the plurality of sub-images; performing word segmentation processing on the text information to obtain a plurality of word segmentation sequences corresponding to the text information; determining word cutting positions corresponding to the word cutting sequences respectively; and performing feature extraction on the plurality of sub-images, the image positions, the plurality of word cutting sequences and the word cutting positions by using the image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information.
In some examples, when the first extraction module 12 performs feature extraction on the plurality of sub-images, the image positions, the plurality of word cutting sequences and the word cutting positions by using the image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information, the first extraction module 12 is configured to perform: covering parts of the word segmentation sequences to obtain part of the word segmentation sequences; and performing feature extraction on the plurality of sub-images, the image positions, the plurality of word cutting sequences, the partial word cutting sequences and the word cutting positions by using the image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information.
In some examples, before the image-text fusion feature, the image feature and the text feature are processed by using the deep neural network model to obtain the category information corresponding to the commodity information, the first processing module 13 in this embodiment is configured to perform: identifying whether the dimensionality of the image-text fusion feature, the dimensionality of the image feature and the dimensionality of the text feature are consistent; and when they are not consistent, adjusting the dimensionality of the image-text fusion feature, the dimensionality of the image feature and the dimensionality of the text feature to be consistent.
In some examples, before the image-text fusion feature, the image feature and the text feature are processed by using the deep neural network model to obtain the category information corresponding to the commodity information, the first processing module 13 in this embodiment is configured to perform: determining a feature central point corresponding to the commodity information according to the image-text fusion feature, the image feature and the text feature; and determining a target image-text feature, a target image feature and a target text feature corresponding to the commodity information based on the image-text fusion feature, the image feature, the text feature and the feature central point.
In some examples, when the first processing module 13 determines the feature center point corresponding to the commodity information according to the image-text fusion feature, the image feature and the text feature, the first processing module 13 is configured to perform: and determining the feature average value of the image-text fusion feature, the image feature and the text feature as a feature central point corresponding to the commodity information.
In some examples, when the first processing module 13 determines the target image-text feature, the target image feature and the target text feature corresponding to the commodity information based on the image-text fusion feature, the image feature, the text feature and the feature center point, the first processing module 13 is configured to perform: acquiring a first distance between the image-text fusion feature and the feature central point, a second distance between the image feature and the feature central point, and a third distance between the text feature and the feature central point; respectively determining a first probability, a second probability and a third probability corresponding to the image-text fusion feature, the image feature and the text feature based on the first distance, the second distance and the third distance; and determining a target image-text characteristic, a target image characteristic and a target text characteristic corresponding to the commodity information based on the first probability, the second probability, the third probability, the image-text fusion characteristic, the image characteristic and the text characteristic.
In some examples, when the first processing module 13 determines the target image-text feature, the target image feature and the target text feature corresponding to the commodity information based on the first probability, the second probability, the third probability, the image-text fusion feature, the image feature and the text feature, the first processing module 13 is configured to perform: determining the product value between the first probability and the image-text fusion characteristic as a target image-text characteristic; determining the product value between the second probability and the image characteristic as a target image characteristic; and determining the product value between the third probability and the text feature as the target text feature.
In some examples, when the first processing module 13 processes the image-text fusion feature, the image feature and the text feature by using the deep neural network model to obtain the category information corresponding to the commodity information, the first processing module 13 is configured to perform: splicing the image-text fusion characteristics, the image characteristics and the text characteristics to obtain spliced characteristics; and processing the spliced features by using the deep neural network model to obtain the category information corresponding to the commodity information.
In some examples, before the image-text fusion feature, the image feature and the text feature are processed by using the deep neural network model to obtain the category information corresponding to the commodity information, the first obtaining module 11 and the first processing module 13 in this embodiment may be further configured to perform the following steps:
a first obtaining module 11, configured to obtain at least two network layers for defining a deep neural network model;
a first processing module 13, configured to determine the number of hidden nodes included in each network layer, where the number of hidden nodes included in a subsequent network layer is less than the number of hidden nodes included in a previous network layer; and determining a deep neural network model for processing the image-text fusion characteristics, the image characteristics and the text characteristics based on the number of hidden nodes included in each network layer.
In some examples, after acquiring the at least two network layers for defining the deep neural network model, the first acquiring module 11 and the first processing module 13 in this embodiment may be further configured to perform the following steps:
a first obtaining module 11, configured to obtain a connection probability between a node in a network layer and a node in an adjacent network layer;
and the first processing module 13 is configured to determine, based on the connection probability, a target node in the adjacent network layer for connecting with a node in the network layer, where the target node includes at least a part of nodes in the adjacent network layer.
In some examples, when the first processing module 13 determines the target node in the adjacent network layer for connecting with the node in the network layer based on the connection probability, the first processing module 13 is configured to perform: acquiring a plurality of node sets in adjacent network layers, wherein each node set comprises at least two nodes; and determining target nodes used for being connected with the nodes in the network layer in the adjacent network layer based on the connection probability and the plurality of node sets, wherein any two target nodes have different node sets.
The apparatus shown in fig. 12 can perform the methods of the embodiments shown in figs. 1-11; for parts of this embodiment not described in detail, reference may be made to the related descriptions of the embodiments shown in figs. 1-11. For the implementation process and technical effect of this technical solution, refer to the descriptions in the embodiments shown in figs. 1-11, which are not repeated here.
In one possible design, the structure of the commodity classification device shown in fig. 12 may be implemented as an electronic device, which may be a controller, a personal computer, a server or another device. As shown in fig. 13, the electronic device may include: a first processor 21 and a first memory 22, wherein the first memory 22 is used for storing a program that enables the electronic device to execute the commodity classification method provided in the embodiments shown in figs. 1-11, and the first processor 21 is configured to execute the program stored in the first memory 22.
The program comprises one or more computer instructions, wherein the one or more computer instructions, when executed by the first processor 21, are capable of performing the steps of: acquiring commodity information to be processed, wherein the commodity information comprises image information and text information; feature extraction is carried out on the commodity information by utilizing the image-text fusion feature model, and image-text fusion features corresponding to the commodity information are obtained; performing feature extraction on the image information by using an image feature extraction model to obtain image features corresponding to the commodity information; performing feature extraction on the text information by using a text feature extraction model to obtain text features corresponding to the commodity information; and processing the image-text fusion characteristics, the image characteristics and the text characteristics by using the deep neural network model to obtain the category information corresponding to the commodity information.
Further, the first processor 21 is also used to execute all or part of the steps in the embodiments shown in fig. 1-11.
The electronic device may further include a first communication interface 23 for communicating with other devices or a communication network.
In addition, an embodiment of the present invention provides a computer storage medium for storing computer software instructions for the electronic device, which include a program for executing the commodity classification method in the method embodiments shown in figs. 1-11.
Furthermore, an embodiment of the present invention provides a computer program product, including: a computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the commodity classification method in the method embodiments shown in figs. 1-11.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, or of course by a combination of hardware and software. Based on this understanding, the parts of the above technical solutions that in essence contribute to the prior art may be embodied in the form of a computer program product, which may be embodied on one or more computer-usable storage media (including, without limitation, disk storage, CD-ROM and optical storage) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (14)

1. A method of classifying a commodity, comprising:
acquiring commodity information to be processed, wherein the commodity information comprises image information and text information;
feature extraction is carried out on the commodity information by utilizing a picture-text fusion feature model, and picture-text fusion features corresponding to the commodity information are obtained;
performing feature extraction on the image information by using an image feature extraction model to obtain image features corresponding to the commodity information;
performing feature extraction on the text information by using a text feature extraction model to obtain text features corresponding to the commodity information;
and processing the image-text fusion characteristics, the image characteristics and the text characteristics by using a deep neural network model to obtain the category information corresponding to the commodity information.
2. The method of claim 1, wherein performing feature extraction on the commodity information by using an image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information comprises:
performing segmentation processing on the image information to obtain a plurality of sub-images corresponding to the image information;
determining image positions corresponding to the plurality of sub-images;
performing word segmentation processing on the text information to obtain a plurality of word segmentation sequences corresponding to the text information;
determining word cutting positions corresponding to the word cutting sequences respectively;
and performing feature extraction on the plurality of sub-images, the image positions, the plurality of word cutting sequences and the word cutting positions by using the image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information.
3. The method of claim 2, wherein the feature extraction of the plurality of sub-images, the image positions, the word segmentation sequences and the word segmentation positions by using the image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information comprises:
covering parts of the word segmentation sequences to obtain partial word segmentation sequences;
and performing feature extraction on the plurality of sub-images, the image positions, the plurality of word cutting sequences, part of the word cutting sequences and the word cutting positions by using the image-text fusion feature model to obtain image-text fusion features corresponding to the commodity information.
4. The method of claim 1, wherein before processing the image-text fusion feature, the image feature and the text feature using a deep neural network model to obtain category information corresponding to the commodity information, the method further comprises:
identifying whether the dimensionality of the image-text fusion feature, the dimensionality of the image feature and the dimensionality of the text feature are consistent;
and when the dimensionality of the image-text fusion feature, the dimensionality of the image feature and the dimensionality of the text feature are not consistent, adjusting the dimensionality of the image-text fusion feature, the dimensionality of the image feature and the dimensionality of the text feature to be consistent.
5. The method of claim 1, wherein before processing the image-text fusion feature, the image feature and the text feature using a deep neural network model to obtain category information corresponding to the commodity information, the method further comprises:
determining a feature central point corresponding to the commodity information according to the image-text fusion feature, the image feature and the text feature;
and determining a target image-text characteristic, a target image characteristic and a target text characteristic corresponding to the commodity information based on the image-text fusion characteristic, the image characteristic, the text characteristic and the characteristic central point.
6. The method of claim 5, wherein determining a feature central point corresponding to the commodity information according to the image-text fusion feature, the image feature and the text feature comprises:
and determining the feature average value of the image-text fusion feature, the image feature and the text feature as a feature central point corresponding to the commodity information.
7. The method of claim 5, wherein determining a target image-text feature, a target image feature and a target text feature corresponding to the commodity information based on the image-text fusion feature, the image feature, the text feature and the feature central point comprises:
acquiring a first distance between the image-text fusion feature and the feature central point, a second distance between the image feature and the feature central point, and a third distance between the text feature and the feature central point;
respectively determining a first probability, a second probability and a third probability corresponding to the image-text fusion feature, the image feature and the text feature based on the first distance, the second distance and the third distance;
and determining a target image-text characteristic, a target image characteristic and a target text characteristic corresponding to the commodity information based on the first probability, the second probability, the third probability, the image-text fusion characteristic, the image characteristic and the text characteristic.
8. The method of claim 7, wherein determining a target image-text feature, a target image feature and a target text feature corresponding to the commodity information based on the first probability, the second probability, the third probability, the image-text fusion feature, the image feature and the text feature comprises:
determining the product value between the first probability and the image-text fusion characteristic as the target image-text characteristic;
determining a product value between the second probability and the image feature as the target image feature;
and determining the product value between the third probability and the text feature as the target text feature.
9. The method of claim 1, wherein the processing the image-text fusion feature, the image feature and the text feature by using a deep neural network model to obtain category information corresponding to the commodity information comprises:
splicing the image-text fusion characteristics, the image characteristics and the text characteristics to obtain spliced characteristics;
and processing the spliced features by utilizing a deep neural network model to obtain the category information corresponding to the commodity information.
10. The method of claim 1, wherein before processing the image-text fusion feature, the image feature and the text feature using the deep neural network model to obtain the category information corresponding to the commodity information, the method further comprises:
obtaining at least two network layers for defining the deep neural network model;
determining the number of hidden nodes included in each network layer, wherein the number of hidden nodes included in the next network layer is less than the number of hidden nodes included in the previous network layer;
and determining the deep neural network model for processing the image-text fusion feature, the image feature and the text feature based on the number of hidden nodes included in each network layer.
11. The method of claim 10, wherein after obtaining at least two network layers for defining the deep neural network model, the method further comprises:
acquiring the connection probability between the nodes of the network layer and the nodes in the adjacent network layer;
and determining a target node in the adjacent network layer for connecting with the nodes in the network layer based on the connection probability, wherein the target node at least comprises part of nodes in the adjacent network layer.
12. The method of claim 11, wherein determining a target node in a neighboring network layer for connecting to a node in the network layer based on the connection probability comprises:
acquiring a plurality of node sets in the adjacent network layer, wherein each node set comprises at least two nodes;
and determining target nodes used for being connected with the nodes in the network layer in the adjacent network layer based on the connection probability and the plurality of node sets, wherein any two target nodes have different node sets.
13. An electronic device, comprising: a memory, a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the processor, implement the method of classifying an item of merchandise according to any one of claims 1-12.
14. A computer storage medium storing a computer program which causes a computer to execute a method of classifying an article according to any one of claims 1 to 12.
CN202210406335.8A 2022-04-18 2022-04-18 Commodity classification method, equipment and computer storage medium Pending CN114548323A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210406335.8A CN114548323A (en) 2022-04-18 2022-04-18 Commodity classification method, equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210406335.8A CN114548323A (en) 2022-04-18 2022-04-18 Commodity classification method, equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN114548323A true CN114548323A (en) 2022-05-27

Family

ID=81667515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210406335.8A Pending CN114548323A (en) 2022-04-18 2022-04-18 Commodity classification method, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN114548323A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10949907B1 (en) * 2020-06-23 2021-03-16 Price Technologies Inc. Systems and methods for deep learning model based product matching using multi modal data
CN112231473A (en) * 2020-09-29 2021-01-15 河海大学 Commodity classification method based on multi-mode deep neural network model
CN112767366A (en) * 2021-01-22 2021-05-07 南京汇川图像视觉技术有限公司 Image recognition method, device and equipment based on deep learning and storage medium
CN113792786A (en) * 2021-09-14 2021-12-14 广州华多网络科技有限公司 Automatic commodity object classification method and device, equipment, medium and product thereof
CN113806537A (en) * 2021-09-14 2021-12-17 广州华多网络科技有限公司 Commodity category classification method and device, equipment, medium and product thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JUNYANG LIN等: "M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining", 《IEEE》 *
KETKI GUPTE等: "Multimodal Product Matching and Category Mapping: Text+Image based Deep Neural Network", 《IEEE》 *
薛黎明等: "《基于云模型的矿产资源可持续耦合评价研究》", 31 October 2018 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116563573A (en) * 2023-01-12 2023-08-08 北京爱咔咔信息技术有限公司 Method, device, equipment and storage medium for matching commodity with price tag
CN116563573B (en) * 2023-01-12 2023-10-13 北京爱咔咔信息技术有限公司 Method, device, equipment and storage medium for matching commodity with price tag

Similar Documents

Publication Publication Date Title
Cambria et al. Benchmarking multimodal sentiment analysis
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
CN110083741B (en) Character-oriented video abstract extraction method based on text and image combined modeling
US20190026367A1 (en) Navigating video scenes using cognitive insights
US20230376527A1 (en) Generating congruous metadata for multimedia
CN110717470B (en) Scene recognition method and device, computer equipment and storage medium
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN114155543A (en) Neural network training method, document image understanding method, device and equipment
CN113205047B (en) Medicine name identification method, device, computer equipment and storage medium
US11756301B2 (en) System and method for automatically detecting and marking logical scenes in media content
CN114390217A (en) Video synthesis method and device, computer equipment and storage medium
CN114465737A (en) Data processing method and device, computer equipment and storage medium
CN115496820A (en) Method and device for generating image and file and computer storage medium
CN115564469A (en) Advertisement creative selection and model training method, device, equipment and storage medium
CN111488813A (en) Video emotion marking method and device, electronic equipment and storage medium
CN114548323A (en) Commodity classification method, equipment and computer storage medium
CN114187486A (en) Model training method and related equipment
CN113989476A (en) Object identification method and electronic equipment
CN111612284A (en) Data processing method, device and equipment
CN116956117A (en) Method, device, equipment, storage medium and program product for identifying label
US20230162502A1 (en) Text-based framework for video object selection
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
CN114780757A (en) Short media label extraction method and device, computer equipment and storage medium
CN115269781A (en) Modal association degree prediction method, device, equipment, storage medium and program product
CN114299295A (en) Data processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20220527)