CN114662492A

CN114662492A - Product word processing method and device, equipment, medium and product thereof

Info

Publication number: CN114662492A
Application number: CN202210398108.5A
Authority: CN
Inventors: 黄丕帅
Original assignee: Guangzhou Huanju Shidai Information Technology Co Ltd
Current assignee: Guangzhou Huanju Shidai Information Technology Co Ltd
Priority date: 2022-04-13
Filing date: 2022-04-13
Publication date: 2022-06-24

Abstract

The application discloses a product word processing method and a corresponding device, equipment, medium and product, wherein the method comprises the following steps: performing word segmentation on a commodity title to obtain a plurality of ordered word segments that form a word segment sequence; calculating the data distance between the semantic feature vector of each word segment and the semantic feature vector of the commodity title, and taking that data distance as the similarity score of the word segment; for each word segment that hits a lemma in a preset product dictionary, quantitatively determining a ranking score according to the ranking information of that word segment in the word segment sequence; and outputting the word segment with the highest comprehensive score as the product word of the commodity title, wherein the comprehensive score is the sum of the similarity score and the ranking score of the corresponding word segment. With the method and the device, the product word corresponding to a given commodity title can be determined conveniently, efficiently and accurately, providing a basic service for downstream tasks such as commodity search, commodity advertisement placement and commodity collection for the independent sites served by an e-commerce platform, thereby improving the service experience of the e-commerce platform.

Description

Product word processing method and device, equipment, medium and product thereof

Technical Field

The present application relates to the field of e-commerce information technologies, and in particular, to a product word processing method and a corresponding apparatus, computer device, computer readable storage medium, and computer program product.

Background

To serve the needs of an e-commerce platform such as commodity classification, commodity search and advertisement placement, related commodities often have to be retrieved from a commodity database according to given keywords. In practice, corresponding product words are predetermined for each commodity to facilitate indexing and matching of the commodities.

Determining product words for an item is typically based on the title of the item, for example:

One common way is to match commodity titles against preset product words: when a commodity title contains a preset product word, that preset product word is taken as the product word of the commodity title. The effect of this approach is limited, because pre-collected product words can hardly cover the expressions used for the massive number of real products.

Another way is to extract a semantic feature representation of the commodity title with a deep learning model and perform classification mapping on it, so as to determine the product word corresponding to each commodity title.

Furthermore, in the cross-border e-commerce scenario, each online shop of the e-commerce platform is deployed as an independent site. In actual operation, when an independent site writes commodity-related information such as a commodity title, it often organizes the wording according to its own language habits, so that the name of the same product may appear in several different textual forms, which further reduces the effect of identifying the product word of a commodity title with the conventional techniques.

In summary, the conventional techniques have limited success in determining product words for product titles, and it is difficult to determine corresponding product words for product titles quickly and efficiently, and related techniques still have room for improvement.

Disclosure of Invention

A primary object of the present application is to solve at least one of the above problems and provide a product word processing method and a corresponding apparatus, computer device, computer readable storage medium, and computer program product.

In order to meet various purposes of the application, the following technical scheme is adopted in the application:

A product word processing method adapted to one of the objects of the present application includes the following steps:

performing word segmentation on the commodity title to obtain a plurality of ordered word segments that form a word segment sequence;

calculating the data distance between the semantic feature vector of each word segment and the semantic feature vector of the commodity title, and taking the data distance as the similarity score of the word segment;

for each word segment that hits a lemma in a preset product dictionary, quantitatively determining a ranking score according to the ranking information of that word segment in the word segment sequence;

and outputting the word segment with the highest comprehensive score as the product word of the commodity title, wherein the comprehensive score is the sum of the similarity score and the ranking score of the corresponding word segment.

In some further embodiments, performing word segmentation on the commodity title to obtain a plurality of ordered word segments that form a word segment sequence includes the following steps:

acquiring a commodity title submitted by a user;

performing word segmentation on the commodity title with a preset word segmentation algorithm to obtain a plurality of word segments;

and constructing the word segments into a word segment sequence according to their order in the commodity title, and representing the ranking information of each word segment by its rank value in the word segment sequence.
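
The construction of the word segment sequence can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the function name `build_segment_sequence`, the use of 0-based rank values, and the tie-breaking rule for word segments that start at the same offset are all assumptions made for the example.

```python
from typing import List, Tuple


def build_segment_sequence(title: str, segments: List[str]) -> List[Tuple[int, str]]:
    """Order word segments by where they occur in the title and attach rank values.

    Assumption: the rank value is the 0-based index in the ordered sequence.
    """
    # Locate the first occurrence of each segment; ignore segments absent from the title.
    located = [(title.find(seg), seg) for seg in segments if seg in title]
    # Sort by start offset; for equal offsets, put the shorter segment first (assumed rule).
    located.sort(key=lambda item: (item[0], len(item[1])))
    return [(rank, seg) for rank, (_, seg) in enumerate(located)]


# Hypothetical English title and unordered segments, for illustration only.
title = "winter men's sports suit"
sequence = build_segment_sequence(
    title, ["suit", "winter", "sports suit", "sports", "men's"]
)
```

Keeping the rank value next to each word segment lets the later ranking score step look up a position-dependent weight without re-scanning the title.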

In some further embodiments, calculating the data distance between the semantic feature vector of each word segment and the semantic feature vector of the commodity title, and taking the data distance as the similarity score of the word segment, includes the following steps:

performing word embedding on each word segment and on the commodity title respectively to obtain the embedded vectors corresponding to each word segment and to the commodity title;

performing representation learning on the embedded vectors corresponding to the word segments and the commodity title respectively, using a text feature extraction model trained to a convergence state, to obtain the corresponding semantic feature vectors;

and calculating the data distance between the semantic feature vector of each word segment and the semantic feature vector of the commodity title, and taking the data distance as the similarity score of the word segment.

In some further embodiments, quantitatively determining the ranking score of each word segment that hits a lemma in the preset product dictionary, according to the ranking information of that word segment in the word segment sequence, includes the following steps:

determining the commodity classification corresponding to the commodity title according to the semantic feature vector of the commodity title;

detecting whether each word segment contains at least one lemma in the product dictionary preset for the commodity classification, and, when a word segment contains such a lemma, determining the word segment as an optional word segment that hits a lemma in the product dictionary;

and determining the rank value of each optional word segment in the word segment sequence, and setting the preset weight associated with that rank value as the ranking score of the optional word segment.
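
The ranking score step above can be sketched as follows. The sketch assumes that the commodity classification has already been determined and that its lemmas are passed in as a set; the function name `ranking_scores` and the weight table are hypothetical, since the text does not state which weight is associated with which rank value.

```python
from typing import Dict, List, Set


def ranking_scores(
    sequence: List[str],
    lemmas: Set[str],
    rank_weights: Dict[int, float],
) -> Dict[str, float]:
    """Give each word segment that hits the product dictionary its rank-based weight."""
    scores: Dict[str, float] = {}
    for rank, segment in enumerate(sequence):
        # A word segment hits the dictionary when it contains at least one lemma.
        if any(lemma in segment for lemma in lemmas):
            # The preset weight associated with the rank value becomes the ranking score.
            scores[segment] = rank_weights.get(rank, 0.0)
    return scores


# Illustrative values only: the lemma set and the weights are assumptions.
sequence = ["winter", "men's", "sports", "sports suit", "suit"]
weights = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4, 4: 0.5}
hits = ranking_scores(sequence, {"suit"}, weights)
```

Word segments that miss the dictionary are simply absent from the result, which matches the idea that only optional word segments receive a ranking score.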

In some further embodiments, outputting the word segment with the highest comprehensive score as the product word of the commodity title includes the following steps:

calculating the sum of the similarity score and the ranking score of each word segment that hits a lemma of the product dictionary to obtain the comprehensive score of the word segment;

sorting the word segments that hit the product dictionary in descending order of comprehensive score, and determining the first word segment as the product word of the commodity title;

and outputting the product word.
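
The selection by comprehensive score can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `pick_product_word` and all score values are hypothetical, and returning `None` when no word segment hits the dictionary is a choice made for the example, not something the text specifies.

```python
from typing import Dict, Optional


def pick_product_word(
    similarity: Dict[str, float],
    ranking: Dict[str, float],
) -> Optional[str]:
    """Return the dictionary-hitting word segment with the highest comprehensive score."""
    # Only word segments with a ranking score (dictionary hits) are candidates.
    combined = {
        segment: similarity.get(segment, 0.0) + rank_score
        for segment, rank_score in ranking.items()
    }
    if not combined:
        return None  # assumed behaviour when nothing hits the dictionary
    # Sort in descending order of comprehensive score and take the first entry.
    ordered = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ordered[0][0]


# Hypothetical scores for illustration.
similarity = {"winter": 0.55, "sports suit": 0.9, "suit": 0.7}
ranking = {"sports suit": 0.4, "suit": 0.5}
product_word = pick_product_word(similarity, ranking)
```

In this example "sports suit" scores 0.9 + 0.4 and "suit" scores 0.7 + 0.5, so the semantic dimension outweighs the slightly higher positional weight of "suit".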

In some expanded embodiments, before the step of quantitatively determining the ranking score according to the ranking information, in the word segment sequence, of the word segments that hit a lemma in the preset product dictionary, the method includes the following step:

extracting a plurality of lemmas from the product words pre-collected for each commodity classification, and storing the lemmas to construct the product dictionary of the corresponding commodity classification.

In some expanded embodiments, after the step of outputting the word segment with the highest comprehensive score as the product word of the commodity title, the method further includes the following steps:

retrieving, from a commodity database and according to the product word of the commodity title, target commodities whose product words are identical or semantically similar to that product word;

and pushing the commodity information of the target commodities to the terminal device that submitted the commodity title.

The product word processing device comprises a word segmentation processing module, a similarity score module, a ranking score module and a word determination module, wherein: the word segmentation processing module is used for performing word segmentation on the commodity title to obtain a plurality of ordered word segments that form a word segment sequence; the similarity score module is used for calculating the data distance between the semantic feature vector of each word segment and the semantic feature vector of the commodity title, and taking the data distance as the similarity score of the word segment; the ranking score module is used for quantitatively determining, for each word segment that hits a lemma in the preset product dictionary, a ranking score according to the ranking information of that word segment in the word segment sequence; and the word determination module is used for outputting the word segment with the highest comprehensive score as the product word of the commodity title, the comprehensive score being the sum of the similarity score and the ranking score of the corresponding word segment.

In some further embodiments, the word segmentation processing module includes: a title acquisition unit, used for acquiring a commodity title submitted by a user; a word segmentation execution unit, used for segmenting the commodity title with a preset word segmentation algorithm to obtain a plurality of word segments; and a ranking representation unit, used for constructing the word segments into a word segment sequence according to their order in the commodity title, and representing the ranking information of each word segment by its rank value in the word segment sequence.

In some further embodiments, the similarity score module includes: an encoding processing unit, used for performing word embedding on each word segment and on the commodity title respectively to obtain the corresponding embedded vectors; a representation learning unit, used for performing representation learning on the embedded vectors of the word segments and the commodity title respectively, using a text feature extraction model trained to a convergence state, to obtain the corresponding semantic feature vectors; and a distance calculation unit, used for calculating the data distance between the semantic feature vector of each word segment and the semantic feature vector of the commodity title, and taking the data distance as the similarity score of the word segment.

In some further embodiments, the ranking score module includes: a classification mapping unit, used for determining the commodity classification corresponding to the commodity title according to the semantic feature vector of the commodity title; a segment hit unit, used for detecting whether each word segment contains at least one lemma in the product dictionary preset for the commodity classification, and, when a word segment contains such a lemma, determining the word segment as an optional word segment that hits a lemma in the product dictionary; and a ranking score unit, used for determining the rank value of each optional word segment in the word segment sequence and setting the preset weight associated with that rank value as the ranking score of the optional word segment.

In some further embodiments, the word determination module includes: a score integration unit, used for calculating the sum of the similarity score and the ranking score of each word segment that hits a lemma of the product dictionary to obtain the comprehensive score of the word segment; a sorting optimization unit, used for sorting the word segments that hit the product dictionary in descending order of comprehensive score and determining the first word segment as the product word of the commodity title; and a result output unit, used for outputting the product word.

In some expanded embodiments, the product word processing device further includes a dictionary construction module that operates before the ranking score module and is used for extracting a plurality of lemmas from the product words pre-collected for each commodity classification, and storing the lemmas to construct the product dictionary of the corresponding commodity classification.

In an expanded embodiment, the product word processing device of the present application further includes the following modules, which operate after the word determination module: a retrieval execution module, used for retrieving from the commodity database, according to the product word of the commodity title, target commodities whose product words are identical or semantically similar to that product word; and a commodity pushing module, used for pushing the commodity information of the target commodities to the terminal device that submitted the commodity title.

A computer device adapted for one of the purposes of the present application includes a central processing unit and a memory, the central processing unit being configured to invoke execution of a computer program stored in the memory to perform the steps of the product word processing method described herein.

A computer-readable storage medium, which stores a computer program implemented according to the product word processing method in the form of computer-readable instructions, and which, when called by a computer, performs the steps included in the method, is provided for another purpose of this application.

A computer program product, provided to adapt to another object of the present application, comprises computer programs/instructions which, when executed by a processor, implement the steps of the method described in any of the embodiments of the present application.

Compared with the prior art, the technical scheme of the application at least comprises the following technical advantages:

First, the method obtains a word segment sequence by simply segmenting a given commodity title, then determines the similarity score of each word segment according to the semantic data distance between the word segment and the commodity title, and determines the ranking score of each word segment that hits a lemma in the preset product dictionary according to the position information implied by its natural order in the word segment sequence. The similarity score and the ranking score of each word segment are added to obtain its comprehensive score, which quantifies the importance of the word segment along two dimensions, semantics and position, so that the word segments can be ranked and the one with the highest comprehensive score is determined as the product word of the commodity title.

Secondly, in the process of determining the ranking score, each word segment is matched against the lemmas preset in the product dictionary to determine whether it hits the product dictionary; a word segment that hits obtains a corresponding ranking score, while one that misses obtains none. Through its lemmas, the product dictionary compatibly recognizes the different variants that arise when different online shops express the product word of the same commodity in different ways. It is therefore readily understood that the method can achieve extremely high recognition accuracy with the assistance of the product dictionary.

In addition, because the technical scheme of the application can identify the product word of a given commodity title effectively and accurately, it can be deployed as a basic service of the e-commerce platform and called by each independent site of the platform, thereby providing effective service for the needs of each online shop, such as commodity search, commodity advertisement placement and commodity collection.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic flow chart diagram of an exemplary embodiment of a product word processing method of the present application;

FIG. 2 is a flowchart illustrating a process of segmenting a title of a commodity according to an embodiment of the present application;

FIG. 3 is a schematic flow chart illustrating a process for calculating a similarity score according to an embodiment of the present application;

FIG. 4 is a schematic flow chart illustrating a process for calculating a rank score according to an embodiment of the present application;

FIG. 5 is a flowchart illustrating a process of calculating a composite score to determine a product term according to an embodiment of the present application;

FIG. 6 is a schematic flow chart diagram of an expanded embodiment of the product word processing method of the present application;

FIG. 7 is a functional block diagram of a product word processing apparatus of the present application;

FIG. 8 is a schematic structural diagram of a computer device used in the present application.

Detailed Description

Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

As used herein, the singular forms "a", "an", "said" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.

It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

As will be appreciated by those skilled in the art, "client," "terminal," and "terminal device" as used herein include both devices having only a wireless signal receiver without transmit capability and devices having receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device, such as a personal computer or tablet, with a single-line display, a multi-line display, or no multi-line display; a PCS (Personal Communications Service) device, which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; and a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "client" or "terminal device" can be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location on earth and/or in space. A "client" or "terminal device" as used herein may also be a communication terminal, a web terminal, or a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a mobile phone with a music/video playing function, and may also be a smart TV, a set-top box, and the like.

The hardware referred to by names such as "server", "client" and "service node" in the present application is essentially an electronic device with the capabilities of a personal computer, that is, a hardware device having the components required by the von Neumann architecture, such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device and an output device. A computer program is stored in the memory; the central processing unit loads the program from external storage into internal memory, executes the instructions in the program, and interacts with the input and output devices to accomplish specific functions.

It should be noted that the concept of "server" in the present application extends to server clusters. According to network deployment principles understood by those skilled in the art, the servers are divided logically; physically, they may be independent of one another yet callable through interfaces, or may be integrated into one physical computer or one computer cluster. Those skilled in the art will appreciate such variations, which should not be taken to restrict the network deployment of the present application.

Unless expressly specified otherwise, one or more technical features of the present application may be deployed on a server and accessed by a client remotely invoking an online service interface provided by the server, or may be deployed and run directly on the client.

Unless expressly specified otherwise, the neural network models referred to, or possibly referred to, in the application may be deployed on a remote server and invoked remotely by a client, or may be deployed on a client with sufficient device capability and invoked directly.

Unless expressly specified otherwise, the various data referred to in the present application may be stored remotely on a server or locally on a terminal device, as long as the data can be called by the technical solution of the present application.

Those skilled in the art will understand that, although the various methods of the present application are described on the basis of the same concept and are thus common to one another, they may be performed independently unless otherwise specified. Likewise, the embodiments disclosed in the present application are proposed on the basis of the same inventive concept; therefore, concepts expressed identically, and concepts whose expressions differ only for convenience, should be understood as equivalent.

Unless a mutually exclusive relationship between related technical features is expressly stated, the embodiments disclosed herein can be flexibly constructed by cross-combining the related technical features of the embodiments, as long as the combination does not depart from the inventive spirit of the present application and can meet a need in the prior art or remedy a deficiency of the prior art. Those skilled in the art will appreciate such variations.

The product word processing method can be programmed as a computer program product and deployed to run on a client or a server. In e-commerce platform application scenarios, including live-streaming e-commerce, it is generally deployed on a server, so that after the computer program product is running, the method can be executed by accessing its open interface and interacting with its process through a graphical user interface.

Referring to FIG. 1, the product word processing method of the present application, in an exemplary embodiment, includes the following steps:

Step S1100, performing word segmentation on the commodity title to obtain a plurality of ordered word segments that form a word segment sequence:

Word segmentation can start when a downstream task that needs a product word determined from a commodity title provides that commodity title. The commodity title is generally the title used to describe a commodity in an online shop of the e-commerce platform. The downstream task may be a commodity search task, a commodity advertisement placement task, or a commodity collection task. The commodity search task searches, according to a commodity title, for similar commodities with the same or similar product words. The commodity advertisement placement task is similar: according to the commodity title of a commodity accessed by a user, it matches commodities whose commodity titles have semantically similar product words. The commodity collection task mainly aggregates a plurality of commodities according to information shared by their titles, such as the product word. All such downstream tasks can rely on the product word obtained after a commodity title is processed by the application, so a commodity title requiring product word recognition can be submitted by any of these downstream tasks.

After the commodity title is obtained, it can be segmented in a conventional manner. Optional word segmentation methods include, but are not limited to, mechanical segmentation based on string matching, statistics-based segmentation, and understanding-based segmentation. Statistics-based segmentation may include algorithms based on word frequency statistics, sequence probability, deep learning, and the like. More specifically, any popular tool or model such as N-Gram, Jieba, HMM or TF-IDF may be used to segment the commodity title into a plurality of word segments. For example, for the commodity title "classic style of winter mature men's colorful sports suit", segmentation yields a plurality of word segments, which are arranged according to their inherent order in the commodity title to form a word segment sequence, expressed as follows:

{ winter; mature; mature men; mature men's clothing; men's clothing; colorful; sports; sports suit; suit; classic; classic style }

Of course, after the same product title is segmented by different segmentation methods, the specific segmentation in the obtained segmentation sequence may be slightly different, and those skilled in the art will naturally understand this, and should not limit the scope covered by the inventive spirit of the present application by the examples herein.
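
One way to obtain overlapping word segments like those in the example sequence is a simple dictionary-matching segmenter that emits every vocabulary entry found in the title. The sketch below is an illustration only, not the segmentation algorithm of the application: it assumes a whitespace-tokenized English title, and the function name `full_mode_segments`, the vocabulary and the `max_words` limit are hypothetical.

```python
from typing import List, Set


def full_mode_segments(title: str, vocabulary: Set[str], max_words: int = 3) -> List[str]:
    """Emit every vocabulary entry found in the title, in order of appearance."""
    tokens = title.split()
    segments: List[str] = []
    for start in range(len(tokens)):
        for length in range(1, max_words + 1):
            end = start + length
            if end > len(tokens):
                break
            # Candidate span of one or more consecutive tokens.
            candidate = " ".join(tokens[start:end])
            if candidate in vocabulary:
                segments.append(candidate)
    return segments


# Hypothetical vocabulary for illustration.
vocabulary = {"winter", "men's", "sports", "sports suit", "suit"}
segments = full_mode_segments("winter men's sports suit", vocabulary)
```

Because spans are enumerated by start position and then by length, the output is already in title order, so the index of each entry can serve directly as its rank value.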

Step S1200, calculating the data distance between the semantic feature vector of each word segment and the semantic feature vector of the commodity title, and taking the data distance as the similarity score of the word segment:

To determine the degree of semantic correlation between each word segment in the word segment sequence and the commodity title, the deep semantic information of each word segment and of the commodity title can be extracted in any feasible feature extraction manner to obtain the corresponding semantic feature vectors, thereby realizing representation learning of the semantic information of each word segment and the commodity title. Feature extraction is generally recommended to be performed with a deep learning model: a preselected deep learning model is trained to convergence, so that it is able to perform representation learning on the embedded vector of a given text and obtain the high-level semantic information of the text.

The participles and the commodity titles can adopt the same characteristic extraction mode to determine the corresponding semantic characteristic vectors. After the semantic feature vectors are obtained, a preset data distance algorithm can be adopted to calculate the data distance between the semantic feature vector of each participle and the semantic feature vector of the commodity title. The data distance algorithm includes, but is not limited to, a cosine similarity algorithm, an euclidean distance algorithm, a pearson correlation coefficient algorithm, a minkowski distance algorithm, a mahalanobis distance algorithm, a jaccard coefficient algorithm, etc., and one skilled in the art may optionally determine any data distance algorithm to implement, as long as the distance between the data and the point can be calculated.

After the data distance is determined, for convenience of calculation, in an alternative embodiment, the data distance may be normalized to a numerical space such as [0,1], so that the larger the numerical value is, the closer the data distance is, and thus the higher the possibility that the word segmentation becomes a product word is.

Therefore, the data distance between each participle and the commodity title represents the semantic association degree of the participle and the commodity title, so that the data distance can be determined as the similarity score of the participle, and the effective representation of the association degree of the participle and the commodity title from the semantic dimension is realized.
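As a minimal sketch of how a data distance can be turned into a similarity score in [0,1], the following example uses cosine similarity, one of the algorithms listed above. The function names and the rescaling (cos + 1) / 2 are illustrative assumptions and are not prescribed by the present application.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_u * norm_v)


def similarity_score(participle_vec, title_vec):
    """Rescale cosine similarity from [-1, 1] to [0, 1]; larger means closer."""
    return (cosine_similarity(participle_vec, title_vec) + 1.0) / 2.0
```

With this rescaling, identical directions score 1.0, orthogonal vectors score 0.5, and opposite directions score 0.0.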

Step S1300, quantitatively determining the ranking score of each participle that hits a lemma in a preset product dictionary, according to the ranking information of that participle in the participle sequence:

It should be noted that, as a variation of the present embodiment, steps S1300 and S1200 may be executed in either order or concurrently without affecting the implementation of the present application.

In order to measure the ranking score of each participle according to its position information in the commodity title, a preset product dictionary can be used to detect whether each participle hits a lemma stored in the product dictionary.

The product dictionary supports matching against all the participles of the present application so as to determine whether a participle may become a product word. To support this matching, a large number of lemmas are stored in the product dictionary in advance. A lemma is usually a higher-level concept describing a certain product word or a certain type of product word; at the data level it is usually a substring of the product word. For example, in a Chinese context, "clothing" can be regarded as a lemma of product words such as "sportswear", "jacket", and "shirt"; "greeting card" can likewise be regarded as a lemma of product words such as "birthday greeting card" and "enterprise greeting card". As another example, in an English context, "suit" can be regarded as a lemma of "sweatsuit". It can thus be seen that, in general, for a lemma stored in the product dictionary, when a participle completely contains the lemma, the participle matches the lemma, so the participle can be regarded as hitting the lemma, that is, hitting the product dictionary.
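The containment rule described above can be sketched as a plain substring test. The function name and the sample lemmas below are hypothetical and serve only as an illustration.

```python
def hits_dictionary(participle, lemmas):
    """Return True when the participle completely contains at least one lemma."""
    # Empty strings are skipped because every string contains the empty string.
    return any(lemma and lemma in participle for lemma in lemmas)


# Hypothetical lemmas for illustration only.
product_dictionary = {"greeting card", "suit"}
```

For example, `hits_dictionary("birthday greeting card", product_dictionary)` is true because the participle contains the lemma "greeting card", while a participle such as "winter" misses the dictionary.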

The lemmas can be acquired by a person skilled in the art using word segmentation statistics, manual screening, or a combination of both, without affecting the embodiment of the inventive spirit of the present application. Generally, a person skilled in the art can predetermine the corresponding product dictionary and the lemmas included therein according to the commodity categories required by the e-commerce platform.

A participle that hits a lemma of the product dictionary may be the product word corresponding to the commodity title, whereas a participle that hits no lemma of the product dictionary is regarded as having no possibility of being the product word; such a participle can therefore be ignored or, if it is still considered, it obtains no ranking score in this step.

The participle sequence is organized in the order in which the participles appear in the commodity title, so the ranking value of each participle in the participle sequence represents its position information, and the magnitude of that ranking value corresponds to the possibility that the participle is the product word.

More specifically, according to the expression habits that human language reflects in commodity titles, attributive modifiers mostly come first and the modified word comes later. For example, in the commodity title "winter mature men's colorful sports suit classic style", the participle "sports suit" is placed relatively late, so "sports suit" is more likely to be the product word than the preceding participle "sports"; similarly, "suit" is more likely to be the product word than "sports" and "sports suit" because it lies further back. The position dimension therefore provides a certain degree of reference information about the possibility that a participle is the product word, and the ranking score can be determined from the position at which each participle appears in the participle sequence, that is, from its ranking value. In one embodiment, the ranking value may be used directly as the ranking score of the corresponding participle; in a modified embodiment, a preset weight may be associated with the ranking value to constrain the contribution of the position information. The ranking value may be the subscript of the array used to represent the participle sequence, for example 0, 1, 2, 3, 4, and so on. In this way the ranking score of each participle that hits the product dictionary is quantified.

Similarly, to make the calculation intuitive, the ranking score can be normalized to the numerical space [0,1], such that a larger value indicates a higher possibility that the corresponding participle is the product word.
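The following sketch illustrates one possible way to derive a normalized ranking score from the array subscript. Dividing by the largest subscript is an assumed normalization chosen for illustration; the present application does not fix a particular formula, and the function name is hypothetical.

```python
def normalized_ranking_scores(participles, lemmas):
    """Return {subscript: ranking score in [0, 1]} for participles that hit the dictionary."""
    last = len(participles) - 1
    scores = {}
    for index, participle in enumerate(participles):
        # A participle hits the dictionary when it contains at least one lemma.
        if any(lemma and lemma in participle for lemma in lemmas):
            # Assumed normalization: subscript divided by the largest subscript.
            scores[index] = index / last if last > 0 else 1.0
    return scores
```

The result is keyed by subscript rather than by text, so a participle that appears twice in the sequence keeps a separate score for each position.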

Step S1400, outputting the participle with the highest comprehensive score as the product word of the commodity title, wherein the comprehensive score is the sum of the similarity score and the ranking score of the corresponding participle:

After the above steps, each participle in the participle sequence that hits the product dictionary has a similarity score and a ranking score, and the two are added to obtain the comprehensive score of the participle; participles that do not hit the product dictionary can be disregarded. The comprehensive score jointly represents the semantic similarity between the participle and the commodity title and the positional importance of the participle in the participle sequence, and can be used to measure the possibility that the participle is the product word of the commodity title. Therefore, the participle with the highest comprehensive score can be selected from the participles hitting the product dictionary as the product word of the commodity title, and is then output to the downstream task that submitted the commodity title.
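The selection in step S1400 can be sketched as follows. The data layout is an assumption made for illustration: `similarity` is a list of similarity scores aligned with the participle sequence, and `ranking` maps the subscripts of the participles that hit the product dictionary to their ranking scores.

```python
def pick_product_word(participles, similarity, ranking):
    """Return the hit participle whose similarity + ranking sum is highest."""
    if not ranking:
        return None  # no participle hit the product dictionary
    # Only subscripts present in `ranking` (dictionary hits) are candidates.
    best_index = max(ranking, key=lambda i: similarity[i] + ranking[i])
    return participles[best_index]
```

For instance, with participles "sports", "sports suit", "suit", similarity scores 0.6, 0.9, 0.8, and ranking scores 0.5 and 1.0 for the last two, the comprehensive scores are 1.4 and 1.8, so "suit" is selected.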

Through the exemplary embodiment and the modified embodiments thereof, it can be seen that, compared with the prior art, the technical solution of the present application at least includes the following technical advantages:

Referring to fig. 2, in a further embodiment, step S1100 of performing word segmentation processing on the commodity title to obtain a plurality of ordered participles forming a participle sequence includes the following steps:

step S1110, acquiring a title of a commodity submitted by a user:

For downstream tasks such as commodity search and commodity advertisement placement, the user of a terminal device can specify a target commodity, and the backend then takes the commodity title of that target commodity as the commodity title submitted by the user.

Step S1120, performing word segmentation on the title of the commodity by using a preset word segmentation algorithm to obtain a plurality of word segmentations:

As described above, the word segmentation algorithm may be any of various conventional algorithms. In this embodiment, it is recommended to segment the commodity title with a statistics-based N-Gram algorithm: a sliding window is preset to extract words from the commodity title, the sliding step of the window may be set to a single character, and the window size may be set to 2, 3, or 4 characters, and so on, which a person skilled in the art may implement flexibly. For example, for the exemplary commodity title "winter mature men's colorful sports suit classic style", extracting words with sliding windows of up to four characters yields the following participle set:

{ winter; ripening; a mature male; mature men's clothing; male maturity; the men are ripe; a male garment; five colors; sports; sports suit; suit; classic; classic (a) }
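The sliding-window extraction can be sketched as below. This is only the candidate-generation part: it enumerates every window of the configured sizes in order of appearance, and the statistical filtering that an N-Gram segmenter would apply afterwards is omitted. The function name and default parameters are assumptions for illustration.

```python
def ngram_candidates(title, sizes=(2, 3, 4), step=1):
    """Slide windows of the given sizes over the title, `step` characters at a time."""
    candidates = []
    for start in range(0, len(title), step):
        for size in sizes:
            end = start + size
            if end <= len(title):
                # Candidates are appended in order of their start position.
                candidates.append(title[start:end])
    return candidates
```

For a Chinese title each character is one unit, so a window of size 2 over a four-character string yields three two-character candidates.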

Step S1130, according to the sequence of the multiple participles in the commodity title, constructing the multiple participles into a participle sequence, and representing the ordering information of the participles according to the ordering values of the participles in the participle sequence:

The participle set obtained in the previous step is further represented as an array, that is, converted and stored as a participle sequence. The subscripts of the array are organized from small to large, so each subscript can serve as the ranking value of the corresponding participle and represent its ranking information.

In this embodiment, the commodity title of a target commodity is obtained in response to the commodity search service and segmented to obtain the participle sequence; in combination with any other embodiment of the present application, the product word of the commodity title can then be determined and output to the commodity search service, which executes the commodity search based on the product word. The word segmentation algorithm can be chosen flexibly, involves little computation, segments accurately, and is particularly suitable for commodity titles, an application scenario in which word meanings are relatively discrete.

Referring to fig. 3, in a further embodiment, step S1200 of calculating the data distance between the semantic feature vector of each participle and the semantic feature vector of the commodity title and taking the data distance as the similarity score of each participle includes the following steps:

step S1210, performing word embedding on each participle and the commodity title respectively to obtain an embedded vector corresponding to each participle and the commodity title:

Each participle and the commodity title can be vectorized according to a preset vocabulary to construct their corresponding embedded vectors. In a preferred alternative embodiment, this can be implemented with a Word2Vec model that has been trained to convergence on a sufficient number of samples, which a person skilled in the art can implement flexibly in accordance with the principles disclosed herein.

Step S1220, using the text feature extraction model trained to the convergence state to respectively perform representation learning on the embedded vectors corresponding to the participles and the commodity titles, and obtaining corresponding semantic feature vectors:

In order to extract the high-level semantics from the embedded vectors of the participles and the commodity title, a text feature extraction model can be prepared. The text feature extraction model can be implemented on the basis of a basic neural network model such as LSTM, BERT, Transformer, or TextCNN, and a person skilled in the art can train it to a convergence state with sufficient training samples so that it learns to extract deep semantic vectors from the input embedded vectors.

Accordingly, the text feature extraction model is adopted to respectively perform representation learning on the embedded vector of each participle and the embedded vector of the commodity title, and semantic feature vectors of each participle and the semantic feature vectors of the commodity title are correspondingly obtained.

Generally, the semantic feature vectors are uniformly mapped into high-dimensional vectors of the same dimension for the convenience of subsequent processing.

Step S1230, calculating a data distance between the semantic feature vector of each participle and the semantic feature vector of the commodity title, and taking the data distance as a similarity score of the participle:

As mentioned above, any known data distance algorithm, such as the Euclidean distance algorithm or the cosine similarity algorithm, can be selected to calculate the similarity between the semantic feature vector of each participle and the semantic feature vector of the commodity title and obtain the corresponding data distance. To give the values a uniform meaning, the data distance can be normalized to the numerical space [0,1], such that a lower value indicates that the two semantic feature vectors are semantically less similar and a higher value indicates that they are semantically closer. The data distance determined for each participle can thus be used as its similarity score, representing the semantic similarity between the participle and the commodity title and quantifying the semantic closeness between them.
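When the Euclidean distance is chosen, a smaller distance means a closer match, so the distance has to be inverted before it can serve as a similarity score. The sketch below uses the mapping 1 / (1 + d), which is one common choice and an assumption here; the present application does not specify the normalization formula.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_to_similarity(distance):
    """Map a distance in [0, inf) to (0, 1]; identical vectors give 1.0."""
    return 1.0 / (1.0 + distance)
```

Under this mapping the score decreases monotonically as the two semantic feature vectors move apart, which matches the convention that a higher value means semantically closer.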

This embodiment exemplarily discloses a process for determining the similarity score of each participle: after each participle and the commodity title are vectorized, the similarity score is quantitatively determined from the semantic closeness between them. Because high-level semantic information is abstracted for both in this process, the similarity score obtained from the data distance between the participle and the commodity title is more accurate and provides effective reference information for deciding the product word.

Referring to fig. 4, in a further embodiment, step S1300 of quantitatively determining the ranking score of each participle that hits a lemma in the preset product dictionary, according to the ranking information of that participle in the participle sequence, includes the following steps:

step S1310, determining the commodity classification corresponding to the commodity title according to the semantic feature vector of the commodity title:

In this embodiment, the product dictionary may be set according to the commodity classifications in the classification system of the e-commerce platform. If the classification system includes multiple levels, a product dictionary is generally set only for each classification of one of the higher levels; it is generally recommended to set the product dictionaries according to the commodity classifications of the highest level and to collect the lemmas under each commodity classification in advance to construct its product dictionary.

Accordingly, before the ranking score is determined, the commodity classification to which the commodity corresponding to the commodity title belongs needs to be determined. A separate preset commodity classification model can be used to perform classification mapping on the semantic feature vector of the commodity title and obtain its commodity classification. The commodity classification model can use BERT, TextCNN, or the like as the basic network model, followed by a multi-class classifier whose classification space contains the full set of commodity classifications to be identified. A person skilled in the art performs supervised training of the commodity classification model with a sufficient number of training samples until it reaches a convergence state, so that it performs representation learning on the embedded vector of the input commodity title to obtain the corresponding semantic feature vector and performs classification mapping on that semantic feature vector to obtain the corresponding commodity classification.

Step S1320, detecting whether each segmented word includes at least one lemma in a product dictionary preset corresponding to the commodity classification, and when the segmented word includes the lemma, determining that the segmented word is an optional segmented word that hits the lemma in the product dictionary:

After the commodity classification is determined, the product dictionary corresponding to that commodity classification can be called. Each participle in the participle sequence is then checked one by one for whether it contains at least one lemma of the product dictionary. When a participle is found to contain a lemma of the product dictionary, it can be determined that the participle hits the lemma and therefore hits the product dictionary; the participle then becomes an optional participle in this embodiment, that is, a participle that may become the product word. A subset of optional participles is thereby screened out of the participle sequence.
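Steps S1310 and S1320 can be sketched as a dictionary lookup keyed by commodity classification followed by the containment test. The classification is assumed to have been predicted already by the commodity classification model; the mapping `dictionaries`, the category names, and the function name are hypothetical.

```python
def optional_participles(participles, category, dictionaries):
    """Return (subscript, participle) pairs that hit the dictionary of `category`."""
    # An unknown category yields an empty dictionary, hence no optional participles.
    lemmas = dictionaries.get(category, set())
    return [
        (index, participle)
        for index, participle in enumerate(participles)
        if any(lemma and lemma in participle for lemma in lemmas)
    ]
```

Keeping the subscript alongside each optional participle preserves the ranking value needed in step S1330.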

Step S1330, determining the ranking value of the optional participle in the participle sequence, and taking the ranking value weighted by a preset weight as the ranking score of the optional participle:

Further, the ranking score of each optional participle can be determined from its ranking value in the participle sequence, for example the subscript of the optional participle in the array representing the participle sequence.

When determining the ranking score of an optional participle, in order to adjust the balance between the position information and the semantic information of the participle in the comprehensive score of the present application, a preset weight can be associated with the ranking value. The weight can be set flexibly by a person skilled in the art according to business requirements, which provides an adjustment mechanism, and the product of the preset weight and the ranking value is then used as the ranking score of the optional participle. In some alternative embodiments, a preset weight can likewise be associated with the similarity score of the embodiments of the present application. In a further modified embodiment, when the comprehensive score of a participle is calculated, a preset weight is applied between the ranking score and the similarity score as a hyper-parameter to balance the two, so that the final result is normalized to the numerical space [0,1]; this is essentially similar to the above and is an equivalent alternative to the embodiments provided by the present application.
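The two weighting variants above can be sketched as follows. The first multiplies the ranking value by a preset weight; the second blends the two scores with a single hyper-parameter, written here as a convex combination so that the result stays in [0,1] whenever both inputs do. The names `weight` and `alpha` and the convex form are illustrative assumptions.

```python
def weighted_ranking_score(ranking_value, weight):
    """Ranking score as the product of the preset weight and the ranking value."""
    return weight * ranking_value


def blended_score(similarity, ranking, alpha=0.5):
    """Convex combination of the two scores; alpha is the assumed hyper-parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # alpha = 1 uses only semantics, alpha = 0 uses only position.
    return alpha * similarity + (1.0 - alpha) * ranking
```

Raising `alpha` gives the semantic similarity more influence, while lowering it favors participles that appear later in the title.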

This embodiment allows a product dictionary to be set for each commodity classification when the e-commerce platform has multiple commodity classifications, which facilitates specialized construction of the product dictionaries. The commodity classification is determined from the given commodity title, and the corresponding product dictionary is then called to calculate the ranking score. This copes with the complex commodity classifications of an e-commerce platform and, by using a specialized and accurate product dictionary for the relevant commodity classification, makes the calculated ranking scores of the participles of the commodity title more accurate and representative.

Referring to fig. 5, in a further embodiment, the step S1400 of outputting the participle with the highest comprehensive score as the product word of the product title includes the following steps:

step S1410, calculating a sum of the similarity score and the ranking score of each participle hitting the lemma of the product dictionary, and obtaining a comprehensive score of the participle:

as described above, since a participle that hits a lemma of the product dictionary has a possibility of becoming a product word, a comprehensive score thereof can be calculated in a targeted manner, and a participle that misses any lemma of the product dictionary can be disregarded. Even if considered, it is expected that the overall score obtained by a participle that misses the product dictionary will be lower than other participles that hit the product dictionary, and thus less likely to interfere with a preferred decision on a product word, since it lacks at least the rank score.

In order to select the most representative participle as the product word from the participles that hit lemmas of the product dictionary, a comprehensive score is calculated for each such participle, namely the sum of its similarity score and its ranking score. As described in the foregoing embodiments, the similarity score or the ranking score may be given a preset weight, or a hyper-parameter may be used as a weight to balance the similarity score and the ranking score.

Step S1420, sorting the participles hitting the product dictionary in descending order of comprehensive score, and determining the first participle as the product word of the commodity title:

In order to select the product word of the commodity title, the participles hitting the product dictionary can be sorted in descending order of comprehensive score, so that the earlier a participle is ranked, the higher its comprehensive score; the participle ranked first can therefore be determined as the unique product word of the commodity title.

Step S1430, outputting the product words:

The product word can then be returned to the downstream task that submitted the commodity title, and the downstream task continues to execute its specific service, such as commodity search or commodity matching, according to the product word.

This embodiment further discloses a process for deciding the unique product word of the commodity title from the similarity scores and ranking scores of the participles. The process involves little computation, is efficient and direct, and occupies very few system resources.

In another embodiment of the present application, the similarity score can be calculated, in the manner disclosed in the previous embodiments, only for the participles that hit the product dictionary rather than for every participle, thereby saving system overhead. Where product dictionaries are set for multiple commodity classifications, the participles that hit the product dictionary of the corresponding commodity classification can be found with reference to steps S1310 and S1320, and the similarity score is calculated only for those participles, without considering participles that do not hit the product dictionary.

In another equivalent embodiment of the present application, a comprehensive score may be calculated for every participle in the participle sequence. For a participle that misses the product dictionary, the ranking score is 0, but if its similarity score is particularly high it may still obtain the highest comprehensive score; accordingly, a participle that misses the product dictionary may also be determined as the product word of the commodity title. Because such a product word is new relative to the product dictionary, it can be stored in the product dictionary to expand it, thereby improving the product word recognition capability that the product dictionary supports.
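This variant can be sketched as below: every participle receives a comprehensive score, a miss contributes a ranking score of 0, and a winning miss is added to the dictionary. The position normalization (subscript divided by the largest subscript) and the function name are assumptions for illustration, and `lemmas` is taken to be a mutable set.

```python
def pick_and_expand(participles, similarity, lemmas):
    """Score every participle; add the winner to `lemmas` if it was a dictionary miss."""
    last = len(participles) - 1
    best_index, best_score, best_hit = None, float("-inf"), False
    for index, participle in enumerate(participles):
        hit = any(lemma and lemma in participle for lemma in lemmas)
        position = index / last if last > 0 else 1.0
        ranking = position if hit else 0.0  # a miss gets ranking score 0
        score = similarity[index] + ranking
        if score > best_score:  # strict comparison keeps the earliest on ties
            best_index, best_score, best_hit = index, score, hit
    if best_index is None:
        return None
    product_word = participles[best_index]
    if not best_hit:
        lemmas.add(product_word)  # expand the dictionary with the new word
    return product_word
```

In practice a similarity threshold or manual review before expanding the dictionary may be advisable, since a wrongly admitted entry would influence later titles.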

In some expanded embodiments, before step S1300 of quantitatively determining the ranking score of each participle that hits a lemma in the preset product dictionary according to the ranking information of that participle in the participle sequence, the method includes the following step: extracting a plurality of lemmas from the pre-collected product words corresponding to each commodity classification, and storing them to construct the product dictionary of the corresponding commodity classification:

When the product dictionary is constructed in advance, a large number of product words are collected according to the commodity classification corresponding to the product dictionary; these product words can be extracted from a pre-collected set of commodity titles by means of other existing product word extraction techniques. Lemmas are then extracted from the product words: the occurrence frequency of fine-grained characters and words in the collected product words is counted first, and the characters and words whose occurrence frequency is higher than a preset threshold are then determined as lemmas and stored in the product dictionary, thereby completing the construction of the product dictionary.
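The frequency-based construction can be sketched as follows for whitespace-separated product words. Treating runs of one or two tokens as the fine-grained units, and counting each unit once per product word, are assumptions made for illustration; for Chinese product words the units would be characters or character n-grams instead.

```python
from collections import Counter


def build_product_dictionary(product_words, threshold, max_tokens=2):
    """Keep the token n-grams that occur in more than `threshold` product words."""
    counts = Counter()
    for product_word in product_words:
        tokens = product_word.split()
        units = set()  # count each unit at most once per product word
        for size in range(1, max_tokens + 1):
            for start in range(len(tokens) - size + 1):
                units.add(" ".join(tokens[start:start + size]))
        counts.update(units)
    return {unit for unit, count in counts.items() if count > threshold}
```

With the product words "birthday greeting card", "enterprise greeting card", "sports suit", and "track suit" and a threshold of 1, the units "greeting", "card", "greeting card", and "suit" are kept as lemmas.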

According to this technical scheme, although the preliminarily constructed product dictionary cannot be used directly to match the unique product word of a commodity title, using it together with the embodiments of the present application enables effective inference and decision of the product word of the commodity title at low cost.

Referring to fig. 6, in an expanded embodiment, after the step S1400 of outputting the participle with the highest comprehensive score as the product word of the product title, the method further includes the following steps:

Step S1500, retrieving from a commodity database, according to the product word of the commodity title, target commodities whose product words are identical or semantically similar to that product word:

For an independent site, the commodity database corresponding to its online shop stores the commodity information of each commodity on the site, including the commodity title of the commodity. The product word corresponding to each commodity title can be determined by any of the above embodiments of the present application and stored as part of the commodity information, mapped in association with the commodity title.

Accordingly, a given commodity title may be information submitted by a user, for example a search string submitted by a consumer when searching for commodities, or a commodity title submitted by an advertising copy editor or a merchant. After the product word of this commodity title is determined according to any of the above embodiments of the present application, target commodities whose product words are identical or semantically similar to it can be determined by rule matching or semantic matching and serve as the retrieval result for the user's task.
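The retrieval step can be sketched as a scan over stored commodity records. The record layout (dictionaries with `title` and `product_word` keys), the optional `similarity` callable, and the threshold value are hypothetical; a real commodity database would typically use an index rather than a linear scan.

```python
def retrieve_targets(product_word, catalog, similarity=None, threshold=0.8):
    """Return records whose stored product word matches exactly or semantically."""
    targets = []
    for item in catalog:
        stored = item["product_word"]
        if stored == product_word:
            targets.append(item)  # rule matching: identical product word
        elif similarity is not None and similarity(stored, product_word) >= threshold:
            targets.append(item)  # semantic matching via the supplied scorer
    return targets
```

Passing no `similarity` function restricts the search to exact matches, while supplying one (for example a scorer built on the semantic feature vectors) also admits semantically similar product words.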

Step S1600, pushing the commodity information of the target commodity to the terminal equipment submitting the commodity title:

Further, the commodity information of the target commodity, including but not limited to the commodity title, commodity picture, and commodity price, is obtained and pushed to the terminal device of the corresponding user for reference or access. For example, a user performing a commodity search can access the corresponding target commodity directly through the provided commodity information; advertising copy editors and merchants can refer to the commodity titles in the provided commodity information for further editing, and so on, making flexible use of the information.

This embodiment further enriches the examples of different downstream tasks served by the product words determined by the present application. The technical scheme of the present application thus improves the basic service capability of the e-commerce platform and meets the needs of different parties such as consumers, merchants, and advertising copy editors, thereby enriching and improving the service experience of the e-commerce platform.

In another embodiment of the present application, corresponding product words may be determined for the product titles of each product in the product database of the independent site, and then clustering may be performed according to the product words, so as to realize aggregation of the product information.

Referring to fig. 7, a product word processing apparatus adapted to one of the purposes of the present application is a functional implementation of the product word processing method of the present application. The apparatus includes a word segmentation processing module 1100, a similarity score module 1200, a ranking score module 1300, and a word determination module 1400, where: the word segmentation processing module 1100 is configured to perform word segmentation processing on the commodity title to obtain a plurality of ordered participles forming a participle sequence; the similarity score module 1200 is configured to calculate the data distance between the semantic feature vector of each participle and the semantic feature vector of the commodity title and take it as the similarity score of the participle; the ranking score module 1300 is configured to quantitatively determine the ranking score of each participle that hits a lemma in a preset product dictionary, according to the ranking information of that participle in the participle sequence; the word determination module 1400 is configured to output the participle with the highest comprehensive score as the product word of the commodity title, where the comprehensive score is the sum of the similarity score and the ranking score of the corresponding participle.

In a further embodiment, the word segmentation processing module 1100 includes: the title acquisition unit is used for acquiring a commodity title submitted by a user; the word segmentation execution unit is used for segmenting the commodity title by adopting a preset word segmentation algorithm to obtain a plurality of words; and the sequencing representation unit is used for constructing the multiple participles into a participle sequence according to the sequence of the multiple participles in the commodity title, and representing the sequencing information of the participles through the sequencing values of the participles in the participle sequence.

In some further embodiments, the similarity score module 1200 includes: an encoding processing unit configured to perform word embedding on each participle and the commodity title respectively to obtain their corresponding embedded vectors; a representation learning unit configured to use a text feature extraction model trained to a convergence state to perform representation learning on the embedded vectors of each participle and the commodity title respectively to obtain the corresponding semantic feature vectors; and a distance calculation unit configured to calculate the data distance between the semantic feature vector of each participle and the semantic feature vector of the commodity title and take the data distance as the similarity score of the participle.

In some further embodiments, the ranking score module 1300 includes: a classification mapping unit, configured to determine the commodity classification corresponding to the commodity title according to the semantic feature vector of the commodity title; a participle hit unit, configured to detect whether each participle contains at least one lemma in the product dictionary preset for that commodity classification, and, when a participle contains such a lemma, to determine the participle as a selectable participle that hits a lemma in the product dictionary; and a ranking score unit, configured to determine the ranking value of the selectable participle in the participle sequence, and to set the ranking value, associated with a preset weight, as the ranking score corresponding to the selectable participle.
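The hit detection and ranking score described above might be sketched as follows. The sketch assumes that the commodity classification has already been determined and is passed in as a string, that a participle "contains" a lemma when the lemma is a substring of it, and that associating the ranking value with the preset weight means multiplying the two; all three are assumptions, as are the names `ranking_scores`, `dictionaries` and `weight`.

```python
from typing import Dict, List, Set, Tuple


def ranking_scores(
    sequence: List[Tuple[int, str]],
    category: str,
    dictionaries: Dict[str, Set[str]],
    weight: float = 0.5,
) -> Dict[str, float]:
    """Return ranking scores for participles that hit the category dictionary.

    sequence holds (ranking value, participle) pairs in title order.
    """
    lemmas = dictionaries.get(category, set())
    scores: Dict[str, float] = {}
    for rank, participle in sequence:
        # Assumption: a hit means some lemma is a substring of the participle.
        if any(lemma in participle for lemma in lemmas):
            # Assumption: the preset weight multiplies the ranking value.
            # A repeated participle keeps the score of its last occurrence.
            scores[participle] = rank * weight
    return scores


dictionaries = {"apparel": {"dress"}}
sequence = [(1, "red"), (2, "summer"), (3, "sundress"), (4, "dress")]
print(ranking_scores(sequence, "apparel", dictionaries))
```

Whether a later position should raise or lower the score depends on the sign and size of the preset weight, which this sketch leaves to the caller.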

In some embodiments, the word determining module 1400 includes: a score integration unit, configured to calculate the sum of the similarity score and the ranking score of each participle that hits a lemma of the product dictionary, to obtain the comprehensive score of that participle; a ranking optimization unit, configured to sort the participles that hit the product dictionary in descending order of comprehensive score and to determine the first participle as the product word of the commodity title; and a result output unit, configured to output the product word.
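A minimal sketch of the score integration and selection described above is given below. It assumes that the similarity scores and ranking scores are supplied as dictionaries keyed by participle, and that a title with no dictionary hit yields no product word; the function name `pick_product_word` is hypothetical.

```python
from typing import Dict, Optional


def pick_product_word(
    similarity: Dict[str, float],
    ranking: Dict[str, float],
) -> Optional[str]:
    """Return the dictionary-hitting participle with the highest comprehensive score."""
    # Only participles with a ranking score (i.e. dictionary hits) compete.
    comprehensive = {
        participle: similarity.get(participle, 0.0) + rank_score
        for participle, rank_score in ranking.items()
    }
    if not comprehensive:
        return None  # Assumption: no hit means no product word is output.
    # Descending sort by comprehensive score; the first entry is the product word.
    ordered = sorted(comprehensive.items(), key=lambda item: item[1], reverse=True)
    return ordered[0][0]


similarity = {"red": 0.2, "summer": 0.3, "sundress": 0.8, "dress": 0.9}
ranking = {"sundress": 0.3, "dress": 0.4}
print(pick_product_word(similarity, ranking))
```

In this toy input "dress" totals 1.3 against 1.1 for "sundress", so "dress" is selected.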

In an expanded embodiment, the product word processing apparatus of the present application further includes a dictionary construction module that operates before the ranking score module 1300, and is configured to extract a plurality of lemmas from the pre-collected product words corresponding to each commodity category, and to store the lemmas so as to construct the product dictionary of the corresponding commodity category.
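The dictionary construction described above could be sketched as follows, assuming that the pre-collected product words are available as lists grouped by commodity category and that lemmas are obtained by lowercasing and splitting on whitespace; the actual lemma extraction is not specified in the text, and the names used here are hypothetical.

```python
from typing import Dict, Iterable, Set


def build_product_dictionaries(
    product_words_by_category: Dict[str, Iterable[str]],
) -> Dict[str, Set[str]]:
    """Build one lemma set (product dictionary) per commodity category."""
    dictionaries: Dict[str, Set[str]] = {}
    for category, product_words in product_words_by_category.items():
        lemmas: Set[str] = set()
        for product_word in product_words:
            # Assumption: whitespace splitting stands in for lemma extraction.
            lemmas.update(product_word.lower().split())
        dictionaries[category] = lemmas
    return dictionaries


collected = {
    "apparel": ["Maxi Dress", "dress shirt"],
    "electronics": ["Phone Case"],
}
print(build_product_dictionaries(collected))
```

Storing the lemmas as a set per category keeps the later hit test cheap and removes duplicates that recur across product words.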

In some expanded embodiments, the product word processing apparatus of the present application further includes the following modules, which operate after the word determining module 1400: a retrieval execution module, configured to retrieve from the commodity database, according to the product word of the commodity title, target commodities whose product words are identical or similar to that product word; and a commodity pushing module, configured to push the commodity information of the target commodities to the terminal device that submitted the commodity title.
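For illustration only, the retrieval step might look like the sketch below. It assumes an in-memory list of records in place of the commodity database, and it uses the character-level ratio from Python's standard `difflib.SequenceMatcher` with a hypothetical threshold as the measure of "similar"; the text does not say how similarity between product words is judged. Pushing the results to the terminal device is outside the sketch.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def retrieve_targets(
    product_word: str,
    database: List[Dict[str, object]],
    threshold: float = 0.8,
) -> List[Dict[str, object]]:
    """Return records whose product word equals or resembles the query word."""
    targets = []
    for record in database:
        candidate = str(record["product_word"])
        # Assumption: a string similarity ratio at or above the threshold
        # counts as "similar"; identical words have ratio 1.0.
        ratio = SequenceMatcher(None, product_word, candidate).ratio()
        if ratio >= threshold:
            targets.append(record)
    return targets


database = [
    {"id": 1, "product_word": "dress"},
    {"id": 2, "product_word": "dresses"},
    {"id": 3, "product_word": "phone case"},
]
print([r["id"] for r in retrieve_targets("dress", database)])
```

A production system would more plausibly query an index or compare semantic feature vectors; the linear scan here only shows the shape of the step.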

In order to solve the technical problem, an embodiment of the present application further provides a computer device, whose internal structure is schematically illustrated in fig. 8. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer-readable storage medium of the computer device stores an operating system, a database and computer-readable instructions; the database can store control information sequences, and the computer-readable instructions, when executed by the processor, enable the processor to implement a product word processing method. The processor of the computer device provides computing and control capability and supports the operation of the whole computer device. The memory of the computer device may store computer-readable instructions that, when executed by the processor, cause the processor to perform the product word processing method of the present application. The network interface of the computer device is used for connecting to and communicating with a terminal. Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In this embodiment, the processor is configured to execute specific functions of each module and its sub-module in fig. 7, and the memory stores program codes and various data required for executing the modules or the sub-modules. The network interface is used for data transmission to and from a user terminal or a server. The memory in this embodiment stores program codes and data required for executing all modules/sub-modules in the product word processing device of the present application, and the server can call the program codes and data of the server to execute the functions of all sub-modules.

The present application also provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the product word processing method of any of the embodiments of the present application.

The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when the computer program is executed, the processes of the embodiments of the methods can be included. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).

To sum up, the method and the device can conveniently, efficiently and accurately determine the corresponding product word from a given commodity title, and provide basic services for downstream tasks such as commodity search, commodity advertisement placement and commodity collection on the independent sites served by the E-commerce platform, so that the service experience of the E-commerce platform is improved.

Those of skill in the art will appreciate that the various operations, methods, steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or eliminated. Further, other steps, measures, or schemes in various operations, methods, or flows that have been discussed in this application can be alternated, altered, rearranged, broken down, combined, or deleted. Further, steps, measures, schemes in the prior art having various operations, methods, procedures disclosed in the present application may also be alternated, modified, rearranged, decomposed, combined, or deleted.

The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several modifications and refinements without departing from the principle of the present application, and these modifications and refinements should also be regarded as falling within the protection scope of the present application.

Claims

1. A product word processing method is characterized by comprising the following steps:

performing word segmentation processing on the commodity title to obtain a plurality of ordered words, and forming a word segmentation sequence;

2. The product word processing method according to claim 1, wherein the method of performing word segmentation on a title of a product to obtain a plurality of words to form a word segmentation sequence comprises the steps of:

acquiring a commodity title submitted by a user;

3. The product word processing method according to claim 1, wherein calculating a data distance between the semantic feature vector of each participle and the semantic feature vector of the commodity title as a similarity score of each participle, respectively, comprises the steps of:

adopting a text feature extraction model trained to a convergence state to respectively perform representation learning on the embedded vectors corresponding to the participles and the commodity titles to obtain corresponding semantic feature vectors;

4. The product word processing method according to claim 1, wherein the step of determining the ranking score of the participle hitting the lemma in the preset product dictionary quantitatively according to the ranking information of the participle in the participle sequence comprises the steps of:

and determining the sorting value of the selectable participle in the participle sequence, and setting the sorting value, associated with a preset weight, as the sorting score corresponding to the selectable participle.

5. The product word processing method according to claim 1, wherein outputting a word having a highest composite score as a product word of the title of the commodity comprises:

sorting all the participles hitting the product dictionary in descending order of comprehensive score, and determining that the first participle is the product word of the commodity title;

and outputting the product words.

6. The product word processing method according to any one of claims 1 to 5, wherein the step of ranking information in the segmentation sequence according to a segmentation that hits a lemma in a preset product dictionary, comprises the steps of:

7. The product word processing method according to any one of claims 1 to 5, wherein after the step of outputting the word with the highest composite score as the product word of the title of the commodity, the method further comprises the steps of:

8. A product word processing apparatus, comprising:

the word segmentation processing module is used for carrying out word segmentation processing on the commodity title to obtain a plurality of ordered word segmentations so as to form a word segmentation sequence;

the similarity score module is used for calculating the data distance between the semantic feature vector of each participle and the semantic feature vector of the commodity title, and correspondingly serving as the similarity score of each participle;

the sorting score module is used for quantitatively determining the sorting score of the participle in the participle sequence according to the sorting information of the participle hitting the lemma in the preset product dictionary;

and the word determining module is used for outputting the word with the highest comprehensive score as the product word of the commodity title, and the comprehensive score is the sum of the similarity score and the sequencing score of the corresponding word.

9. A computer device comprising a central processor and a memory, characterized in that the central processor is adapted to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 7, which, when invoked by a computer, performs the steps comprised by the corresponding method.