CN115470322B - Keyword generation system and method based on artificial intelligence - Google Patents

Publication number
CN115470322B
Authority
CN
China
Prior art keywords
similarity
data
commodity
value
search word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211294577.9A
Other languages
Chinese (zh)
Other versions
CN115470322A (en
Inventor
张飞
周南
刘奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Kuaiyun Technology Co ltd
Original Assignee
Shenzhen Kuaiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Kuaiyun Technology Co ltd filed Critical Shenzhen Kuaiyun Technology Co ltd
Priority to CN202211294577.9A priority Critical patent/CN115470322B/en
Publication of CN115470322A publication Critical patent/CN115470322A/en
Application granted granted Critical
Publication of CN115470322B publication Critical patent/CN115470322B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention provides a keyword generation system and method based on artificial intelligence. The method comprises the following steps: acquiring commodity description data, and extracting a first search word from the commodity description data; acquiring potential bid item data (data on potentially competing products) of the commodity according to the first search word; processing the potential bid data by using an image processing algorithm, and filtering out the data whose similarity is lower than a preset threshold value to obtain bid data; extracting bid title data from the bid data; extracting core commodity words from the bid title data; selecting, in combination with a preset search word data set, a first core commodity word whose frequency is higher than a preset frequency value from the core commodity words; and generating keywords corresponding to the commodity according to the first core commodity word in combination with keyword generation rules. With this scheme, bid data and market data can be collected automatically and intelligently and commodity keywords can be edited automatically, which greatly reduces manual operation and improves the efficiency of generating commodity texts.

Description

Keyword generation system and method based on artificial intelligence
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a keyword generation system and method based on artificial intelligence.
Background
With the rapid development of network technology, electronic commerce has grown considerably, and merchants often use e-commerce platforms to promote their own commodities. Advertisement keywords are core parameters of the advertisement delivery services that e-commerce platforms provide to merchants: a merchant sets advertisement keywords and delivery strategies for its commodities, and the platform displays those commodities, according to a certain strategy, to customers who search for the keywords. In the course of advertisement delivery, merchants want to generate more targeted advertisement keywords, so that customers searching with those keywords obtain well-matched commodities and the advertising effect improves.
However, merchants currently determine advertisement keywords by manually labeling keywords for the relevant products. As the number of commodity types grows, the workload of obtaining advertisement keywords increases, and manual labeling lowers the efficiency of keyword generation. In addition, keywords labeled only from the perspective of the merchant's own commodity cannot match a wider range of delivery scenarios, which reduces keyword accuracy.
Disclosure of Invention
In view of the above problems, the invention provides a keyword generation system and method based on artificial intelligence.
Accordingly, one aspect of the present invention provides an artificial intelligence based keyword generation system, including: an extraction module, a data processing module and a generation module;
the extraction module is configured to:
acquiring commodity description data, and extracting a first search word from the commodity description data;
acquiring potential bid item data of the commodity according to the first search word;
the data processing module is configured to:
processing the potential bid data by using an image processing algorithm, and filtering out the bid data with the similarity lower than a preset threshold value to obtain bid data;
extracting bid title data from the bid data;
extracting core commodity words from the bid title data;
selecting a first core commodity word with frequency higher than a preset frequency value from the core commodity words by combining a preset search word data set;
the generation module is configured to: and generating keywords corresponding to the commodity according to the first core commodity word and combining keyword generation rules.
Optionally, in the step of processing the potential bid data by using an image processing algorithm and filtering out the bid data with the similarity lower than a preset threshold to obtain the bid data, the data processing module is specifically configured to:
inputting the potential bid data, and marking a similarity identification value I as 0;
judging whether a first similarity value A1 of the potential bid data is larger than a first threshold value or not by using a first similarity judging model;
if the first similarity value A1 is larger than the first threshold value, judging whether a second similarity value A2 of the potential bid data is smaller than a second threshold value by using a second similarity judging model, and judging whether a third similarity value A3 of the potential bid data is smaller than a third threshold value by using a third similarity judging model;
if the second similarity value A2 is smaller than the second threshold value or the third similarity value A3 is smaller than the third threshold value, adding 1 to the similarity identification value I, and calculating a first similarity S1 by using a first similarity calculation method;
if the second similarity value A2 is not smaller than the second threshold value or the third similarity value A3 is not smaller than the third threshold value, calculating the first similarity S1 by using the first similarity calculation method;
the first similarity calculation method comprises the following steps: first similarity S1 = a1 × first similarity value A1 + a2 × second similarity value A2 + a3 × third similarity value A3 + b1 × similarity identification value I, wherein a1, a2, a3 and b1 are weight coefficients greater than 0 and a1 + a2 + a3 + b1 = 1;
if the first similarity value A1 is not greater than the first threshold value, processing the image data in the potential bidding data by using an image processing algorithm to obtain the potential bidding image data;
judging whether a fourth similarity value A4 of the potential bidding image data is smaller than a fourth threshold value by using a fourth similarity judging model, and judging whether a fifth similarity value A5 of the potential bidding image data is smaller than a fifth threshold value by using a fifth similarity judging model;
if the fourth similarity value A4 is not smaller than the fourth threshold value or the fifth similarity value A5 is not smaller than the fifth threshold value, adding 1 to the similarity identification value I, and calculating a second similarity S2 by using a second similarity calculation method;
if the fourth similarity value A4 is smaller than the fourth threshold value or the fifth similarity value A5 is smaller than the fifth threshold value, calculating the second similarity S2 by using the second similarity calculation method;
the second similarity calculation method comprises the following steps: second similarity S2 = a6 × first similarity value A1 + a4 × fourth similarity value A4 + a5 × fifth similarity value A5 + b2 × similarity identification value I, wherein a4, a5, a6 and b2 are weight coefficients greater than 0 and a4 + a5 + a6 + b2 = 1;
judging whether the first similarity S1 or the second similarity S2 is not smaller than the preset threshold; if yes, marking the potential bid data as similar, and if not, marking the potential bid data as dissimilar;
and extracting all data marked as similar in the potential bid data as the bid data.
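The branching procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the similarity values A1 to A5 are assumed to be already produced by the five judging models (which the text does not specify), the weights and thresholds are hypothetical, and where the listed conditions overlap the sketch treats the second condition of each pair as the "otherwise" case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    t1: float      # first threshold, compared with A1
    t2: float      # second threshold, compared with A2
    t3: float      # third threshold, compared with A3
    t4: float      # fourth threshold, compared with A4
    t5: float      # fifth threshold, compared with A5
    preset: float  # final cut-off applied to S1 or S2


# Hypothetical weights; each tuple is positive and sums to 1, as the text requires.
W1 = (0.4, 0.3, 0.2, 0.1)  # a1, a2, a3, b1
W2 = (0.3, 0.3, 0.3, 0.1)  # a4, a5, a6, b2


def score_candidate(a, th):
    """a maps "A1".."A5" to similarity values; returns (is_similar, score)."""
    i = 0  # similarity identification value I
    if a["A1"] > th.t1:
        if a["A2"] < th.t2 or a["A3"] < th.t3:
            i += 1
        a1, a2, a3, b1 = W1
        score = a1 * a["A1"] + a2 * a["A2"] + a3 * a["A3"] + b1 * i  # S1
    else:
        # Image branch: A4 and A5 are assumed to be derived from the image data.
        if a["A4"] >= th.t4 or a["A5"] >= th.t5:
            i += 1
        a4, a5, a6, b2 = W2
        score = a6 * a["A1"] + a4 * a["A4"] + a5 * a["A5"] + b2 * i  # S2
    return score >= th.preset, score


def filter_bid_data(candidates, th):
    """Keep only the candidates marked as similar."""
    return [c for c in candidates if score_candidate(c["scores"], th)[0]]
```

Each candidate here is a dictionary with a hypothetical "scores" entry; any other fields (title, images) pass through the filter untouched.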
Optionally, in the step of acquiring the commodity description data and extracting the first search term from the commodity description data, the extracting module is specifically configured to:
step one: classifying the commodity description data according to commodity names and commodity attributes, and performing text preprocessing on the classified commodity description data to generate candidate search word sequences;
step two: extracting feature data of the candidate search word sequence, and labeling the feature data to obtain a labeled sample set and a non-labeled sample set;
step three: using the labeling sample set as a training set, and training a search word classification model by using a neural network;
step four: classifying and predicting candidate search words in the unlabeled sample set by using the trained search word classification model, and calculating the matching degree of each unlabeled sample;
step five: selecting the unlabeled samples whose matching degree exceeds a preset matching degree value, adding them to the training set, and retraining the search word classification model;
step six: repeating step four and step five until the proportion of unlabeled samples whose matching degree is higher than the preset matching degree value exceeds a preset proportion, so as to obtain a final search word classification model;
step seven: and inputting the characteristic data of the commodity description data into the final search word classification model for processing, and extracting the first search word from the processing result.
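Steps three to six describe a self-training loop, which can be sketched as below. This is only an illustration: the text trains a neural network, whereas the sketch swaps in a tiny nearest-centroid classifier so that it stays self-contained. The "matching degree" formula, the removal of accepted samples from the unlabeled pool, the early stop when no sample qualifies, and the max_rounds guard are all assumptions.

```python
import math


def train(samples):
    """samples: list of (features, label); returns {label: centroid}."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(model, x):
    """Return (label, matching_degree); needs at least two classes."""
    dists = sorted((math.dist(x, c), y) for y, c in model.items())
    (d_best, label), (d_next, _) = dists[0], dists[1]
    total = d_best + d_next
    # Assumed matching degree: 0.5 on the class boundary, 1.0 on a centroid.
    degree = 1.0 if total == 0 else d_next / total
    return label, degree


def self_train(labeled, unlabeled, min_degree=0.8, min_ratio=0.9, max_rounds=10):
    train_set, pool = list(labeled), list(unlabeled)
    model = train(train_set)                              # step three
    for _ in range(max_rounds):
        if not pool:
            break
        scored = [(x,) + predict(model, x) for x in pool]  # step four
        confident = [(x, y) for x, y, d in scored if d > min_degree]
        if not confident:
            break  # nothing qualifies, so another round would change nothing
        train_set.extend(confident)                        # step five
        model = train(train_set)
        ratio = len(confident) / len(pool)
        pool = [x for x, _, d in scored if d <= min_degree]
        if ratio > min_ratio:                              # step six
            break
    return model
```

Step seven then amounts to calling predict on the feature data of new commodity description data and keeping the candidates classified as search words.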
Optionally, in step one (classifying the commodity description data according to commodity names and commodity attributes, and performing text preprocessing on the classified commodity description data to generate candidate search word sequences), the extraction module is specifically configured to:
extracting text data from the commodity description data;
counting and numbering all sentences in the text data;
dividing the sentence into a plurality of words, and recording the position information of the words in the sentence;
analyzing and marking the part of speech of the words;
deleting a first word with a preset part of speech from the words to obtain a modified word set;
performing de-duplication operation on the modified word set to obtain a candidate word set;
classifying the candidate word set according to commodity names and commodity attributes;
and performing text preprocessing on the classified candidate word set to generate the candidate search word sequence.
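The sentence numbering, word segmentation, part-of-speech filtering and de-duplication steps above can be sketched as follows. The toy part-of-speech lexicon, the set of dropped tags and the whitespace-style tokenisation are hypothetical stand-ins; a real system would use a proper segmenter and tagger, and the later classification by commodity name and attribute is omitted here.

```python
import re

# Hypothetical toy part-of-speech lexicon; unknown words get the tag "OTHER".
POS = {"the": "DET", "a": "DET", "for": "PREP", "with": "PREP",
       "and": "CONJ", "is": "AUX"}
DROP = {"DET", "PREP", "CONJ", "AUX"}  # the "preset parts of speech" to delete


def candidate_words(text):
    """Return de-duplicated (word, sentence_no, position, tag) records."""
    sentences = [s.strip() for s in re.split(r"[.!?;]+", text) if s.strip()]
    records = []
    for s_no, sentence in enumerate(sentences, start=1):   # number the sentences
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        for position, word in enumerate(tokens):           # record word positions
            records.append((word, s_no, position, POS.get(word, "OTHER")))
    kept = [r for r in records if r[3] not in DROP]        # delete preset POS
    seen, unique = set(), []
    for r in kept:                                         # de-duplicate by word
        if r[0] not in seen:
            seen.add(r[0])
            unique.append(r)
    return unique
```

Keeping the first occurrence of each word preserves its earliest sentence number and position, which later position features could use.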
Optionally, in extracting the feature data of the candidate search word sequence in step two, the extraction module is specifically configured to:
generating a first word vector table by using the trained word vector model;
generating a candidate search word vector sequence corresponding to the candidate search word sequence according to the first word vector table;
dividing the candidate search word vector sequences into n clusters according to the distance between the candidate search word vector sequences;
generating cluster center vectors of the n clusters according to a clustering algorithm;
quantifying the relation between the candidate search word sequence and the clustering center vector according to a distance formula to obtain semantic features of the candidate search word sequence;
extracting language features, word frequency features, length features and position features from the semantic features as the feature data.
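One way to read the clustering steps above is that each candidate's semantic feature is its vector of distances to the n cluster centres. The sketch below illustrates that reading with a small deterministic k-means; the choice of k-means, Euclidean distance, the initialisation from the first n vectors and the two-dimensional example vectors are all assumptions, since the text names neither the clustering algorithm nor the distance formula.

```python
import math


def kmeans(vectors, n, iters=10):
    """Plain k-means; centres start from the first n vectors (assumption)."""
    centres = [list(v) for v in vectors[:n]]
    for _ in range(iters):
        clusters = [[] for _ in range(n)]
        for v in vectors:
            nearest = min(range(n), key=lambda i: math.dist(v, centres[i]))
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:  # leave an empty cluster's centre where it was
                centres[i] = [sum(col) / len(members) for col in zip(*members)]
    return centres


def semantic_features(vector, centres):
    """Distance from one candidate word vector to every cluster centre."""
    return [math.dist(vector, c) for c in centres]
```

In practice the input vectors would come from the first word vector table rather than being written by hand.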
Another aspect of the present invention provides an artificial intelligence based keyword generation method, including:
acquiring commodity description data, and extracting a first search word from the commodity description data;
acquiring potential bid item data of the commodity according to the first search word;
processing the potential bid data by using an image processing algorithm, and filtering out the bid data with the similarity lower than a preset threshold value to obtain bid data;
extracting bid title data from the bid data;
extracting core commodity words from the bid title data;
selecting a first core commodity word with frequency higher than a preset frequency value from the core commodity words by combining a preset search word data set;
and generating keywords corresponding to the commodity according to the first core commodity word and combining keyword generation rules.
Optionally, the step of processing the potential bid data by using an image processing algorithm and filtering the bid data with the similarity lower than a preset threshold value to obtain the bid data includes:
inputting the potential bid data, and marking a similarity identification value I as 0;
judging whether a first similarity value A1 of the potential bid data is larger than a first threshold value or not by using a first similarity judging model;
if the first similarity value A1 is larger than the first threshold value, judging whether a second similarity value A2 of the potential bid data is smaller than a second threshold value by using a second similarity judging model, and judging whether a third similarity value A3 of the potential bid data is smaller than a third threshold value by using a third similarity judging model;
if the second similarity value A2 is smaller than the second threshold value or the third similarity value A3 is smaller than the third threshold value, adding 1 to the similarity identification value I, and calculating a first similarity S1 by using a first similarity calculation method;
if the second similarity value A2 is not smaller than the second threshold value or the third similarity value A3 is not smaller than the third threshold value, calculating the first similarity S1 by using the first similarity calculation method;
the first similarity calculation method comprises the following steps: first similarity S1 = a1 × first similarity value A1 + a2 × second similarity value A2 + a3 × third similarity value A3 + b1 × similarity identification value I, wherein a1, a2, a3 and b1 are weight coefficients greater than 0 and a1 + a2 + a3 + b1 = 1;
if the first similarity value A1 is not greater than the first threshold value, processing the image data in the potential bidding data by using an image processing algorithm to obtain the potential bidding image data;
judging whether a fourth similarity value A4 of the potential bidding image data is smaller than a fourth threshold value by using a fourth similarity judging model, and judging whether a fifth similarity value A5 of the potential bidding image data is smaller than a fifth threshold value by using a fifth similarity judging model;
if the fourth similarity value A4 is not smaller than the fourth threshold value or the fifth similarity value A5 is not smaller than the fifth threshold value, adding 1 to the similarity identification value I, and calculating a second similarity S2 by using a second similarity calculation method;
if the fourth similarity value A4 is smaller than the fourth threshold value or the fifth similarity value A5 is smaller than the fifth threshold value, calculating the second similarity S2 by using the second similarity calculation method;
the second similarity calculation method comprises the following steps: second similarity S2 = a6 × first similarity value A1 + a4 × fourth similarity value A4 + a5 × fifth similarity value A5 + b2 × similarity identification value I, wherein a4, a5, a6 and b2 are weight coefficients greater than 0 and a4 + a5 + a6 + b2 = 1;
judging whether the first similarity S1 or the second similarity S2 is not smaller than the preset threshold, if yes, marking the potential bid data as similar, and if not, marking the potential bid data as dissimilar;
and extracting all data marked as similar in the potential bid data as the bid data.
Optionally, the step of acquiring commodity description data and extracting the first search word from the commodity description data includes:
step one: classifying the commodity description data according to commodity names and commodity attributes, and performing text preprocessing on the classified commodity description data to generate candidate search word sequences;
step two: extracting feature data of the candidate search word sequence, and labeling the feature data to obtain a labeled sample set and a non-labeled sample set;
step three: using the labeling sample set as a training set, and training a search word classification model by using a neural network;
step four: classifying and predicting candidate search words in the unlabeled sample set by using the trained search word classification model, and calculating the matching degree of each unlabeled sample;
step five: selecting the corresponding unlabeled sample with the matching degree exceeding a preset matching degree value, adding the unlabeled sample into the training set, and retraining the search word classification model;
step six: repeating step four and step five until the proportion of unlabeled samples whose matching degree is higher than the preset matching degree value exceeds a preset proportion, so as to obtain a final search word classification model;
step seven: and inputting the characteristic data of the commodity description data into the final search word classification model for processing, and extracting the first search word from the processing result.
Optionally, step one (classifying the commodity description data according to commodity names and commodity attributes, and performing text preprocessing on the classified commodity description data to generate candidate search word sequences) includes:
extracting text data from the commodity description data;
counting and numbering all sentences in the text data;
dividing the sentence into a plurality of words, and recording the position information of the words in the sentence;
analyzing and marking the part of speech of the words;
deleting a first word with a preset part of speech from the words to obtain a modified word set;
performing de-duplication operation on the modified word set to obtain a candidate word set;
classifying the candidate word set according to commodity names and commodity attributes;
and performing text preprocessing on the classified candidate word set to generate the candidate search word sequence.
Optionally, extracting the feature data of the candidate search word sequence in step two includes:
generating a first word vector table by using the trained word vector model;
generating a candidate search word vector sequence corresponding to the candidate search word sequence according to the first word vector table;
dividing the candidate search word vector sequences into n clusters according to the distance between the candidate search word vector sequences;
generating cluster center vectors of the n clusters according to a clustering algorithm;
quantifying the relation between the candidate search word sequence and the clustering center vector according to a distance formula to obtain semantic features of the candidate search word sequence;
extracting language features, word frequency features, length features and position features from the semantic features as the feature data.
By adopting the above technical scheme, commodity description data is acquired and a first search word is extracted from it; potential bid item data of the commodity is acquired according to the first search word; the potential bid data is processed by using an image processing algorithm, and the data whose similarity is lower than a preset threshold value is filtered out to obtain bid data; bid title data is extracted from the bid data; core commodity words are extracted from the bid title data; a first core commodity word whose frequency is higher than a preset frequency value is selected from the core commodity words in combination with a preset search word data set; and keywords corresponding to the commodity are generated according to the first core commodity word in combination with keyword generation rules. With this scheme, bid data and market data can be collected automatically and intelligently and commodity keywords can be edited automatically, which greatly reduces manual operation and improves the efficiency of generating commodity texts.
Drawings
FIG. 1 is a schematic block diagram of an artificial intelligence based keyword generation system provided in one embodiment of the present invention;
FIG. 2 is a flow chart of an artificial intelligence based keyword generation method provided by one embodiment of the present invention;
FIG. 3 is a flowchart of an artificial intelligence based keyword generation method according to another embodiment of the present invention;
FIG. 4 is a flowchart of an artificial intelligence based keyword generation method according to another embodiment of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
An artificial intelligence based keyword generation system and method according to some embodiments of the present invention are described below with reference to FIGS. 1 to 4.
As shown in FIG. 1, one embodiment of the present invention provides an artificial intelligence based keyword generation system, comprising: an extraction module, a data processing module and a generation module;
the extraction module is configured to:
acquiring commodity description data, and extracting a first search word from the commodity description data;
acquiring potential bid item data of the commodity according to the first search word;
the data processing module is configured to:
processing the potential bid data by using an image processing algorithm, and filtering out the bid data with the similarity lower than a preset threshold value to obtain bid data;
extracting bid title data from the bid data;
extracting core commodity words from the bid title data;
selecting a first core commodity word with frequency higher than a preset frequency value from the core commodity words by combining a preset search word data set;
the generation module is configured to: and generating keywords corresponding to the commodity according to the first core commodity word and combining keyword generation rules.
It may be appreciated that in this embodiment the extraction module (e.g., a crawler module) may obtain the commodity description data (e.g., content that introduces the commodity, such as a commodity specification or a commodity scheme) from a network platform and/or an e-commerce platform and/or a network server, and extract from it a first search word, search sentence or search text, such as a commodity identifier, a commodity name or a commodity attribute.
Then, potential bid item data of the commodity is acquired according to the first search word. That is, a text search is run on the corresponding network platform and/or e-commerce platform and/or network server and/or service site to find as much data as possible on identical or similar commodities; this data enters the data acquisition system as potential bid item data, and different dictionary libraries are established from it along different dimensions.
Because the amount of bid data acquired according to the first search word is too large, it needs to be filtered and screened: the data processing module processes the potential bid data by using an image processing algorithm and filters out the data whose similarity is lower than a preset threshold value to obtain the bid data.
Then the data processing module, in combination with a pre-established commodity word stock, attribute word stock and the like, extracts bid title data from the bid data and extracts core commodity words from the bid title data. The commodity word stock contains millions of words, mainly multi-element words; the attribute word stock contains words along various dimensions of a commodity, such as brand, material, appearance, shape, color and applicability. The historical search word data set provided by the back end of the e-commerce platform is stored in a database once obtained, and an inverted index is established over it to improve the response efficiency of the interface.
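As an illustration of the inverted index mentioned above, the sketch below maps each token to the set of historical search terms that contain it, so a lookup intersects a few small sets instead of scanning every term. The in-memory dictionary, the whitespace tokenisation and the example terms are assumptions; the text does not say how the database or the index is implemented.

```python
from collections import defaultdict


def build_inverted_index(search_terms):
    """Map each token to the ids of the search terms that contain it."""
    index = defaultdict(set)
    for term_id, term in enumerate(search_terms):
        for token in term.lower().split():
            index[token].add(term_id)
    return index


def lookup(index, search_terms, query):
    """Return the stored search terms containing every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return []
    ids = set.intersection(*(index.get(t, set()) for t in tokens))
    return [search_terms[i] for i in sorted(ids)]
```

For Chinese search terms the split step would be replaced by a word segmenter, but the index structure stays the same.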
Then, a first core commodity word whose frequency is higher than a preset frequency value is selected from the core commodity words in combination with a preset search word data set. Through research and statistical analysis, the applicant found that the commodity name is generally the word with the highest word frequency in the commodity text and appears mostly in the title. Among the search words of a commodity, the words for applicable crowds and applicable scenes are highly concentrated, so they can be obtained by establishing a fixed word stock (i.e., the search word data set) and matching against it. These words occupy fixed positions in the title and the matched words are concentrated. The fixed word stock can be established as follows: a batch of initial seed words is screened, related applicable-crowd and applicable-scene words are mined iteratively, and the search word data set is established. Manual intervention can be added during the iteration to remove irrelevant words in time. Applicable-crowd words and applicable-scene words differ noticeably in their context within commodity names/titles: for example, applicable-crowd words often appear in the names/titles of commodities such as toys, gifts, clothes and jewelry, while applicable-scene words appear more often for electronic products. Word vectors and context words can therefore assist in manually distinguishing applicable-scene words from applicable-crowd words, completing the construction of the word stock.
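The frequency-based selection can be sketched as below. The text does not define exactly how "frequency" is measured or how the search word data set is combined with it; this sketch assumes that frequency means the number of bid titles containing a word, and that a word must also appear in the search word data set to be kept. The function and parameter names are hypothetical.

```python
from collections import Counter


def select_first_core_words(core_words_per_title, search_word_set, min_count):
    """Return (word, title_count) pairs, most frequent first."""
    counts = Counter()
    for words in core_words_per_title:
        counts.update(set(words))  # count each word at most once per title
    return [(word, count) for word, count in counts.most_common()
            if count > min_count and word in search_word_set]
```

Raising min_count trades recall for precision: fewer words survive, but those that do are shared by more competing titles.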
Commodity attribute words are generally used to describe the selling points or features of a commodity, and include relatively important commodity attributes, characteristic descriptions of various commodities and the like. The definition of characteristic words is fuzzy and tolerant of error, and such words can be extracted in combination with the attribute list of each commodity.
And finally, the generation module generates keywords corresponding to the commodity according to the first core commodity word and the keyword generation rule.
Using the first core commodity words extracted in the previous step, such as core keywords, characteristic words, brand words, applicable-crowd words and applicable-scene words, and combining the keyword/title generation rules of each e-commerce platform, suitable descriptive keywords/titles of the corresponding commodity are generated that satisfy the differing keyword/title requirements of each platform.
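As an illustration of platform-specific generation rules, the sketch below assumes that a rule consists of a slot order and a maximum title length. The rule table `PLATFORM_RULES`, the platform names and the slot names are invented for the example; the embodiment does not disclose the actual rule format.

```python
# Hypothetical rules: each platform orders word types differently and caps title length.
PLATFORM_RULES = {
    "platform_a": {"order": ["brand", "core", "feature", "crowd", "scene"], "max_chars": 60},
    "platform_b": {"order": ["core", "brand", "scene"], "max_chars": 30},
}

def generate_title(words_by_type, platform):
    """Assemble a title by filling slots in the platform's order within its length limit."""
    rule = PLATFORM_RULES[platform]
    parts = []
    for slot in rule["order"]:
        for word in words_by_type.get(slot, []):
            candidate = " ".join(parts + [word])
            # Skip any word that would push the title past the platform limit.
            if len(candidate) <= rule["max_chars"]:
                parts.append(word)
    return " ".join(parts)

words = {
    "brand": ["Acme"],
    "core": ["bluetooth headphones"],
    "feature": ["noise cancelling"],
    "crowd": ["for kids"],
    "scene": ["office"],
}
title_a = generate_title(words, "platform_a")
title_b = generate_title(words, "platform_b")
```

The same extracted words yield different titles per platform, which is the behaviour the paragraph above describes.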
By adopting the technical scheme of the embodiment, the bid data and the market data can be automatically and intelligently acquired, commodity keywords can be automatically edited, manual operation is greatly reduced, and the efficiency of generating commodity texts is improved.
It should be understood that the block diagram of the artificial intelligence based keyword generation system shown in fig. 1 is only illustrative, and the number of the illustrated modules does not limit the scope of the present invention.
In some possible embodiments of the present invention, in the step of processing the potential bid data by using an image processing algorithm and filtering out the bid data with a similarity lower than a preset threshold to obtain bid data, the data processing module is specifically configured to:
inputting the potential bid data, and marking a similarity identification value I as 0;
judging whether a first similarity value A1 of the potential bid data is larger than a first threshold value or not by using a first similarity judging model;
if the first similarity value A1 is larger than the first threshold value, judging whether a second similarity value A2 of the potential bid data is smaller than a second threshold value by using a second similarity judging model, and judging whether a third similarity value A3 of the potential bid data is smaller than a third threshold value by using a third similarity judging model;
if the second similarity value A2 is smaller than the second threshold value or the third similarity value A3 is smaller than the third threshold value, adding 1 to the similarity identification value I, and calculating a first similarity S1 by using a first similarity calculation method;
if the second similarity value A2 is not smaller than the second threshold value or the third similarity value A3 is not smaller than the third threshold value, calculating the first similarity S1 by using the first similarity calculation method;
the first similarity calculation method is: S1 = a1 × A1 + a2 × A2 + a3 × A3 + b1 × I, wherein A1, A2 and A3 are the first, second and third similarity values, I is the similarity identification value, and a1, a2, a3 and b1 are weight coefficients greater than 0 with a1 + a2 + a3 + b1 = 1;
If the first similarity value A1 is not greater than the first threshold value, processing the image data in the potential bidding data by using an image processing algorithm to obtain the potential bidding image data;
judging whether a fourth similarity value A4 of the potential bidding image data is smaller than a fourth threshold value by using a fourth similarity judging model, and judging whether a fifth similarity value A5 of the potential bidding image data is smaller than a fifth threshold value by using a fifth similarity judging model;
if the fourth similarity value A4 is not smaller than the fourth threshold value or the fifth similarity value A5 is not smaller than the fifth threshold value, adding 1 to the similarity identification value I, and calculating a second similarity S2 by using a second similarity calculation method;
if the fourth similarity value A4 is smaller than the fourth threshold value or the fifth similarity value A5 is smaller than the fifth threshold value, calculating the second similarity S2 by using the second similarity calculation method;
the second similarity calculation method is: S2 = a6 × A1 + a4 × A4 + a5 × A5 + b2 × I, wherein A1, A4 and A5 are the first, fourth and fifth similarity values, I is the similarity identification value, and a4, a5, a6 and b2 are weight coefficients greater than 0 with a4 + a5 + a6 + b2 = 1;
judging whether the first similarity S1 or the second similarity S2 is not smaller than the preset threshold, if yes, marking the potential bid data as similar, and if not, marking the potential bid data as dissimilar;
and extracting all data marked as similar in the potential bid data as the bid data.
It can be understood that from two dimensions of the text and the image, multiple models can be constructed according to their respective characteristics to calculate the similarity, and finally the results given by the multiple models are weighted and summed to determine whether the results are truly similar bid products.
In this embodiment, a first similarity judgment model is first used for a preliminary judgment (which may be a similarity judgment on text data). When the first similarity value A1 is greater than the first threshold (e.g. 80%), a secondary judgment is made with models of other dimensions/precision, or with models trained by other algorithms, to improve accuracy: a second similarity judgment model is used to judge whether the second similarity value A2 of the potential bid data is smaller than a second threshold, and/or a third similarity judgment model is used to judge whether the third similarity value A3 of the potential bid data is smaller than a third threshold. The second and third similarity judgment models may be models that judge the similarity of text data (or models of other dimensions). In other words, potential bid data whose text comparison similarity is high is judged further with models of other precision/dimensions. When the second similarity value A2 is smaller than the second threshold (e.g. 60%) or the third similarity value A3 is smaller than the third threshold (e.g. 50%), the preliminary judgment may have been erroneous, so 1 is added to the similarity identification value I to reduce the weight of the previous three judgment models, and the first similarity S1 is calculated by the first similarity calculation method.
If the second similarity value A2 is not smaller than the second threshold or the third similarity value A3 is not smaller than the third threshold, the first similarity S1 is calculated by the first similarity calculation method. In some embodiments, the second and third similarity judgment models may both be models that judge the similarity of image data, or one of them may judge the similarity of image data while the other judges the similarity of text data (or is another kind of model).
It may be appreciated that if the first similarity value A1 is not greater than the first threshold, the image data in the potential bid data is processed by an image processing algorithm to obtain potential bid image data; a fourth similarity judgment model is used to judge whether the fourth similarity value A4 of the potential bid image data is smaller than a fourth threshold, and a fifth similarity judgment model is used to judge whether the fifth similarity value A5 of the potential bid image data is smaller than a fifth threshold. If the fourth similarity value A4 is not smaller than the fourth threshold or the fifth similarity value A5 is not smaller than the fifth threshold, the preliminary judgment may have been erroneous, so 1 is added to the similarity identification value I to reduce the weights of the outputs of the first, fourth and fifth similarity judgment models, and the second similarity S2 is calculated by the second similarity calculation method. If the fourth similarity value A4 is smaller than the fourth threshold or the fifth similarity value A5 is smaller than the fifth threshold, the second similarity S2 is calculated by the second similarity calculation method. In this embodiment, image-data similarity judgment is added through two models of different precision (or trained with different algorithms), which improves the judgment accuracy and avoids missing real bid data merely because the text comparison result is poor.
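The two-branch judgment described above can be sketched as follows. This is only an illustration under stated assumptions: the first three thresholds use the example values given in the text (80%, 60%, 50%), while the fourth and fifth thresholds, the preset threshold and all weight values are invented for the example, and the similarity values A1 to A5 are supplied directly rather than produced by real models. The sketch follows the two formulas literally; because b1 and b2 are positive, the identification value I raises the score when it is set.

```python
# Thresholds: t1..t3 follow the examples in the text; t4, t5 and "preset" are assumptions.
THRESHOLDS = {"t1": 0.8, "t2": 0.6, "t3": 0.5, "t4": 0.6, "t5": 0.5, "preset": 0.7}
# Hypothetical weights; each group is positive and sums to 1 as the formulas require.
W1 = {"a1": 0.4, "a2": 0.25, "a3": 0.25, "b1": 0.1}
W2 = {"a6": 0.2, "a4": 0.35, "a5": 0.35, "b2": 0.1}

def first_similarity(A1, A2, A3, th=THRESHOLDS, w=W1):
    """S1 = a1*A1 + a2*A2 + a3*A3 + b1*I, with I set when A2 or A3 is below its threshold."""
    I = 1 if (A2 < th["t2"] or A3 < th["t3"]) else 0
    return w["a1"] * A1 + w["a2"] * A2 + w["a3"] * A3 + w["b1"] * I

def second_similarity(A1, A4, A5, th=THRESHOLDS, w=W2):
    """S2 = a6*A1 + a4*A4 + a5*A5 + b2*I, with I set when A4 or A5 reaches its threshold."""
    I = 1 if (A4 >= th["t4"] or A5 >= th["t5"]) else 0
    return w["a6"] * A1 + w["a4"] * A4 + w["a5"] * A5 + w["b2"] * I

def is_similar(scores, th=THRESHOLDS):
    """Pick the text branch or the image branch from A1, then compare with the preset threshold."""
    if scores["A1"] > th["t1"]:
        s = first_similarity(scores["A1"], scores["A2"], scores["A3"], th)
    else:
        s = second_similarity(scores["A1"], scores["A4"], scores["A5"], th)
    return s >= th["preset"]

text_match = {"A1": 0.9, "A2": 0.8, "A3": 0.7}
image_match = {"A1": 0.5, "A4": 0.9, "A5": 0.8}
no_match = {"A1": 0.5, "A4": 0.2, "A5": 0.1}
```

In the second example the text score alone would not qualify the candidate, yet the image branch still marks it as similar, which corresponds to the stated goal of not missing real bid data because of a poor text comparison.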
In some possible embodiments of the present invention, in the step of obtaining the commodity description data and extracting the first search term from the commodity description data, the extracting module is specifically configured to:
step one: classifying the commodity description data according to commodity names and commodity attributes, and performing text preprocessing on the classified commodity description data to generate candidate search word sequences;
step two: extracting feature data of the candidate search word sequence, and labeling the feature data to obtain a labeled sample set and a non-labeled sample set;
step three: using the labeling sample set as a training set, and training a search word classification model by using a neural network;
step four: classifying and predicting candidate search words in the unlabeled sample set by using the trained search word classification model, and calculating the matching degree of each unlabeled sample;
step five: selecting the corresponding unlabeled sample with the matching degree exceeding a preset matching degree value, adding the unlabeled sample into the training set, and retraining the search word classification model;
step six: repeating the fourth step to the fifth step until the proportion of the matching degree of each unlabeled sample, which is higher than the preset matching degree value, exceeds the preset proportion, so as to obtain a final search word classification model;
Step seven: and inputting the characteristic data of the commodity description data into the final search word classification model for processing, and extracting the first search word from the processing result.
It may be appreciated that in this embodiment, after the feature data of the candidate search word sequence is extracted, one part of the feature data is labeled to obtain a labeled sample set, and the remaining part forms an unlabeled sample set. The search word classification model is first trained on the labeled sample set by a neural network, and is then trained further with data from the unlabeled sample set until the proportion of unlabeled samples whose matching degree is higher than the preset matching degree value exceeds the preset proportion, thereby improving the performance of the search word classification model.
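Steps three to six amount to a self-training loop, which can be sketched as below. To keep the example self-contained, a nearest-centroid classifier is swapped in for the neural network named in the embodiment, and the matching degree is assumed to be a normalized score derived from distances to the class centroids. All function names, parameter defaults and sample data are hypothetical.

```python
import math

def train_centroids(samples):
    """Stand-in for neural network training: one mean vector per label."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(model, x):
    """Return (label, matching degree); the matching degree is an assumed normalized score."""
    weights = {y: math.exp(-math.dist(x, c)) for y, c in model.items()}
    label = max(weights, key=weights.get)
    return label, weights[label] / sum(weights.values())

def self_train(labeled, unlabeled, min_match=0.7, target_ratio=0.8, max_rounds=10):
    train, added = list(labeled), set()
    model = train_centroids(train)                                   # step three
    for _ in range(max_rounds):
        preds = [predict(model, x) for x in unlabeled]               # step four
        confident = [i for i, (_, m) in enumerate(preds) if m > min_match]
        for i in confident:                                          # step five
            if i not in added:
                train.append((unlabeled[i], preds[i][0]))
                added.add(i)
        model = train_centroids(train)
        # Step six: stop once enough unlabeled samples exceed the preset matching degree.
        if not unlabeled or len(confident) / len(unlabeled) > target_ratio:
            break
    return model

labeled = [((0.0, 0.0), "search word"), ((4.0, 4.0), "not search word")]
unlabeled = [(0.5, 0.2), (3.8, 4.1), (0.1, 0.4)]
model = self_train(labeled, unlabeled)
```

The `max_rounds` cap is an added safeguard that the embodiment does not mention; without it the loop would not terminate if the preset proportion were never reached.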
In some possible embodiments of the invention, the step one: classifying the commodity description data according to commodity names and commodity attributes, and performing text preprocessing on the classified commodity description data to generate candidate search word sequences, wherein the extraction module is specifically configured to:
extracting text data from the commodity description data;
Counting and numbering all sentences in the text data;
dividing the sentence into a plurality of words, and recording the position information of the words in the sentence;
analyzing and marking the part of speech of the words;
deleting a first word with a preset part of speech from the words to obtain a modified word set;
performing de-duplication operation on the modified word set to obtain a candidate word set;
classifying the candidate word set according to commodity names and commodity attributes;
and performing text preprocessing on the classified candidate word set to generate the candidate search word sequence.
It can be understood that, in order to improve the accuracy of text recognition and judgment, in this embodiment text data is extracted from the commodity description data; all sentences in the text data are counted and numbered; each sentence is divided into a plurality of words, and the position information of the words in the sentence is recorded; the part of speech of each word is analyzed and marked; first words, namely words that belong to a preset part of speech (such as adjectives, adverbs, pronouns and auxiliary words) and are meaningless for keyword generation, are deleted from the words to obtain a modified word set; a de-duplication operation is performed on the modified word set to obtain a candidate word set; the candidate word set is classified according to commodity names and commodity attributes; and text preprocessing is performed on the classified candidate word set to generate the candidate search word sequence.
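The sentence numbering, word segmentation, part-of-speech filtering and de-duplication described above can be sketched as follows. The toy part-of-speech lexicon, the default tag for unknown words and the regular-expression tokenizer are assumptions standing in for a real segmenter and tagger; the later classification by commodity name and attribute is omitted.

```python
import re

# Toy part-of-speech lexicon (assumption); a real system would use a trained tagger.
POS_LEXICON = {"it": "PRON", "is": "AUX", "very": "ADV", "soft": "ADJ"}
# Preset parts of speech to delete, following the examples given in the text.
DROP_POS = {"ADJ", "ADV", "PRON", "AUX"}

def candidate_words(text):
    """Number sentences, split them into words with positions, filter by POS, de-duplicate."""
    sentences = [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]
    kept, seen = [], set()
    for sentence_no, sentence in enumerate(sentences, start=1):
        for position, word in enumerate(re.findall(r"\w+", sentence.lower())):
            pos = POS_LEXICON.get(word, "NOUN")  # unknown words default to NOUN (assumption)
            if pos in DROP_POS or word in seen:
                continue
            seen.add(word)
            kept.append({"word": word, "sentence": sentence_no,
                         "position": position, "pos": pos})
    return kept

candidates = candidate_words("It is very soft cotton. Cotton towel, kids gift!")
```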
In some possible embodiments of the present invention, in the extracting feature data of the candidate search term sequence in the second step, the extracting module is specifically configured to:
generating a first word vector table by using the trained word vector model;
generating a candidate search word vector sequence corresponding to the candidate search word sequence according to the first word vector table;
dividing the candidate search word vector sequences into n clusters according to the distance between the candidate search word vector sequences;
generating cluster center vectors of the n clusters according to a clustering algorithm;
quantifying the relation between the candidate search word sequence and the clustering center vector according to a distance formula to obtain semantic features of the candidate search word sequence;
extracting language features, word frequency features, length features and position features from the semantic features as the feature data.
It can be understood that, in this embodiment, in order to improve the efficiency and accuracy of feature data extraction, the candidate search word sequence is vectorized so that vector operations can be performed, generating a candidate search word vector sequence corresponding to the candidate search word sequence; the candidate search word vector sequences are divided into n clusters according to the distances between them; cluster center vectors of the n clusters are generated according to a clustering algorithm, and the relation between the candidate search word sequence and the cluster center vectors is quantified according to a Euclidean distance formula to obtain semantic features of the candidate search word sequence; language features, word frequency features, length features and position features are extracted from the semantic features as the feature data.
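The clustering and distance steps can be illustrated with a small sketch. A plain k-means loop is used as the clustering algorithm, which is an assumption because the embodiment does not name one; the initialization from the first n vectors, the function names and the toy two-dimensional vectors are likewise hypothetical. The Euclidean distance follows the text.

```python
import math

def kmeans(vectors, n, iterations=10):
    """Plain k-means with deterministic initialization from the first n vectors."""
    centers = [list(v) for v in vectors[:n]]
    for _ in range(iterations):
        clusters = [[] for _ in range(n)]
        for v in vectors:
            nearest = min(range(n), key=lambda k: math.dist(v, centers[k]))
            clusters[nearest].append(v)
        for k, members in enumerate(clusters):
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def semantic_features(vector, centers):
    """Quantify a word's relation to each cluster center by Euclidean distance."""
    return [math.dist(vector, c) for c in centers]

# Toy 2-dimensional word vectors standing in for a trained word vector table.
word_vectors = [(0, 0), (10, 10), (0, 1), (10, 11)]
centers = kmeans(word_vectors, n=2)
features = semantic_features((0, 0.5), centers)
```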
Referring to fig. 2, another embodiment of the present invention provides an artificial intelligence based keyword generation method, which includes:
acquiring commodity description data, and extracting a first search word from the commodity description data;
acquiring potential bid item data of the commodity according to the first search word;
processing the potential bid data by using an image processing algorithm, and filtering out the bid data with the similarity lower than a preset threshold value to obtain bid data;
extracting bid title data from the bid data;
extracting core commodity words from the bid title data;
selecting a first core commodity word with frequency higher than a preset frequency value from the core commodity words by combining a preset search word data set;
and generating keywords corresponding to the commodity according to the first core commodity word and combining keyword generation rules.
It may be appreciated that in this embodiment, commodity description data (e.g. content that introduces the commodity, such as a commodity specification or commodity scheme) may be obtained by an extraction module (e.g. a crawler module) from a network platform and/or an e-commerce platform and/or a network server, and a first search word, search sentence or search text, such as a commodity identifier, commodity name or commodity attribute, is extracted from the commodity description data.
And then, acquiring potential bid item data of the commodity according to the first search word, namely searching in a corresponding network platform and/or an electronic commerce platform and/or a network server and/or a service site through text information, searching as many same commodity or similar commodity data as possible, entering the same commodity or similar commodity data into a data acquisition system as potential bid item data, and establishing different dictionary libraries according to the potential bid item data based on different dimensions.
Because the amount of potential bid data acquired according to the first search word is too large, filtering and screening are needed: the potential bid data can be processed by an image processing algorithm, and the bid data is obtained after the potential bid data whose similarity is lower than a preset threshold is filtered out.
Then, in combination with a pre-established commodity word stock, an attribute word stock and the like, bid title data can be extracted from the bid data, and core commodity words can be extracted from the bid title data. The commodity word stock data has millions of words, mainly multi-element words; the attribute word stock comprises word data of various dimensions such as brands, materials, appearance, shape, color, applicability and the like of the commodities. After the historical search word data set provided by the background of the e-commerce platform is obtained, it is stored in a database, and an inverted index is established to improve the response efficiency of the interface.
Next, a first core commodity word whose frequency is higher than a preset frequency value is selected from the core commodity words in combination with a preset search word data set. Research and statistical analysis by the applicant found that the commodity name is generally the word with the highest word frequency in the commodity document and mostly appears in the title. Among the search words for a commodity, the words describing the applicable crowd and the applicable scene are highly concentrated, so they can be obtained by establishing a fixed word stock (namely the search word data set) and matching against it. These words occupy fixed positions in the title, and the matched words are concentrated. The fixed word stock can be established as follows: a batch of initial seed words is screened, related applicable-crowd and applicable-scene words are mined iteratively, and the search word data set is established. Manual intervention can be added during the iteration so that irrelevant words are removed in time. Applicable-crowd words and applicable-scene words differ markedly in their context within commodity names/titles: for example, applicable-crowd words often appear in the names/titles of commodities such as toys, gifts, clothes and jewelry, whereas electronic products more often carry applicable-scene words. Accordingly, word vectors and context words are used to assist in manually distinguishing applicable-scene words from applicable-crowd words, which completes the construction of the word stock.
The commodity attribute words are generally used for explaining the selling points or features of commodities, and include relatively important commodity attributes, characteristic descriptions of various commodities and the like. The definition of the characteristic words is fuzzy, the fault tolerance is strong, and the characteristic words can be extracted by combining with the attribute list of each commodity.
And finally, generating keywords corresponding to the commodity according to the first core commodity word and combining keyword generation rules.
Using the first core commodity words extracted in the previous step, such as core keywords, characteristic words, brand words, applicable-crowd words and applicable-scene words, and combining the keyword/title generation rules of each e-commerce platform, suitable descriptive keywords/titles of the corresponding commodity are generated that satisfy the differing keyword/title requirements of each platform.
The technical scheme of this embodiment comprises: acquiring commodity description data and extracting a first search word from the commodity description data; acquiring potential bid data of the commodity according to the first search word; processing the potential bid data by using an image processing algorithm, and filtering out the bid data with the similarity lower than a preset threshold value to obtain bid data; extracting bid title data from the bid data; extracting core commodity words from the bid title data; selecting a first core commodity word with frequency higher than a preset frequency value from the core commodity words by combining a preset search word data set; and generating keywords corresponding to the commodity according to the first core commodity word in combination with keyword generation rules. By adopting this scheme, bid data and market data can be collected automatically and intelligently, commodity keywords can be edited automatically, manual operation is greatly reduced, and the efficiency of generating commodity texts is improved.
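The order of the method steps can be summarized as a small orchestration sketch in which every stage is passed in as a callable. The function `generate_keywords`, its parameters, the dictionary layout of a candidate and the stub stages in the usage lines are all hypothetical; they only show how the steps chain together, not how each stage is implemented.

```python
from collections import Counter

def generate_keywords(description, extract_search_term, fetch_candidates, similarity,
                      extract_core_words, build_keywords,
                      similarity_threshold=0.7, min_frequency=1):
    """Chain the method steps; every stage is supplied by the caller."""
    search_term = extract_search_term(description)             # first search word
    candidates = fetch_candidates(search_term)                 # potential bid data
    bids = [c for c in candidates                              # similarity filtering
            if similarity(description, c) >= similarity_threshold]
    titles = [c["title"] for c in bids]                        # bid title data
    counts = Counter(w for t in titles for w in extract_core_words(t))
    first_core = [w for w, n in counts.most_common() if n > min_frequency]
    return build_keywords(first_core)                          # rule-based generation

# Stub stages for illustration only.
result = generate_keywords(
    "bluetooth headphones",
    extract_search_term=lambda d: d.split()[0],
    fetch_candidates=lambda term: [
        {"title": "bluetooth headphones sport", "score": 0.9},
        {"title": "bluetooth headphones kids", "score": 0.8},
        {"title": "garden hose", "score": 0.1},
    ],
    similarity=lambda d, c: c["score"],
    extract_core_words=str.split,
    build_keywords=" ".join,
)
```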
In some possible embodiments of the present invention, the step of processing the potential bid data by using an image processing algorithm and filtering the bid data with the similarity lower than a preset threshold to obtain bid data includes:
inputting the potential bid data, and marking a similarity identification value I as 0;
judging whether a first similarity value A1 of the potential bid data is larger than a first threshold value or not by using a first similarity judging model;
if the first similarity value A1 is larger than the first threshold value, judging whether a second similarity value A2 of the potential bid data is smaller than a second threshold value by using a second similarity judging model, and judging whether a third similarity value A3 of the potential bid data is smaller than a third threshold value by using a third similarity judging model;
if the second similarity value A2 is smaller than the second threshold value or the third similarity value A3 is smaller than the third threshold value, adding 1 to the similarity identification value I, and calculating a first similarity S1 by using a first similarity calculation method;
if the second similarity value A2 is not smaller than the second threshold value or the third similarity value A3 is not smaller than the third threshold value, calculating the first similarity S1 by using the first similarity calculation method;
the first similarity calculation method is: S1 = a1 × A1 + a2 × A2 + a3 × A3 + b1 × I, wherein A1, A2 and A3 are the first, second and third similarity values, I is the similarity identification value, and a1, a2, a3 and b1 are weight coefficients greater than 0 with a1 + a2 + a3 + b1 = 1;
If the first similarity value A1 is not greater than the first threshold value, processing the image data in the potential bidding data by using an image processing algorithm to obtain the potential bidding image data;
judging whether a fourth similarity value A4 of the potential bidding image data is smaller than a fourth threshold value by using a fourth similarity judging model, and judging whether a fifth similarity value A5 of the potential bidding image data is smaller than a fifth threshold value by using a fifth similarity judging model;
if the fourth similarity value A4 is not smaller than the fourth threshold value or the fifth similarity value A5 is not smaller than the fifth threshold value, adding 1 to the similarity identification value I, and calculating a second similarity S2 by using a second similarity calculation method;
if the fourth similarity value A4 is smaller than the fourth threshold value or the fifth similarity value A5 is smaller than the fifth threshold value, calculating the second similarity S2 by using the second similarity calculation method;
the second similarity calculation method is: S2 = a6 × A1 + a4 × A4 + a5 × A5 + b2 × I, wherein A1, A4 and A5 are the first, fourth and fifth similarity values, I is the similarity identification value, and a4, a5, a6 and b2 are weight coefficients greater than 0 with a4 + a5 + a6 + b2 = 1;
judging whether the first similarity S1 or the second similarity S2 is not smaller than the preset threshold, if yes, marking the potential bid data as similar, and if not, marking the potential bid data as dissimilar;
and extracting all data marked as similar in the potential bid data as the bid data.
It can be understood that from two dimensions of the text and the image, multiple models can be constructed according to their respective characteristics to calculate the similarity, and finally the results given by the multiple models are weighted and summed to determine whether the results are truly similar bid products.
In this embodiment, a first similarity judgment model is first used for a preliminary judgment (which may be a similarity judgment on text data). When the first similarity value A1 is greater than the first threshold (e.g. 80%), a secondary judgment is made with models of other dimensions/precision, or with models trained by other algorithms, to improve accuracy: a second similarity judgment model is used to judge whether the second similarity value A2 of the potential bid data is smaller than a second threshold, and/or a third similarity judgment model is used to judge whether the third similarity value A3 of the potential bid data is smaller than a third threshold. The second and third similarity judgment models may be models that judge the similarity of text data (or models of other dimensions). In other words, potential bid data whose text comparison similarity is high is judged further with models of other precision/dimensions. When the second similarity value A2 is smaller than the second threshold (e.g. 60%) or the third similarity value A3 is smaller than the third threshold (e.g. 50%), the preliminary judgment may have been erroneous, so 1 is added to the similarity identification value I to reduce the weight of the previous three judgment models, and the first similarity S1 is calculated by the first similarity calculation method.
If the second similarity value A2 is not smaller than the second threshold or the third similarity value A3 is not smaller than the third threshold, the first similarity S1 is calculated by the first similarity calculation method. In some embodiments, the second and third similarity judgment models may both be models that judge the similarity of image data, or one of them may judge the similarity of image data while the other judges the similarity of text data (or is another kind of model).
It may be appreciated that if the first similarity value A1 is not greater than the first threshold, the image data in the potential bid data is processed by an image processing algorithm to obtain potential bid image data; a fourth similarity judgment model is used to judge whether the fourth similarity value A4 of the potential bid image data is smaller than a fourth threshold, and a fifth similarity judgment model is used to judge whether the fifth similarity value A5 of the potential bid image data is smaller than a fifth threshold. If the fourth similarity value A4 is not smaller than the fourth threshold or the fifth similarity value A5 is not smaller than the fifth threshold, the preliminary judgment may have been erroneous, so 1 is added to the similarity identification value I to reduce the weights of the outputs of the first, fourth and fifth similarity judgment models, and the second similarity S2 is calculated by the second similarity calculation method. If the fourth similarity value A4 is smaller than the fourth threshold or the fifth similarity value A5 is smaller than the fifth threshold, the second similarity S2 is calculated by the second similarity calculation method. In this embodiment, image-data similarity judgment is added through two models of different precision (or trained with different algorithms), which improves the judgment accuracy and avoids missing real bid data merely because the text comparison result is poor.
Referring to fig. 3, in some possible embodiments of the present invention, the step of obtaining the product description data and extracting the first search word from the product description data includes:
step one: classifying the commodity description data according to commodity names and commodity attributes, and performing text preprocessing on the classified commodity description data to generate candidate search word sequences;
step two: extracting feature data of the candidate search word sequence, and labeling the feature data to obtain a labeled sample set and a non-labeled sample set;
step three: using the labeling sample set as a training set, and training a search word classification model by using a neural network;
step four: classifying and predicting candidate search words in the unlabeled sample set by using the trained search word classification model, and calculating the matching degree of each unlabeled sample;
step five: selecting the corresponding unlabeled sample with the matching degree exceeding a preset matching degree value, adding the unlabeled sample into the training set, and retraining the search word classification model;
step six: repeating the fourth step to the fifth step until the proportion of the matching degree of each unlabeled sample, which is higher than the preset matching degree value, exceeds the preset proportion, so as to obtain a final search word classification model;
Step seven: and inputting the characteristic data of the commodity description data into the final search word classification model for processing, and extracting the first search word from the processing result.
It may be appreciated that in this embodiment, after the feature data of the candidate search word sequence is extracted, one part of the feature data is labeled to obtain a labeled sample set, and the remaining part forms an unlabeled sample set. The search word classification model is first trained on the labeled sample set by a neural network, and is then trained further with data from the unlabeled sample set until the proportion of unlabeled samples whose matching degree is higher than the preset matching degree value exceeds the preset proportion, thereby improving the performance of the search word classification model.
Referring to Fig. 4, in some possible embodiments of the present invention, step one (classifying the commodity description data according to commodity names and commodity attributes, and performing text preprocessing on the classified commodity description data to generate candidate search word sequences) comprises the following steps:
extracting text data from the commodity description data;
counting and numbering all sentences in the text data;
dividing the sentence into a plurality of words, and recording the position information of the words in the sentence;
analyzing and marking the part of speech of the words;
deleting a first word with a preset part of speech from the words to obtain a modified word set;
performing de-duplication operation on the modified word set to obtain a candidate word set;
classifying the candidate word set according to commodity names and commodity attributes;
and performing text preprocessing on the classified candidate word set to generate the candidate search word sequence.
It can be understood that, in order to improve the accuracy of text recognition and judgment, in this embodiment text data is extracted from the commodity description data; all sentences in the text data are counted and numbered; each sentence is divided into a plurality of words, and the position information of the words in the sentence is recorded; the part of speech of the words is analyzed and marked; first words that have a preset part of speech (such as adjectives, adverbs, pronouns, auxiliary words and the like) and contribute nothing meaningful to the keywords are deleted from the words to obtain a modified word set; a de-duplication operation is performed on the modified word set to obtain a candidate word set; the candidate word set is classified according to commodity names and commodity attributes; and text preprocessing is performed on the classified candidate word set to generate the candidate search word sequence.
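The word-level part of this preprocessing (sentence numbering, word positions, part-of-speech filtering and de-duplication) can be sketched as below. It is an illustrative sketch only: `TOY_POS` is a hypothetical hand-written lexicon standing in for a real part-of-speech tagger, unknown words are assumed to be nouns, and the classification by commodity name and attribute is left out.

```python
import re

# Hypothetical toy lexicon; a real system would use a part-of-speech tagger.
TOY_POS = {
    "very": "adverb",
    "soft": "adjective",
    "it": "pronoun",
    "is": "auxiliary",
}
# Preset parts of speech whose words are deleted (assumed set).
DROPPED_POS = {"adjective", "adverb", "pronoun", "auxiliary"}

def candidate_words(text):
    """Return de-duplicated candidate words with sentence number and position."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    seen = set()
    candidates = []
    for sentence_no, sentence in enumerate(sentences, start=1):
        words = re.findall(r"\w+", sentence.lower())
        for position, word in enumerate(words):
            pos = TOY_POS.get(word, "noun")   # assumption: unknown words are nouns
            if pos in DROPPED_POS or word in seen:
                continue                      # POS filter and de-duplication
            seen.add(word)
            candidates.append({"word": word, "sentence": sentence_no,
                               "position": position, "pos": pos})
    return candidates

result = candidate_words("Very soft cotton towel. It is soft cotton bath towel.")
```

Keeping the sentence number and position with each surviving word preserves the information that the later position features rely on.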
In some possible embodiments of the present invention, extracting the feature data of the candidate search word sequence in step two includes:
generating a first word vector table by using the trained word vector model;
generating a candidate search word vector sequence corresponding to the candidate search word sequence according to the first word vector table;
dividing the candidate search word vector sequences into n clusters according to the distance between the candidate search word vector sequences;
generating cluster center vectors of the n clusters according to a clustering algorithm;
quantifying the relation between the candidate search word sequence and the clustering center vector according to a distance formula to obtain semantic features of the candidate search word sequence;
extracting language features, word frequency features, length features and position features from the semantic features as the feature data.
It can be understood that, in this embodiment, in order to improve the efficiency and accuracy of feature data extraction, the candidate search word sequence is vectorized so that vector operations can be performed on it: a first word vector table is generated by the trained word vector model, and the candidate search word vector sequence corresponding to the candidate search word sequence is generated from that table; the candidate search word vector sequences are divided into n clusters according to the distance between them; cluster center vectors of the n clusters are generated according to a clustering algorithm; the relation between the candidate search word sequence and the cluster center vectors is quantified according to the Euclidean distance formula to obtain semantic features of the candidate search word sequence; and language features, word frequency features, length features and position features are extracted from the semantic features as the feature data.
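The clustering and distance steps can be illustrated with a small sketch. The source names only "a clustering algorithm", so k-means is an assumption here, as are the deterministic initialization from the first n vectors and the function names `kmeans` and `semantic_features`; the word vectors are toy two-dimensional values rather than output of a trained word vector model.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, n, rounds=20):
    """Plain k-means; the first n vectors seed the centers (assumption)."""
    centers = [list(v) for v in vectors[:n]]
    for _ in range(rounds):
        clusters = [[] for _ in range(n)]
        for v in vectors:
            nearest = min(range(n), key=lambda i: euclidean(v, centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def semantic_features(vector, centers):
    """Distance from one candidate vector to every cluster center."""
    return [euclidean(vector, c) for c in centers]

vectors = [[0.0, 0.0], [5.0, 5.0], [0.0, 1.0], [5.0, 6.0]]
centers = kmeans(vectors, n=2)
features = semantic_features([0.0, 0.5], centers)
```

Each candidate thus receives one distance per cluster center, which is one plausible way to read "quantifying the relation between the candidate search word sequence and the clustering center vector".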
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the order of actions described, as some steps may be performed in another order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of units is merely a division of logical functions, and other divisions are possible in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices or units, and may be in electrical or other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units described above, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be implemented by a program that instructs associated hardware, and the program may be stored in a computer readable memory, which may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The embodiments of the present application have been described in detail above, and specific examples are used herein to illustrate the principles and implementations of the present application; the above examples are provided solely to assist in understanding the methods of the present application and their core ideas. Meanwhile, those skilled in the art may make changes to the specific implementations and the scope of application in accordance with the ideas of the present application. In view of the above, the content of this description should not be construed as limiting the present application.
Although the present invention is disclosed above, the present invention is not limited thereto. Variations and modifications, including combinations of the different functions and implementation steps, as well as embodiments of the software and hardware, may be readily apparent to those skilled in the art without departing from the spirit and scope of the invention.

Claims (8)

1. An artificial intelligence based keyword generation system, comprising: an extraction module, a data processing module and a generation module;
the extraction module is configured to:
acquiring commodity description data, and extracting a first search word from the commodity description data;
acquiring potential bid item data of the commodity according to the first search word;
the data processing module is configured to:
processing the potential bid data by using an image processing algorithm, and filtering out the bid data with the similarity lower than a preset threshold value to obtain bid data;
extracting bid title data from the bid data;
extracting core commodity words from the bid title data;
selecting a first core commodity word with frequency higher than a preset frequency value from the core commodity words by combining a preset search word data set;
the generation module is configured to: generating keywords corresponding to the commodity according to the first core commodity word and combining keyword generation rules;
in the step of processing the potential bid data by using an image processing algorithm and filtering the bid data with the similarity lower than a preset threshold value to obtain the bid data, the data processing module is specifically configured to:
inputting the potential bid data, and marking a similarity identification value I as 0;
judging whether a first similarity value A1 of the potential bid data is larger than a first threshold value or not by using a first similarity judging model;
if the first similarity value A1 is larger than the first threshold value, judging whether a second similarity value A2 of the potential bid data is smaller than a second threshold value by using a second similarity judging model, and judging whether a third similarity value A3 of the potential bid data is smaller than a third threshold value by using a third similarity judging model;
if the second similarity value A2 is smaller than the second threshold value or the third similarity value A3 is smaller than the third threshold value, adding 1 to the similarity identification value I, and calculating a first similarity S1 by using a first similarity calculation method;
if the second similarity value A2 is not smaller than the second threshold value or the third similarity value A3 is not smaller than the third threshold value, calculating the first similarity S1 by using the first similarity calculation method;
the first similarity calculation method comprises the following steps: the first similarity S1 = a1 × first similarity value A1 + a2 × second similarity value A2 + a3 × third similarity value A3 + b1 × similarity identification value I, wherein a1, a2, a3 and b1 are weight coefficients greater than 0 and a1 + a2 + a3 + b1 = 1;
if the first similarity value A1 is not greater than the first threshold value, processing the image data in the potential bidding data by using an image processing algorithm to obtain the potential bidding image data;
judging whether a fourth similarity value A4 of the potential bidding image data is smaller than a fourth threshold value by using a fourth similarity judging model, and judging whether a fifth similarity value A5 of the potential bidding image data is smaller than a fifth threshold value by using a fifth similarity judging model;
if the fourth similarity value A4 is not smaller than the fourth threshold value or the fifth similarity value A5 is not smaller than the fifth threshold value, adding 1 to the similarity identification value I, and calculating a second similarity S2 by using a second similarity calculation method;
if the fourth similarity value A4 is smaller than the fourth threshold value or the fifth similarity value A5 is smaller than the fifth threshold value, calculating the second similarity S2 by using the second similarity calculation method;
the second similarity calculation method comprises the following steps: the second similarity S2 = a6 × first similarity value A1 + a4 × fourth similarity value A4 + a5 × fifth similarity value A5 + b2 × similarity identification value I, wherein a4, a5, a6 and b2 are weight coefficients greater than 0 and a4 + a5 + a6 + b2 = 1;
judging whether the first similarity S1 or the second similarity S2 is not smaller than the preset threshold, if yes, marking the potential bid data as similar, and if not, marking the potential bid data as dissimilar;
and extracting all data marked as similar in the potential bid data as the bid data.
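The two-branch similarity decision recited above can be sketched as follows. This is a hedged reading, not the claimed implementation: the second condition of each pair is treated as the "otherwise" case, the five similarity values are assumed to be precomputed numbers in a dictionary, and the function name, dictionary keys and all numeric values are hypothetical.

```python
def combined_similarity(values, thresholds, weights, preset_threshold):
    """Return (score, is_similar) following the two-branch rule.

    values:     dict with keys "A1".."A5" (precomputed similarity values)
    thresholds: dict with keys "T1".."T5" (first to fifth threshold values)
    weights:    dict with keys "a1".."a6", "b1", "b2"
    """
    identification = 0  # similarity identification value I starts at 0
    if values["A1"] > thresholds["T1"]:
        # First branch: I is raised when A2 or A3 falls below its threshold.
        if values["A2"] < thresholds["T2"] or values["A3"] < thresholds["T3"]:
            identification += 1
        score = (weights["a1"] * values["A1"] + weights["a2"] * values["A2"]
                 + weights["a3"] * values["A3"] + weights["b1"] * identification)
    else:
        # Second branch (image based): I is raised when A4 or A5 is not below
        # its threshold, as the text states it.
        if values["A4"] >= thresholds["T4"] or values["A5"] >= thresholds["T5"]:
            identification += 1
        score = (weights["a6"] * values["A1"] + weights["a4"] * values["A4"]
                 + weights["a5"] * values["A5"] + weights["b2"] * identification)
    return score, score >= preset_threshold

# Illustrative numbers only; weights in each branch sum to 1.
weights = {"a1": 0.4, "a2": 0.2, "a3": 0.2, "b1": 0.2,
           "a4": 0.3, "a5": 0.3, "a6": 0.2, "b2": 0.2}
thresholds = {"T1": 0.6, "T2": 0.6, "T3": 0.6, "T4": 0.6, "T5": 0.6}
s1, similar1 = combined_similarity(
    {"A1": 0.9, "A2": 0.5, "A3": 0.8, "A4": 0.0, "A5": 0.0},
    thresholds, weights, preset_threshold=0.7)
s2, similar2 = combined_similarity(
    {"A1": 0.5, "A2": 0.0, "A3": 0.0, "A4": 0.7, "A5": 0.2},
    thresholds, weights, preset_threshold=0.7)
```

With these numbers the first item scores 0.4 × 0.9 + 0.2 × 0.5 + 0.2 × 0.8 + 0.2 × 1 = 0.82 and is marked similar, while the second scores 0.2 × 0.5 + 0.3 × 0.7 + 0.3 × 0.2 + 0.2 × 1 = 0.57 and is marked dissimilar.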
2. The artificial intelligence based keyword generation system of claim 1, wherein in the step of obtaining commodity description data and extracting a first search term from the commodity description data, the extraction module is specifically configured to:
step one: classifying the commodity description data according to commodity names and commodity attributes, and performing text preprocessing on the classified commodity description data to generate candidate search word sequences;
step two: extracting feature data of the candidate search word sequence, and labeling the feature data to obtain a labeled sample set and a non-labeled sample set;
step three: using the labeling sample set as a training set, and training a search word classification model by using a neural network;
step four: classifying and predicting candidate search words in the unlabeled sample set by using the trained search word classification model, and calculating the matching degree of each unlabeled sample;
step five: selecting the corresponding unlabeled sample with the matching degree exceeding a preset matching degree value, adding the unlabeled sample into the training set, and retraining the search word classification model;
step six: repeating step four to step five until the proportion of unlabeled samples whose matching degree is higher than the preset matching degree value exceeds a preset proportion, so as to obtain a final search word classification model;
step seven: inputting the feature data of the commodity description data into the final search word classification model for processing, and extracting the first search word from the processing result.
3. The artificial intelligence based keyword generation system of claim 2, wherein in step one (classifying the commodity description data according to commodity names and commodity attributes, and performing text preprocessing on the classified commodity description data to generate candidate search word sequences), the extraction module is specifically configured to:
extracting text data from the commodity description data;
counting and numbering all sentences in the text data;
dividing the sentence into a plurality of words, and recording the position information of the words in the sentence;
analyzing and marking the part of speech of the words;
deleting a first word with a preset part of speech from the words to obtain a modified word set;
performing de-duplication operation on the modified word set to obtain a candidate word set;
classifying the candidate word set according to commodity names and commodity attributes;
and performing text preprocessing on the classified candidate word set to generate the candidate search word sequence.
4. The artificial intelligence based keyword generation system of claim 3, wherein in the extracting feature data of the candidate search word sequence in the second step, the extracting module is specifically configured to:
generating a first word vector table by using the trained word vector model;
generating a candidate search word vector sequence corresponding to the candidate search word sequence according to the first word vector table;
dividing the candidate search word vector sequences into n clusters according to the distance between the candidate search word vector sequences;
generating cluster center vectors of the n clusters according to a clustering algorithm;
quantifying the relation between the candidate search word sequence and the clustering center vector according to a distance formula to obtain semantic features of the candidate search word sequence;
extracting language features, word frequency features, length features and position features from the semantic features as the feature data.
5. An artificial intelligence based keyword generation method, comprising the following steps:
acquiring commodity description data, and extracting a first search word from the commodity description data;
acquiring potential bid item data of the commodity according to the first search word;
processing the potential bid data by using an image processing algorithm, and filtering out the bid data with the similarity lower than a preset threshold value to obtain bid data;
extracting bid title data from the bid data;
extracting core commodity words from the bid title data;
selecting a first core commodity word with frequency higher than a preset frequency value from the core commodity words by combining a preset search word data set;
generating keywords corresponding to the commodity according to the first core commodity word and combining keyword generation rules;
the step of processing the potential bid data by using an image processing algorithm and filtering the bid data with similarity lower than a preset threshold value to obtain the bid data comprises the following steps:
inputting the potential bid data, and marking a similarity identification value I as 0;
judging whether a first similarity value A1 of the potential bid data is larger than a first threshold value or not by using a first similarity judging model;
if the first similarity value A1 is larger than the first threshold value, judging whether a second similarity value A2 of the potential bid data is smaller than a second threshold value by using a second similarity judging model, and judging whether a third similarity value A3 of the potential bid data is smaller than a third threshold value by using a third similarity judging model;
if the second similarity value A2 is smaller than the second threshold value or the third similarity value A3 is smaller than the third threshold value, adding 1 to the similarity identification value I, and calculating a first similarity S1 by using a first similarity calculation method;
if the second similarity value A2 is not smaller than the second threshold value or the third similarity value A3 is not smaller than the third threshold value, calculating the first similarity S1 by using the first similarity calculation method;
the first similarity calculation method comprises the following steps: the first similarity S1 = a1 × first similarity value A1 + a2 × second similarity value A2 + a3 × third similarity value A3 + b1 × similarity identification value I, wherein a1, a2, a3 and b1 are weight coefficients greater than 0 and a1 + a2 + a3 + b1 = 1;
if the first similarity value A1 is not greater than the first threshold value, processing the image data in the potential bidding data by using an image processing algorithm to obtain the potential bidding image data;
judging whether a fourth similarity value A4 of the potential bidding image data is smaller than a fourth threshold value by using a fourth similarity judging model, and judging whether a fifth similarity value A5 of the potential bidding image data is smaller than a fifth threshold value by using a fifth similarity judging model;
if the fourth similarity value A4 is not smaller than the fourth threshold value or the fifth similarity value A5 is not smaller than the fifth threshold value, adding 1 to the similarity identification value I, and calculating a second similarity S2 by using a second similarity calculation method;
if the fourth similarity value A4 is smaller than the fourth threshold value or the fifth similarity value A5 is smaller than the fifth threshold value, calculating the second similarity S2 by using the second similarity calculation method;
the second similarity calculation method comprises the following steps: the second similarity S2 = a6 × first similarity value A1 + a4 × fourth similarity value A4 + a5 × fifth similarity value A5 + b2 × similarity identification value I, wherein a4, a5, a6 and b2 are weight coefficients greater than 0 and a4 + a5 + a6 + b2 = 1;
judging whether the first similarity S1 or the second similarity S2 is not smaller than the preset threshold, if yes, marking the potential bid data as similar, and if not, marking the potential bid data as dissimilar;
and extracting all data marked as similar in the potential bid data as the bid data.
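The title-processing steps of this method (extracting core commodity words from the retained titles and keeping those whose frequency is higher than a preset value) can be sketched as below. The sketch rests on assumptions: whitespace tokenization stands in for core commodity word extraction, "combining a preset search word data set" is read as keeping only words found in that set, and `first_core_words`, `search_words` and `min_count` are hypothetical names.

```python
from collections import Counter

def first_core_words(titles, search_words, min_count):
    """Count title words found in the preset set; keep those above min_count."""
    counts = Counter()
    for title in titles:
        # Whitespace split is a stand-in for core commodity word extraction.
        counts.update(w for w in title.lower().split() if w in search_words)
    return [word for word, count in counts.most_common() if count > min_count]

titles = ["cotton bath towel", "soft cotton towel set", "beach towel"]
selected = first_core_words(titles, {"towel", "cotton", "beach"}, min_count=1)
```

Here "towel" appears three times and "cotton" twice, so both pass a preset frequency value of 1, while "beach" appears once and is dropped.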
6. The artificial intelligence based keyword generation method of claim 5, wherein the step of obtaining commodity description data and extracting a first search term from the commodity description data comprises:
step one: classifying the commodity description data according to commodity names and commodity attributes, and performing text preprocessing on the classified commodity description data to generate candidate search word sequences;
step two: extracting feature data of the candidate search word sequence, and labeling the feature data to obtain a labeled sample set and a non-labeled sample set;
step three: using the labeling sample set as a training set, and training a search word classification model by using a neural network;
step four: classifying and predicting candidate search words in the unlabeled sample set by using the trained search word classification model, and calculating the matching degree of each unlabeled sample;
step five: selecting the corresponding unlabeled sample with the matching degree exceeding a preset matching degree value, adding the unlabeled sample into the training set, and retraining the search word classification model;
step six: repeating step four to step five until the proportion of unlabeled samples whose matching degree is higher than the preset matching degree value exceeds a preset proportion, so as to obtain a final search word classification model;
step seven: inputting the feature data of the commodity description data into the final search word classification model for processing, and extracting the first search word from the processing result.
7. The artificial intelligence based keyword generation method of claim 6, wherein step one (classifying the commodity description data according to commodity names and commodity attributes, and performing text preprocessing on the classified commodity description data to generate candidate search word sequences) comprises the following steps:
extracting text data from the commodity description data;
counting and numbering all sentences in the text data;
dividing the sentence into a plurality of words, and recording the position information of the words in the sentence;
analyzing and marking the part of speech of the words;
deleting a first word with a preset part of speech from the words to obtain a modified word set;
performing de-duplication operation on the modified word set to obtain a candidate word set;
classifying the candidate word set according to commodity names and commodity attributes;
and performing text preprocessing on the classified candidate word set to generate the candidate search word sequence.
8. The method of claim 7, wherein the extracting feature data of the candidate search word sequence in the second step comprises:
generating a first word vector table by using the trained word vector model;
generating a candidate search word vector sequence corresponding to the candidate search word sequence according to the first word vector table;
dividing the candidate search word vector sequences into n clusters according to the distance between the candidate search word vector sequences;
generating cluster center vectors of the n clusters according to a clustering algorithm;
quantifying the relation between the candidate search word sequence and the clustering center vector according to a distance formula to obtain semantic features of the candidate search word sequence;
extracting language features, word frequency features, length features and position features from the semantic features as the feature data.
CN202211294577.9A 2022-10-21 2022-10-21 Keyword generation system and method based on artificial intelligence Active CN115470322B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211294577.9A CN115470322B (en) 2022-10-21 2022-10-21 Keyword generation system and method based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211294577.9A CN115470322B (en) 2022-10-21 2022-10-21 Keyword generation system and method based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN115470322A CN115470322A (en) 2022-12-13
CN115470322B true CN115470322B (en) 2023-05-05

Family

ID=84336356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211294577.9A Active CN115470322B (en) 2022-10-21 2022-10-21 Keyword generation system and method based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN115470322B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984554A (en) * 2017-06-01 2018-12-11 北京京东尚科信息技术有限公司 Method and apparatus for determining keyword
CN111191022A (en) * 2019-12-27 2020-05-22 苏宁云计算有限公司 Method and device for generating short titles of commodities
CN113343684A (en) * 2021-06-22 2021-09-03 广州华多网络科技有限公司 Core product word recognition method and device, computer equipment and storage medium
WO2022134759A1 (en) * 2020-12-21 2022-06-30 深圳壹账通智能科技有限公司 Keyword generation method and apparatus, and electronic device and computer storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10565533B2 (en) * 2014-05-09 2020-02-18 Camelot Uk Bidco Limited Systems and methods for similarity and context measures for trademark and service mark analysis and repository searches
CN113468414A (en) * 2021-06-07 2021-10-01 广州华多网络科技有限公司 Commodity searching method and device, computer equipment and storage medium
CN113570413B (en) * 2021-07-28 2023-12-05 杭州王道控股有限公司 Advertisement keyword generation method and device, storage medium and electronic equipment
CN114579896A (en) * 2022-03-04 2022-06-03 拉扎斯网络科技(上海)有限公司 Generation method and display method of recommended label, corresponding device and electronic equipment
CN114663164A (en) * 2022-04-12 2022-06-24 广州欢聚时代信息科技有限公司 E-commerce site popularization and configuration method and device, equipment, medium and product thereof

Also Published As

Publication number Publication date
CN115470322A (en) 2022-12-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant