CN113821631A - Commodity matching method based on big data - Google Patents

Commodity matching method based on big data

Info

Publication number
CN113821631A
CN113821631A (application CN202110074878.XA; granted as CN113821631B)
Authority
CN
China
Prior art keywords
data
text data
text
feature
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110074878.XA
Other languages
Chinese (zh)
Other versions
CN113821631B (en)
Inventor
甘洪霖 (Gan Honglin)
梁天爵 (Liang Tianjue)
王睿 (Wang Rui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Information Network Co ltd
Original Assignee
Guangdong Information Network Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Information Network Co ltd filed Critical Guangdong Information Network Co ltd
Priority to CN202110074878.XA priority Critical patent/CN113821631B/en
Publication of CN113821631A publication Critical patent/CN113821631A/en
Application granted granted Critical
Publication of CN113821631B publication Critical patent/CN113821631B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/35 Clustering; Classification (of unstructured textual data)
    • G06F 16/313 Selection or weighting of terms for indexing
    • G06F 16/374 Thesaurus
    • G06F 16/75 Clustering; Classification (of video data)
    • G06F 16/7844 Retrieval of video data using metadata automatically derived from original textual content, text extracted from visual content or transcripts of audio data
    • G06F 40/216 Parsing using statistical methods
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES
    • G06Q 30/0623 Electronic shopping [e-shopping]: item investigation

Abstract

The invention discloses a commodity matching method based on big data. The texts of the commodity data and the texts corresponding to the videos are respectively combined into a plurality of text data, and TF feature vectors are extracted by establishing a TF feature matrix, which reduces the mutual interference between identical keywords that arises when one video contains data on several products. The commodity matching method based on big data can be widely applied in the field of data processing.

Description

Commodity matching method based on big data
Technical Field
The invention relates to the field of data processing, in particular to a commodity matching method based on big data.
Background
The continuous development of electronic commerce confronts people with a huge volume of data in complex forms, and the ways in which data are associated keep changing, which poses great challenges for data processing. Take the abundance of network commodities in the internet environment as an example: as commodity diversity grows, the same network commodity often exists in various forms on different network platforms, information on similar commodities is scattered across the platforms, and the search space for network commodities keeps expanding, so consumers spend considerable effort searching for and integrating the commodity information that interests them. Associating the commodity information scattered across the platforms would facilitate this search and integration and spare consumers the heavy search work. To correlate commodity information from different network platforms, the primary task is commodity matching, i.e. finding the different representations of the same network commodity on different platforms.
Currently, algorithms for network commodity matching mainly include the WHIRL (Word-based Heterogeneous Information Representation Language) algorithm, TMWM (Title Model Words Method) and SSM (Synthesized Similarity Method). The WHIRL algorithm mainly models commodity references with TF-IDF and a vector space model and calculates the similarity between commodities; TMWM incorporates the normalized edit distance between commodity names as well as the similarity between the "model word" sets extracted from the names. SSM effectively combines information acquired from the commodity name with information acquired from the key-value pairs of commodity attributes to construct a comprehensive similarity measure, and finally decides whether two commodities match according to the similarity value. The Agglomerative Hierarchical Clustering (AHC) framework introduced on top of these can match multi-source commodity data without sacrificing matching accuracy.
Taking the WHIRL algorithm as an example, it has the following problems: 1. network commodities are usually spread over many different network platforms, but the algorithm can only process commodity data from two data providers at a time and cannot handle data from many platforms; 2. network commodity data are unstructured and heterogeneous, whereas the algorithm can only compare structured data such as commodity names and cannot compute similarity for unstructured commodity data; 3. modeling commodity references with TF-IDF and a vector space model is largely limited to text data, and the algorithm is not aimed at the big data of shopping platforms that has grown with the rise of social media, which in particular contains a large amount of video data.
Disclosure of Invention
In order to solve the above technical problems, the invention aims to provide a big-data-based method for matching commodities whose commodity data contains video data.
The technical scheme adopted by the invention is as follows: a commodity matching method based on big data comprises the following steps:
acquiring commodity data, wherein the commodity data comprises first text data and video data;
acquiring second text data corresponding to audio data in the video data, and constructing a classification data set according to the first text data and the second text data;
using partial data in the classification data set as training data of an SVM classifier to perform classification labeling and Chinese word segmentation, and dividing each type of data in the training data into a training set and a test set;
aiming at each type of data in the training set, combining all the first text data and all the second text data into third text data, calculating key words of the third text data and TF-IDF values corresponding to the key words, and sequencing the key words in a descending manner according to the TF-IDF values to obtain corresponding word lists and IDF characteristic vectors corresponding to the word lists;
aiming at each type of data in a training set, combining all first text data with at least one of second text data respectively to form a plurality of fourth text data, calculating TF feature matrices of the plurality of fourth text data, and extracting TF feature vectors from the TF feature matrices according to a preset mode;
obtaining TF-IDF characteristic vectors corresponding to the word list according to the IDF characteristic vectors and the TF characteristic vectors of each type of data in the training set;
establishing a keyword list and a feature dictionary according to the word list of each type of data in the training set and the corresponding TF-IDF feature vector;
mapping the first text data and the second text data in the training set to a keyword list to obtain a text characteristic vector;
and training an SVM classifier by using the text feature vectors generated by the data corresponding to the training set and the test set, and classifying the classified data set according to the finally obtained SVM classifier so as to obtain a commodity matching result.
Further, the step of performing classification labeling and Chinese word segmentation by using partial data in the classification data set as training data of the SVM classifier includes
Labeling the corresponding second text data according to the video data type, wherein:
second text data corresponding to the first video data type is marked as fifth text data, and the first video data corresponds to a video of a single product;
second text data corresponding to a second video data type is marked as sixth text data, and the second video data corresponds to a single product type or a video of a single product brand;
and marking second text data corresponding to a third video data type as seventh text data, wherein the third video data corresponds to videos of a plurality of product categories and product brands.
Further, the combining, for each class of data in the training set, all the first text data with at least one of the second text data respectively as a plurality of fourth text data specifically includes
For each type of data in the training set, all the first text data are combined with each of the sixth text data as a plurality of fourth text data, and all the first text data are combined with each of the seventh text data as a plurality of fourth text data.
Further, the combining, for each class of data in the training set, all the first text data with at least one of the second text data respectively as a plurality of fourth text data specifically includes
For each type of data in the training set, all of the first text data and all of the fifth text data are combined with each of the sixth text data as a plurality of fourth text data, and all of the first text data and all of the fifth text data are combined with each of the seventh text data as a plurality of fourth text data.
Further, the combining, for each class of data in the training set, all the first text data with at least one of the second text data respectively as a plurality of fourth text data specifically includes
And for each type of data in the training set, combining all the first text data with each second text data respectively to form a plurality of fourth text data.
Further, the extracting the TF feature vectors from the TF feature matrix according to the preset mode specifically comprises
And arranging the elements of each column in the TF feature matrix according to the numerical value in a descending order, and extracting diagonal elements in the TF feature matrix to serve as TF feature vectors.
Further, the step of establishing a keyword list and a feature dictionary according to the word list of each type of data in the training set and the corresponding TF-IDF feature vector comprises the steps of
And adding the word with the TF-IDF value larger than the set threshold value in the word list into the keyword list, and adding the word and the TF-IDF value corresponding to the word into the feature dictionary.
Further, in the training process of the SVM classifier, a plurality of set thresholds are adopted for calculation to obtain a plurality of SVM classifiers, and the SVM classifier with the highest classification accuracy is selected for classifying the classification data set so as to obtain a commodity matching result.
Further, the step of mapping the first text data and the second text data in the training set to the keyword list to obtain the text feature vector and construct the text feature vector comprises the steps of
And establishing an initialization vector with the same length as the keyword list for each piece of first text data and second text data in the training set, and assigning the corresponding element value in the vector to the value of the corresponding keyword in the feature dictionary to obtain the text feature vector.
Further, performing power calculation on elements in the text feature vector to obtain an enhanced text feature vector.
The invention has the following beneficial effects: the texts of the commodity data and the texts corresponding to the videos are respectively combined into a plurality of text data, and TF feature vectors are extracted by establishing a TF feature matrix. This reduces the mutual interference between identical keywords that arises when one video contains data on several products, requires no additional word bases or word-vector dictionaries to be constructed or trained in advance, effectively extracts useful information from the videos, and can greatly improve the accuracy of intelligent commodity matching.
Drawings
FIG. 1 is a flow chart of steps of an embodiment of the present invention;
fig. 2 is a flowchart illustrating specific steps of parameter tuning according to an embodiment of the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings:
referring to fig. 1, as an embodiment of the present invention, a commodity matching method based on big data includes the following specific steps:
s1, acquiring commodity data, wherein the commodity data comprises first text data and video data;
the first text data generally comprise information such as the commodity name, the commodity introduction text and the product parameters of a certain product; with the rise of social media, the big data of shopping platforms contains a large amount of video data, such as product introduction videos on e-commerce platforms, live-stream videos of e-commerce merchants, and live-stream videos of e-commerce platform hosts.
S2, second text data corresponding to the audio data in the video data are obtained, and a classification data set is constructed according to the first text data and the second text data;
specific ways of acquiring the second text data may include, but are not limited to, the following ways:
(1) directly acquiring subtitle information in the video data as second text data;
(2) first extracting the audio data from the video data and then converting the audio data into second text data through speech-to-text conversion; speech-to-text conversion is a mature technology and is not described in detail here;
(3) and acquiring second text data corresponding to the pre-processed video data from a third-party database.
S3, using partial data in the classification data set as training data of the SVM classifier to perform classification labeling and Chinese word segmentation, and dividing each class of data in the training data into a training set and a test set;
the classification labels are used for the subsequent training of the SVM classifier, so that the commodity matching accuracy of the SVM classifier's output is higher; Chinese word segmentation can use a conventional algorithm, such as the classic jieba segmenter. According to the classification labeling information, a part of each class of the training data is randomly selected as training set data and the rest is used as test set data; the ratio of training set data to test set data is generally set to 2:1.
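The per-class 2:1 split described above can be sketched in plain Python as follows; the function and variable names are my own, and the fixed seed is used only to make the sketch reproducible:

```python
import random

def split_per_class(labeled_texts, ratio=2/3, seed=42):
    """Randomly split each class of labeled texts into training and test
    subsets at roughly a 2:1 ratio, as described in step S3."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, texts in labeled_texts.items():
        shuffled = texts[:]          # copy so the input is left untouched
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * ratio)
        train[label], test[label] = shuffled[:cut], shuffled[cut:]
    return train, test

# toy labeled data (hypothetical class names)
data = {"phone": ["t%d" % i for i in range(6)],
        "laptop": ["u%d" % i for i in range(3)]}
train, test = split_per_class(data)
```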
s4, aiming at each type of data in the training set, combining all the first text data and all the second text data into third text data, calculating keywords of the third text data and TF-IDF values corresponding to the keywords, and arranging the keywords in a descending order according to the TF-IDF values to obtain a corresponding word list and IDF feature vectors corresponding to the word list;
for example, n first text data (t1, t2, …, tn) and m second text data (v1, v2, …, vm) are combined into one third text data (t1+t2+…+tn+v1+v2+…+vm), and the keywords of the third text data and their TF-IDF values are calculated to obtain the corresponding word list
{word1, word2, …, wordy}
and its corresponding IDF feature vector
[idf1, idf2, …, idfy]
Accordingly, the TF-IDF values corresponding to the word list are:
tfidf1, tfidf2, …, tfidfy
where tfidf1 > tfidf2 > … > tfidfy; a TF-IDF value is the product of the TF and IDF values, i.e. tfidfk = tfk*idfk, k = 1, 2, …, y.
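The per-class word list, its IDF feature vector and the descending TF-IDF ordering can be sketched in plain Python; the smoothed IDF variant and the toy two-class corpus below are my own assumptions, since the patent does not fix an exact IDF formula:

```python
import math

def build_word_list(third_texts, cls):
    """For class `cls`, rank the words of its third text by TF-IDF
    (descending) and return the word list with IDF and TF-IDF vectors.
    IDF is computed across the per-class third texts (an assumption)."""
    docs = [t.split() for t in third_texts.values()]
    doc = third_texts[cls].split()
    n = len(docs)
    scores = {}
    for w in set(doc):
        tf_w = doc.count(w) / len(doc)
        df = sum(1 for d in docs if w in d)
        idf_w = math.log((1 + n) / (1 + df)) + 1   # smoothed IDF, my choice
        scores[w] = (tf_w * idf_w, idf_w)
    ranked = sorted(scores, key=lambda w: scores[w][0], reverse=True)
    word_list = ranked
    idf_vec = [scores[w][1] for w in ranked]
    tfidf_vec = [scores[w][0] for w in ranked]
    return word_list, idf_vec, tfidf_vec

# toy per-class third texts (hypothetical)
third = {"phone": "apple phone huawei phone processor screen",
         "snack": "chips candy candy snack"}
wl, idfv, tv = build_word_list(third, "phone")
```

Here "phone" occurs twice in its class text, so it ranks first; the vectors stay aligned with the word list, as step S4 requires.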
S5, aiming at each type of data in the training set, combining all the first text data with at least one of the second text data respectively to form a plurality of fourth text data, calculating TF feature matrices of the plurality of fourth text data, and extracting TF feature vectors from the TF feature matrices according to a preset mode;
for example, the n first text data (t1, t2, …, tn) are merged respectively with v1, v2 and v3 of the second text data (v1, v2, …, vm) into three fourth text data:
t1+t2+…+tn+v1, t1+t2+…+tn+v2, t1+t2+…+tn+v3
Then the TF feature matrices of the several fourth text data are calculated, and TF feature vectors are extracted from them in a preset manner, where the TF values of each fourth text data correspond to one row of the TF feature matrix.
Taking the above three fourth text data as an example, the TF feature matrix can be expressed as:
[tf11, tf12, …, tf1y,
tf21, tf22, …, tf2y,
tf31, tf32, …, tf3y]
One element is selected from each column of the TF feature matrix to form the TF feature vector, e.g. [tf11, tf12, …, tf3y].
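A minimal sketch of forming the fourth text data and their TF feature matrix (one row per merged text, one column per word-list entry); the length-normalized TF and the toy texts are my own assumptions:

```python
def tf_matrix(first_texts, second_subset, word_list):
    """Build the TF feature matrix of step S5: each fourth text is all
    first texts plus one second text; each row holds its TF values over
    the word list."""
    base = " ".join(first_texts)
    rows = []
    for v in second_subset:
        words = (base + " " + v).split()
        rows.append([words.count(w) / len(words) for w in word_list])
    return rows

# toy data (hypothetical)
first = ["apple phone", "huawei phone"]
second = ["processor review", "screen review", "battery test"]
word_list = ["phone", "apple", "huawei", "processor", "screen", "battery"]
M = tf_matrix(first, second, word_list)
```

Words of the word list that do not occur in a given fourth text yield 0 in that row, which is why the matrix tends to be sparse when there is much video data.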
S6, obtaining TF-IDF characteristic vectors corresponding to the word list according to the IDF characteristic vectors and the TF characteristic vectors of each type of data in the training set;
from the above, the word list {word1, word2, …, wordy} and the corresponding TF-IDF feature vector [tf11*idf1, tf12*idf2, …, tf3y*idfy] can be obtained.
S7, establishing a keyword list and a feature dictionary according to the word list of each type of data in the training set and the TF-IDF feature vector corresponding to the word list;
further, as a preferred embodiment, step S7 specifically includes: and adding words with TF-IDF values larger than a set threshold value in N word lists corresponding to N types of data in the training set into the keyword list, and adding the words and the TF-IDF values corresponding to the words into the feature dictionary.
Further, as a preferred embodiment, since the N word lists may contain repeated words, when a word from a word list is added to the keyword list in step S7 and that word already exists in the keyword list, it is checked whether the word's current TF-IDF value is greater than the value recorded for it in the feature dictionary; if so, the recorded value is replaced by the current one.
Further preferably, step S7 also includes: setting a keyword number threshold numth and calculating, for each class, the number of elements in the intersection of its word list and the keyword list. If that number is less than numth, the word list {word1, word2, …, wordy} is reordered according to its corresponding TF-IDF feature vector [tf11*idf1, tf12*idf2, …, tf3y*idfy], i.e. the values of the TF-IDF feature vector are arranged in descending order and the word list is rearranged accordingly; the top numth words are then added to the keyword list and the feature dictionary is updated accordingly. The word list {word1, word2, …, wordy} must be rearranged here because the original order of its words follows the TF-IDF values [tfidf1, tfidf2, …, tfidfy], whereas the values in the TF-IDF feature vector [tf11*idf1, tf12*idf2, …, tf3y*idfy] are modified values and may no longer be in descending order.
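Step S7 and its refinements (threshold filtering, keeping the larger TF-IDF value for duplicated words, and the numth fallback) could be sketched as follows; all names are my own:

```python
def build_keyword_list(class_word_lists, threshold, num_th):
    """Merge per-class word lists into a global keyword list and feature
    dictionary, keeping the larger TF-IDF value when a word repeats."""
    keywords, feat = [], {}

    def add(word, score):
        if word not in feat:
            keywords.append(word)
            feat[word] = score
        elif score > feat[word]:       # duplicate: keep the larger value
            feat[word] = score

    # pass 1: threshold filtering
    for words, tfidf_vec in class_word_lists:
        for w, s in zip(words, tfidf_vec):
            if s > threshold:
                add(w, s)
    # pass 2: fallback, so every class contributes at least num_th keywords
    for words, tfidf_vec in class_word_lists:
        if sum(1 for w in words if w in feat) < num_th:
            ranked = sorted(zip(words, tfidf_vec),
                            key=lambda p: p[1], reverse=True)
            for w, s in ranked[:num_th]:
                add(w, s)
    return keywords, feat

# toy per-class (word list, TF-IDF feature vector) pairs (hypothetical)
classes = [(["phone", "apple"], [0.9, 0.4]),
           (["candy", "chips"], [0.2, 0.1])]
kw, feat = build_keyword_list(classes, threshold=0.3, num_th=1)
```

In this toy run the second class clears the threshold with no words, so the fallback promotes its top word "candy" despite its low score.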
S8, mapping the first text data and the second text data in the training set to a keyword list to obtain a text feature vector;
further, as a preferred embodiment, for each piece of the first text data and the second text data in the training set, an initialization vector with the same length as the keyword list is established, and the value of the corresponding element in the vector is assigned as the value of the corresponding keyword in the feature dictionary, so as to obtain the text feature vector.
And S9, training an SVM classifier by using the text feature vectors generated by the data corresponding to the training set and the test set, and classifying the classified data set according to the finally obtained SVM classifier so as to obtain a commodity matching result.
In the above embodiments it has been mentioned that, with the rise of social media, the big data of shopping platforms includes a large amount of video data, for example on traditional e-commerce platforms such as Tmall and JD.com as well as on newly risen short-video platforms; as a further preferred embodiment, in step S3 the corresponding second text data are additionally labeled according to the video data type, where:
the second text data corresponding to the first video data type are labeled as fifth text data (v11, v12, …, v1i); the first video data correspond to a video of a single product, for example a product introduction video displayed on a product page reached through a JD.com link;
the second text data corresponding to the second video data type are labeled as sixth text data (v21, v22, …, v2j); the second video data correspond to a video of a single product category or a single product brand, such as a live-stream video of a Tmall merchant, whose content may involve multiple products of that merchant. For example, if the merchant is a mobile phone store, the live content may concern a single product category while the brands may include Apple, Huawei, Xiaomi and so on; if the merchant is a Xiaomi exclusive store, the live content may concern a single brand while the products may include Xiaomi phones, smart watches, rice cookers, sweeping robots and so on.
The second text data corresponding to the third video data type are labeled as seventh text data (v31, v32, …, v3k); the third video data correspond to videos covering multiple product categories and multiple product brands. Examples are live-stream videos of integrated merchants, whose content may involve different product categories and different brands of that merchant, and host-run live streams on the platform, where the product categories/brands involved differ even more and home appliances, cosmetics, snacks and the like may all be mentioned in the same live video.
As a further preferred embodiment, step S5 is implemented, for each class of data in the training set, by combining all the first text data (t1, t2, …, tn) with each sixth text data v2x (x = 1, 2, …, j) as a plurality of fourth text data, and combining all the first text data with each seventh text data v3x (x = 1, 2, …, k) as a plurality of fourth text data, which can be expressed as:
[(t1+t2+…+tn+v21),
(t1+t2+…+tn+v22),
…,
(t1+t2+…+tn+v2j),
(t1+t2+…+tn+v31),
(t1+t2+…+tn+v32),
…,
(t1+t2+…+tn+v3k)]
Then the TF feature matrix corresponding to these fourth text data is calculated:
[tf11, tf12, …, tf1y,
tf21, tf22, …, tf2y,
…,
tfz1, tfz2, …, tfzy]
where z = j + k and the TF values of each fourth text data correspond to one row of the matrix. Since the word list {word1, word2, …, wordy} is obtained from the third text data (t1+t2+…+tn+v1+v2+…+vm) while each fourth text data is (t1+t2+…+tn+vx), some words of the word list may not occur in a given fourth text data, so some positions of each row of the TF feature matrix may be 0; especially when the amount of video data is large, the resulting TF feature matrix is usually a sparse matrix.
Then, extracting the TF feature vector from the TF feature matrix according to a preset mode, where the preset mode for extracting the TF feature vector may be:
and arranging the elements of each column in the TF feature matrix according to the numerical value in a descending order, and extracting diagonal elements in the TF feature matrix to serve as TF feature vectors.
For a TF feature matrix with equal numbers of rows and columns, a simple matrix operation (e.g. the diag function in MATLAB) yields the diagonal elements as the TF feature vector. In practice, however, the numbers of rows and columns of the computed TF feature matrix are usually unequal; suppose the matrix has a rows and b columns:
[tf11, tf12, …, tf1b,
tf21, tf22, …, tf2b,
…,
tfa1, tfa2, …, tfab]
The selected diagonal elements are tf(row, col), where row is the row index and col the column index, and row is determined from col by a mapping formula (given in the original document only as an equation image).
Taking a 3×6 matrix as an example:
[tf11, tf12, …, tf16,
tf21, tf22, …, tf26,
tf31, tf32, …, tf36]
the TF feature vector extracted by the formula is:
[tf11, tf12, tf13, tf24, tf35, tf36]
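The column-sorting and generalized-diagonal extraction can be sketched as follows. The patent's exact row-selection formula survives only as an equation image, so the even-spread mapping below is my own stand-in and does not exactly reproduce the 3×6 example above:

```python
def extract_tf_vector(matrix):
    """Sort each column of the TF feature matrix in descending order,
    then take one element per column along a generalized diagonal
    (even-spread row index: floor(col * a / b), my own assumption)."""
    a, b = len(matrix), len(matrix[0])
    cols = [sorted((row[j] for row in matrix), reverse=True)
            for j in range(b)]
    return [cols[j][min(a - 1, j * a // b)] for j in range(b)]

# toy 3x6 TF feature matrix (hypothetical values)
M = [[0.5, 0.1, 0.0, 0.2, 0.0, 0.3],
     [0.2, 0.4, 0.3, 0.0, 0.1, 0.0],
     [0.0, 0.0, 0.1, 0.1, 0.2, 0.1]]
v = extract_tf_vector(M)
```

Early columns (high-TF-IDF keywords) contribute their largest entries, while late columns contribute entries near the bottom of their sorted order, matching the behaviour described in the discussion below.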
In the present embodiment, all the first text data are combined with each sixth text data, and with each seventh text data, but not with each fifth text data, because the fifth text data correspond to the first video data type: the fifth text data and the first text data both relate to a single product and are essentially the same kind of data. The sixth text data may simultaneously contain, say, Apple and Huawei mobile phone data, and the seventh text data may simultaneously contain, say, cosmetic lipstick data and sweeping-robot data; combining all the first text data with each individual sixth/seventh text data as fourth text data therefore raises the signal-to-noise ratio and improves the calculation accuracy of the final model.
For the second and third video data types, the degree to which texts sharing the same keywords interfere with each other differs between the corresponding sixth and seventh text data. For the seventh text data, the product categories/brands involved in the corresponding live content may differ greatly, so texts with the same keywords actually interfere with each other relatively little; for the sixth text data, the product categories/brands involved differ less, so texts with the same keywords may actually interfere with each other considerably, as the following examples illustrate:
Example 1: Apple and Huawei mobile phones are mentioned in turn in the same video. Because the video introduces the Apple phone and the Huawei phone separately, explaining parts such as the processor and the display screen in each introduction, the extracted keywords may simultaneously include "Apple", "Huawei", "processor" and "display screen". Suppose the first part of the video introduces the Apple phone and the second part the Huawei phone. If the text converted from the first part is treated as a text on its own, the TF-IDF values of the keywords "Apple", "Huawei", "processor" and "display screen" would be [tfidfa, 0, tfidfc1, tfidfd1]; if the text converted from the second part is treated on its own, they would be [0, tfidfb, tfidfc2, tfidfd2]. For the text converted from the whole video, suppose the TF-IDF values are [tfidfa, tfidfb, tfidfc, tfidfd]; then tfidfc ≈ 2×tfidfc1 and tfidfc ≈ 2×tfidfc2. This distorts the TF-IDF values: the mentions of "processor" and "display screen" in the second part, introducing the Huawei phone, are added to those in the first part, introducing the Apple phone, inflating the term frequencies of these keywords and hence their TF-IDF values.
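The distortion in Example 1 can be checked numerically with raw term counts: a keyword shared by both halves of the video ("processor" below) doubles its count in the merged text, while a half-specific keyword ("apple") does not, skewing their relative weights. The toy texts are my own:

```python
def count(term, text):
    """Raw occurrence count of a term in a whitespace-tokenized text."""
    return text.split().count(term)

part1 = "apple phone processor screen processor"    # first half: Apple phone
part2 = "huawei phone processor screen processor"   # second half: Huawei phone
whole = part1 + " " + part2                         # whole-video text

shared = count("processor", whole)    # doubled vs. either half alone
specific = count("apple", whole)      # unchanged vs. the first half
```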
Traditionally, the text converted from a video could be segmented, manually or by semantic analysis, so that each of several video segments maps to its own product, but either approach greatly increases labor cost or algorithmic difficulty.
Example 2: during a live video, viewers typically ask the streamer questions interactively, and the questions and answers often draw comparisons with other brands, so the converted text suffers the same TF-IDF distortion as in example 1.
This embodiment addresses the problem by constructing a TF feature matrix and extracting a TF feature vector from it. Broadly speaking, the keywords of example 1 ranked by descending TF-IDF value are "Apple", "Huawei", "processor", "display screen" and then the remaining keywords. After the TF feature matrix is constructed and the elements of each column are sorted in descending order, the diagonal elements of the matrix are extracted as the TF feature vector. The elements selected from the columns corresponding to the "Apple" and "Huawei" keywords then sit relatively high in their columns; those selected from the columns for the "processor" and "display screen" keywords sit near the middle of their columns; and those selected from the columns of other, less important keywords sit near the bottom, where, since the matrix is usually sparse when the amount of video data is large, the selected elements are small or zero. Extracting the TF feature vector in this way markedly reduces the mutual interference caused by shared keywords in the sixth text data; if similar situations arise in the seventh text data, the interference there is reduced as well.
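A minimal sketch of this extraction step, assuming the TF feature matrix is held as a NumPy array with one row per fourth text and columns ordered by descending TF-IDF of the word list (the example matrix is invented for illustration):

```python
import numpy as np

def extract_tf_feature_vector(tf_matrix):
    """Sort each column in descending order, then take the main diagonal.

    Rows correspond to the combined fourth texts; columns follow the word
    list, ordered by descending TF-IDF. For a wide matrix the diagonal
    stops after min(rows, cols) entries.
    """
    sorted_cols = -np.sort(-tf_matrix, axis=0)   # descending sort, per column
    return np.diagonal(sorted_cols).copy()

# Invented 3x4 TF matrix: a column for an important keyword, one for a
# mid-ranked keyword, and two for unimportant (mostly zero) keywords.
tf_matrix = np.array([
    [0.30, 0.00, 0.10, 0.05],
    [0.00, 0.25, 0.12, 0.00],
    [0.10, 0.05, 0.00, 0.02],
])
print(extract_tf_feature_vector(tf_matrix))
```

The extracted vector takes the top of the first column, the middle of the second, and the bottom (zero) of the third, mirroring the column positions described above.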
As a further preferred embodiment, another specific implementation of step S5 is, for each class of data in the training set, to combine all the first text data (t1, t2, …, tn) and all the fifth text data v1x (x = 1, 2, …, i) with each sixth text data v2x (x = 1, 2, …, j) in turn, and to combine all the first text data and all the fifth text data with each seventh text data v3x (x = 1, 2, …, k) in turn, each combination forming one of a plurality of fourth text data, which can be expressed as:
[(t1+t2+…+tn+v11+v12+…+v1i+v21),
(t1+t2+…+tn+v11+v12+…+v1i+v22),
…,
(t1+t2+…+tn+v11+v12+…+v1i+v2j),
(t1+t2+…+tn+v11+v12+…+v1i+v31),
(t1+t2+…+tn+v11+v12+…+v1i+v32),
…,
(t1+t2+…+tn+v11+v12+…+v1i+v3k)].
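The enumeration above can be sketched as follows (a hypothetical helper, not from the patent; each text is represented simply as a string label):

```python
def build_fourth_texts(first_texts, fifth_texts, sixth_texts, seventh_texts):
    """Combine all first-type and fifth-type texts with each sixth- or
    seventh-type text in turn, one combination per ambiguous text."""
    base = first_texts + fifth_texts          # shared single-product material
    return [base + [v] for v in sixth_texts + seventh_texts]

fourth = build_fourth_texts(
    first_texts=["t1", "t2"],
    fifth_texts=["v11"],
    sixth_texts=["v21", "v22"],
    seventh_texts=["v31"],
)
print(len(fourth))    # j + k = 3 combined texts
print(fourth[0])      # ['t1', 't2', 'v11', 'v21']
```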
then, calculating a TF feature matrix corresponding to the fourth text data:
[tf11, tf12, …, tf1y,
tf21, tf22, …, tf2y,
…,
tfz1, tfz2, …, tfzy]
where z = j + k, and the TF values of each fourth text data form one row of the TF feature matrix.
Then, TF feature vectors are extracted from the TF feature matrix according to a preset manner, which is specifically described in the above embodiments and will not be described herein again.
Unlike the previous embodiment, all the fifth text data are added when combining the fourth text data. As noted above, the fifth text data correspond to the first video data type, so the fifth text data, like the first text data, concern a single product and are essentially no different in nature. Adding all the fifth text data is therefore equivalent to enlarging the set of first text data, which further reduces the interference between shared keywords described earlier and makes the final product-matching result more accurate.
As a further preferred embodiment, yet another specific implementation of step S5 is, for each class of data in the training set, to combine all the first text data (t1, t2, …, tn) with each second text data (v1, v2, …, vm) in turn, each combination forming one of a plurality of fourth text data, which can be expressed as:
[(t1+t2+…+tn+v1),
(t1+t2+…+tn+v2),
…,
(t1+t2+…+tn+vm)].
then, calculating a TF feature matrix corresponding to the fourth text data:
[tf11, tf12, …, tf1y,
tf21, tf22, …, tf2y,
…,
tfm1, tfm2, …, tfmy]
wherein the TF value of each fourth text data corresponds to a row in the TF feature matrix.
Then, TF feature vectors are extracted from the TF feature matrix according to a preset manner, which is specifically described in the above embodiments and will not be described herein again.
Since this embodiment does not consider the video type in step S5, the corresponding second text data need not be labeled by video data type in step S3. Even if they are so labeled in step S3, the embodiment is unaffected: the second text data (v1, v2, …, vm) are simply the union of the fifth text data v1x (x = 1, 2, …, i), the sixth text data v2x (x = 1, 2, …, j) and the seventh text data v3x (x = 1, 2, …, k), i.e. m = i + j + k.
Unlike the two embodiments above, the video type need not be considered when combining the fourth text data; for the following analysis the second text data can be regarded as the union of the fifth, sixth and seventh text data. Combining the first text data with the fifth text data does not actually affect the interference caused by shared keywords, while combining the first text data with the sixth or seventh text data reduces that interference. On the whole, therefore, this embodiment's processing reduces the mutual interference of shared keywords, and since the video type need not be labeled during data preprocessing, the processing is simpler and markedly more efficient, while the final product-matching result can still be more accurate.
As a further preferred embodiment, in step S8 the text feature vector is multiplied by a parameter λ greater than 1, so that its nonzero values are enhanced and a better feature expression effect is achieved.
Further preferably, in step S8, the enhanced text feature vector is obtained by performing a power calculation on the elements in the text feature vector.
Generally, feature enhancement multiplies the text feature vector by a parameter λ to amplify its nonzero values. In the present scheme, however, video data are introduced and different types of video data interfere with one another, so among the values in the vector the larger ones need to be amplified while the smaller ones are suppressed, giving the text feature vector a more prominent feature expression. Since TF-IDF values are usually normalized, simply raising each element of the text feature vector to a power achieves this purpose.
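A small sketch contrasting the two enhancement modes on an invented normalized feature vector: scaling by λ preserves the ratio between large and small entries, while a power greater than 1 widens it:

```python
import numpy as np

vec = np.array([0.9, 0.4, 0.1, 0.0])   # normalized TF-IDF-style features (invented)

scaled  = 2.0 * vec    # multiplying by lambda > 1 enlarges every nonzero entry equally
squared = vec ** 2     # a power > 1 keeps values <= 1 but shrinks small ones faster

# The ratio between a large and a small entry grows under the power,
# but not under plain scaling:
print(scaled[0] / scaled[1])     # ≈ 2.25, unchanged ratio
print(squared[0] / squared[1])   # ≈ 5.06, large/small contrast amplified
```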
Referring to conventional big-data processing algorithms, as a preferred embodiment the parameters of the classifier may also be tuned: the parameters are adjusted and steps S7 to S9 are repeated until the classification accuracy no longer increases. The parameters include a threshold K, and referring to fig. 2, the tuning may adopt the following steps:
Tuning the parameter K: let K0 be 0 and let K10 be the maximum value of the elements in the TF-IDF feature vectors of step S6 (here, the TF-IDF feature vectors corresponding to all classes). Divide the difference between K0 and K10 by 10, multiply it by 1, 2, 3, 4, 5, 6, 7, 8 and 9 respectively, and add K0 each time to obtain K1, K2, K3, K4, K5, K6, K7, K8 and K9. For each value of K (from K0 to K10), repeat steps S7 to S9 and record the resulting classification accuracy. Denote the K value giving the highest accuracy as Kn; take Kn−1 and Kn+1 as the new K0 and K10 respectively, compute new K1 through K9 by the method above, and again repeat steps S7 to S9 and record the classification results. Continue iterating in this way until the best classification accuracy no longer improves, and take the resulting K as the optimal value for subsequent experiments. The classification accuracy is considered not to improve when the difference between successively obtained accuracies is less than a set threshold.
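The coarse-to-fine schedule above can be sketched as a generic search. This is a simplified sketch: `evaluate` stands in for running steps S7–S9 and returning the accuracy, and a fixed number of rounds replaces the stop-when-no-improvement test:

```python
def tune_threshold(evaluate, k_lo, k_hi, rounds=3):
    """Coarse-to-fine search over the threshold K.

    Each round splits [k_lo, k_hi] into 10 equal steps, keeps the best
    candidate, and narrows the interval to that candidate's neighbours.
    """
    best_k = k_lo
    for _ in range(rounds):
        step = (k_hi - k_lo) / 10.0
        candidates = [k_lo + i * step for i in range(11)]
        best_k = max(candidates, key=evaluate)       # K with the highest accuracy
        i = candidates.index(best_k)
        k_lo = candidates[max(i - 1, 0)]             # K(n-1) becomes the new K0
        k_hi = candidates[min(i + 1, 10)]            # K(n+1) becomes the new K10
    return best_k

# Toy accuracy surface peaking at K = 0.37 (illustrative only):
best = tune_threshold(lambda k: -(k - 0.37) ** 2, 0.0, 1.0)
print(best)   # converges near 0.37
```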
As a further preferred embodiment, the classifier parameters to be tuned also include the set threshold num_th. During training of the SVM classifier, step S7 is computed with several set thresholds and step S9 is then executed, yielding several SVM classifiers; the one with the highest classification accuracy is selected to classify the classification data set and obtain the commodity matching result.
The tuning of num_th is similar to that of the parameter K and may adopt the following steps: for each class, count the number of words in the TF-IDF feature vector obtained in step S6 whose value is greater than or equal to the optimal threshold K, and take the lowest such count as num_th; repeat steps S7 to S9 and record the overall classification accuracy. Then add 10 to num_th as its new value and continue iterating until the overall accuracy no longer improves appreciably. Next subtract 10 from num_th and repeatedly add 1 as the new value for the next round of iteration, until the overall accuracy again no longer improves appreciably; the resulting num_th is taken as the optimal set threshold for subsequent calculations.
In addition, according to the classification effect on a particular class, num_th may continue to be tuned in the same way until the classification accuracy for that class no longer improves appreciably.
As a further preferred embodiment, in step S8 the elements of the text feature vector are first squared and then multiplied by a parameter λ greater than 1, so that the larger values of the text feature vector are enhanced while the smaller values (including nonzero ones) are suppressed, giving the optimal feature expression effect.
The classifier parameters to be tuned also include the parameter λ, whose tuning may adopt the following steps:
Set the initial value of λ to 1, repeat steps S7 to S9 and record the resulting classification accuracy. Multiply λ by 10 as its new value and continue iterating until the classification accuracy no longer improves appreciably. Then divide λ by 10 and repeatedly multiply it by 3 as the new value for each subsequent iteration, until the accuracy again no longer improves appreciably. Finally divide λ by 3 and repeatedly add 1 as the new value for each subsequent iteration, until the accuracy no longer improves appreciably; the resulting λ is taken as the optimal value for subsequent calculations.
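The λ schedule can be sketched as a hill climb over the three step rules. This is a simplified sketch: `evaluate` stands in for running steps S7–S9 and returning the accuracy, and the toy accuracy surface is invented:

```python
def tune_lambda(evaluate, max_iter=50):
    """Search for lambda following the schedule above: multiply by 10
    until accuracy stops improving, back off and refine with x3 steps,
    then back off again and refine with +1 steps."""
    def climb(lam, advance):
        best, best_acc = lam, evaluate(lam)
        for _ in range(max_iter):
            nxt = advance(best)
            acc = evaluate(nxt)
            if acc <= best_acc:          # no improvement: stop this phase
                return best
            best, best_acc = nxt, acc
        return best

    lam = climb(1.0, lambda x: x * 10)      # coarse: 1, 10, 100, ...
    lam = climb(lam / 10, lambda x: x * 3)  # medium: back off, x3 steps
    lam = climb(lam / 3, lambda x: x + 1)   # fine: back off, +1 steps
    return lam

# Toy accuracy surface peaking at lambda = 45 (illustrative only):
best = tune_lambda(lambda lam: -abs(lam - 45.0))
print(best)   # 45.0
```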
After parameter tuning, the SVM classifier is saved with the final parameters: steps S7 to S9 are executed using the optimal parameter values obtained in the tuning steps, and the resulting SVM classifier model parameters are stored for classifying the subsequent overall data set, yielding the optimal commodity matching result.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A commodity matching method based on big data is characterized by comprising the following steps:
acquiring commodity data, wherein the commodity data comprises first text data and video data;
acquiring second text data corresponding to audio data in the video data, and constructing a classification data set according to the first text data and the second text data;
using partial data in the classification data set as training data of an SVM classifier to perform classification labeling and Chinese word segmentation, and dividing each type of data in the training data into a training set and a test set;
aiming at each type of data in the training set, combining all the first text data and all the second text data into third text data, calculating key words of the third text data and TF-IDF values corresponding to the key words, and sequencing the key words in a descending manner according to the TF-IDF values to obtain corresponding word lists and IDF characteristic vectors corresponding to the word lists;
aiming at each type of data in a training set, combining all first text data with at least one of second text data respectively to form a plurality of fourth text data, calculating TF feature matrices of the plurality of fourth text data, and extracting TF feature vectors from the TF feature matrices according to a preset mode;
obtaining TF-IDF characteristic vectors corresponding to the word list according to the IDF characteristic vectors and the TF characteristic vectors of each type of data in the training set;
establishing a keyword list and a feature dictionary according to the word list of each type of data in the training set and the corresponding TF-IDF feature vector;
mapping the first text data and the second text data in the training set to a keyword list to obtain a text characteristic vector;
and training an SVM classifier by using the text feature vectors generated by the data corresponding to the training set and the test set, and classifying the classified data set according to the finally obtained SVM classifier so as to obtain a commodity matching result.
2. The commodity matching method based on big data according to claim 1, wherein: the step of performing classification labeling and Chinese word segmentation by using partial data in the classification data set as training data of the SVM classifier comprises
Labeling the corresponding second text data according to the video data type, wherein:
second text data corresponding to the first video data type is marked as fifth text data, and the first video data corresponds to a video of a single product;
second text data corresponding to a second video data type is marked as sixth text data, and the second video data corresponds to a single product type or a video of a single product brand;
and marking second text data corresponding to a third video data type as seventh text data, wherein the third video data corresponds to videos of a plurality of product categories and product brands.
3. The commodity matching method based on big data according to claim 2, wherein: the combining, for each class of data in the training set, all the first text data with at least one of the second text data, as a plurality of fourth text data, specifically includes
For each type of data in the training set, all the first text data are combined with each of the sixth text data as a plurality of fourth text data, and all the first text data are combined with each of the seventh text data as a plurality of fourth text data.
4. The commodity matching method based on big data according to claim 2, wherein: the combining, for each class of data in the training set, all the first text data with at least one of the second text data, as a plurality of fourth text data, specifically includes
For each type of data in the training set, all of the first text data and all of the fifth text data are combined with each of the sixth text data as a plurality of fourth text data, and all of the first text data and all of the fifth text data are combined with each of the seventh text data as a plurality of fourth text data.
5. The commodity matching method based on big data according to claim 1, wherein: the combining, for each class of data in the training set, all the first text data with at least one of the second text data, as a plurality of fourth text data, specifically includes
And for each type of data in the training set, combining all the first text data with each second text data respectively to form a plurality of fourth text data.
6. The big data based commodity matching method according to any one of claims 1-5, wherein: the extracting TF feature vectors from the TF feature matrix according to a preset mode specifically comprises
And arranging the elements of each column in the TF feature matrix according to the numerical value in a descending order, and extracting diagonal elements in the TF feature matrix to serve as TF feature vectors.
7. The big data based commodity matching method according to any one of claims 1-5, wherein: the step of establishing a keyword list and a feature dictionary according to the word list of each type of data in the training set and the corresponding TF-IDF feature vector comprises the following steps of
And adding the word with the TF-IDF value larger than the set threshold value in the word list into the keyword list, and adding the word and the TF-IDF value corresponding to the word into the feature dictionary.
8. The big-data-based commodity matching method according to claim 7, wherein: in the training process of the SVM classifier, a plurality of set thresholds are adopted for calculation to obtain a plurality of SVM classifiers, and the SVM classifier with the highest classification accuracy is selected for classifying the classification data set so as to obtain a commodity matching result.
9. The big data based commodity matching method according to any one of claims 1-5, wherein: the step of mapping the first text data and the second text data in the training set to the keyword list to obtain the text characteristic vector comprises
And establishing an initialization vector with the same length as the keyword list for each piece of first text data and second text data in the training set, and assigning the corresponding element value in the vector to the value of the corresponding keyword in the feature dictionary to obtain the text feature vector.
10. The big-data-based commodity matching method according to claim 9, wherein: and performing power calculation on the elements in the text feature vector to obtain an enhanced text feature vector.
CN202110074878.XA 2021-01-20 2021-01-20 Commodity matching method based on big data Active CN113821631B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110074878.XA CN113821631B (en) 2021-01-20 2021-01-20 Commodity matching method based on big data


Publications (2)

Publication Number Publication Date
CN113821631A true CN113821631A (en) 2021-12-21
CN113821631B CN113821631B (en) 2022-04-22

Family

ID=78912379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110074878.XA Active CN113821631B (en) 2021-01-20 2021-01-20 Commodity matching method based on big data

Country Status (1)

Country Link
CN (1) CN113821631B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116049741A (en) * 2023-04-03 2023-05-02 欧瑞科斯科技产业(集团)有限公司 Method and device for quickly identifying commodity classification codes, electronic equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104142918A (en) * 2014-07-31 2014-11-12 天津大学 Short text clustering and hotspot theme extraction method based on TF-IDF characteristics
CN106528642A (en) * 2016-10-13 2017-03-22 广东广业开元科技有限公司 TF-IDF feature extraction based short text classification method
US20190087490A1 (en) * 2016-05-25 2019-03-21 Huawei Technologies Co., Ltd. Text classification method and apparatus
CN110188199A (en) * 2019-05-21 2019-08-30 北京鸿联九五信息产业有限公司 A kind of file classification method for intelligent sound interaction



Also Published As

Publication number Publication date
CN113821631B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN112214685A (en) Knowledge graph-based personalized recommendation method
CN107944035B (en) Image recommendation method integrating visual features and user scores
CN112417306B (en) Method for optimizing performance of recommendation algorithm based on knowledge graph
CN108897784A (en) One emergency event dimensional analytic system based on social media
CN112231583B (en) E-commerce recommendation method based on dynamic interest group identification and generation of confrontation network
Gu et al. Data driven webpage color design
CN113821631B (en) Commodity matching method based on big data
Guo et al. Multi-modal representation learning for short video understanding and recommendation
Deng et al. Leveraging image visual features in content-based recommender system
Kumar et al. Cuisine prediction based on ingredients using tree boosting algorithms
Li et al. Short text sentiment analysis based on convolutional neural network
US20220100792A1 (en) Method, device, and program for retrieving image data by using deep learning algorithm
Hu et al. Wshe: user feedback-based weighted signed heterogeneous information network embedding
Jiang et al. An Application of SVD++ Method in Collaborative Filtering
Varasteh et al. An Improved Hybrid Recommender System: Integrating Document Context-Based and Behavior-Based Methods
CN106844577A (en) User's similarity calculating method based on sequential entropy in Collaborative Filtering Recommendation System
CN113254688A (en) Trademark retrieval method based on deep hash
Huo et al. Collaborative filtering fusing label features based on SDAE
CN112464098A (en) Recommendation system article prediction method based on similarity pairwise ranking
CN114022233A (en) Novel commodity recommendation method
CN112434128A (en) Question-answer text attribute classification method based on hierarchical matching attention mechanism
Li et al. SocialST: Social liveness and trust enhancement based social recommendation
Liu et al. Research on feature dimensionality reduction in content based public cultural video retrieval
KR101985603B1 (en) Recommendation method based on tripartite graph
Yan et al. Joint Image-text Representation Learning for Fashion Retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant