CN113761882A - Dictionary construction method and device - Google Patents


Info

Publication number
CN113761882A
Authority
CN
China
Prior art keywords
commodity
probability
clause
word
belongs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010513089.7A
Other languages
Chinese (zh)
Inventor
李浩然
袁鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202010513089.7A
Publication of CN113761882A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/237: Lexical tools
    • G06F 40/242: Dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data; Database structures therefor; File system structures therefor
    • G06F 16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/216: Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a dictionary construction method and device, and relates to the technical field of computers. One embodiment of the method comprises: dividing sentences in a target text into one or more clauses according to punctuation marks; using a text classification model trained with a semi-supervised learning algorithm to predict a first probability that a clause in the target text belongs to a commodity element contained in a pre-constructed commodity element dictionary; if the first probability that the clause belongs to the commodity element is greater than a first threshold probability, calculating a second probability that a word in the clause, other than the element words the commodity element currently contains, belongs to the commodity element; and if the second probability is greater than a second threshold probability, adding the word to the commodity element dictionary as an element word of the commodity element. This embodiment expands the dictionary automatically and improves the efficiency of dictionary construction.

Description

Dictionary construction method and device
Technical Field
The invention relates to the technical field of computers, in particular to a dictionary construction method and device.
Background
In the field of electronic commerce, in order to help a user quickly understand the performance of a commodity and stimulate the desire to purchase, a commodity abstract covering the commodity elements is often generated for the user from the detailed information of the commodity, where a commodity element is a word describing an aspect of the commodity's performance. When generating the commodity abstract, the commodity elements included in the detailed information of the commodity must first be determined based on the element words contained in a pre-constructed commodity element dictionary, and the commodity abstract covering those elements is then generated for the user. For example, a mobile phone may have multiple commodity elements such as "screen" and "battery", and the commodity element "screen" may correspond to multiple element words such as "screen size", "resolution", "full screen" and "curved screen".
Therefore, the completeness and accuracy of the constructed commodity element dictionary are critical to determining the commodity elements contained in the detailed commodity information, and in turn affect the quality of the generated commodity abstract. At present, commodity element dictionaries are constructed by manual labeling. However, because the number of commodity description texts is huge, manual labeling is inefficient, which limits the construction efficiency of the dictionary; in addition, different annotators understand the commodity elements differently, so labeling consistency is poor, accuracy and coverage are low, and the constructed commodity element dictionary performs poorly in practical application.
Disclosure of Invention
In view of this, embodiments of the present invention provide a dictionary construction method and apparatus, which can use a text classification model constructed based on a semi-supervised learning algorithm to automatically expand and update a commodity element dictionary, so that the constructed commodity element dictionary is more complete and accurate.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a dictionary construction method including:
dividing sentences in the target text into one or more clauses according to punctuation marks;
predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary by using a text classification model trained based on a semi-supervised learning algorithm;
calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability;
adding the word as an element word of the commodity element to the commodity element dictionary in a case where a second probability that the word belongs to the commodity element is greater than a second threshold probability.
Optionally, training the text classification model based on the semi-supervised learning algorithm includes:
obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element;
inputting the training data into the text classification model to calculate a current loss function of the text classification model according to a first probability that the clause belongs to the commodity element, and optimizing the text classification model according to the current loss function;
predicting a third probability that a clause not containing the commodity element belongs to the commodity element by using the optimized text classification model;
and under the condition that the third probability is greater than the first threshold probability, adding the clauses and the third probability corresponding to the clauses to the training data so as to continuously optimize the text classification model.
Optionally, the obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element, includes:
judging whether the clauses contain element words of the commodity elements contained in the commodity element dictionary;
if the clause contains the element words of the commodity elements, the first probability is a first value;
and if the clause does not contain the element words of the commodity elements and other clauses adjacent to the clause in the same sentence contain the element words of the commodity elements, the first probability is a second value.
Optionally, the calculating, when a first probability that the clause is attributed to the commodity element is greater than a first threshold probability, a second probability that a word other than an element word currently included in the commodity element in the clause is attributed to the commodity element includes:
determining the probability that the word belongs to the commodity element in different clauses according to the probability that the clauses belong to the commodity element;
calculating an average of the probabilities that the word belongs to the commodity element in the different clauses to obtain the second probability.
Optionally, in a case that a first probability that the clause belongs to the commodity element is greater than a first threshold probability, further comprising:
calculating the occurrence frequency of the words in all the clauses belonging to the same commodity element;
determining one or more words according to the sequence of the appearance frequency of the words from high to low so as to add the words as element words of the commodity elements to the commodity element dictionary.
Optionally, the method further comprises:
before calculating a second probability that words except element words currently contained in the commodity element in the clause belong to the commodity element, judging whether the words belong to a forbidden element word list;
in the case where the word does not belong to the list of forbidden element words, calculating a second probability that the word belongs to the commodity element.
Optionally, the method further comprises:
before calculating a second probability that words except element words currently contained in the commodity element in the clause belong to the commodity element, judging whether the words belong to stop words or not;
in the event that the word does not belong to a stop word, calculating a second probability that the word belongs to the commodity element.
To achieve the above object, according to another aspect of an embodiment of the present invention, there is provided a dictionary construction apparatus including: a clause acquisition module, a first probability prediction module, a second probability calculation module and a dictionary expansion module; wherein,
the clause acquisition module is used for dividing sentences in the target text into one or more clauses according to punctuation marks;
the first probability prediction module is used for predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary by using a text classification model trained based on a semi-supervised learning algorithm;
the second probability calculation module is configured to calculate a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability;
the dictionary expansion module is configured to add the word as a component word of the commodity component to the commodity component dictionary when a second probability that the word belongs to the commodity component is greater than a second threshold probability.
Optionally, the method further comprises: a classification model training module; wherein the classification model training module is used for,
obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element;
inputting the training data into the text classification model to calculate a current loss function of the text classification model according to a first probability that the clause belongs to the commodity element, and optimizing the text classification model according to the current loss function;
predicting a third probability that a clause not containing the commodity element belongs to the commodity element by using the optimized text classification model;
and under the condition that the third probability is greater than the first threshold probability, adding the clauses and the third probability corresponding to the clauses to the training data so as to continuously optimize the text classification model.
Optionally, the obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element, includes:
judging whether the clauses contain element words of the commodity elements contained in the commodity element dictionary;
if the clause contains the element words of the commodity elements, the first probability is a first value;
and if the clause does not contain the element words of the commodity elements and other clauses adjacent to the clause in the same sentence contain the element words of the commodity elements, the first probability is a second value.
Optionally, the calculating, when a first probability that the clause is attributed to the commodity element is greater than a first threshold probability, a second probability that a word other than an element word currently included in the commodity element in the clause is attributed to the commodity element includes:
determining the probability that the word belongs to the commodity element in different clauses according to the probability that the clauses belong to the commodity element;
calculating an average of the probabilities that the word belongs to the commodity element in the different clauses to obtain the second probability.
Optionally, in a case that a first probability that the clause belongs to the commodity element is greater than a first threshold probability, further comprising:
calculating the occurrence frequency of the words in all the clauses belonging to the same commodity element;
determining one or more words according to the sequence of the appearance frequency of the words from high to low so as to add the words as element words of the commodity elements to the commodity element dictionary.
Optionally, the second probability calculation module is further configured to,
before calculating a second probability that words except element words currently contained in the commodity element in the clause belong to the commodity element, judging whether the words belong to a forbidden element word list;
in the case where the word does not belong to the list of forbidden element words, calculating a second probability that the word belongs to the commodity element.
Optionally, the second probability calculation module is further configured to,
before calculating a second probability that words except element words currently contained in the commodity element in the clause belong to the commodity element, judging whether the words belong to stop words or not;
in the event that the word does not belong to a stop word, calculating a second probability that the word belongs to the commodity element.
To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided a dictionary construction electronic device including:
one or more processors;
a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any one of the dictionary construction methods described above.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing any one of the dictionary construction methods described above.
One embodiment of the above invention has the following advantages or benefits: sentences in the target text are divided into one or more clauses according to punctuation marks; a text classification model trained with a semi-supervised learning algorithm predicts a first probability that a clause in the target text belongs to a commodity element contained in a pre-constructed commodity element dictionary; if the first probability that the clause belongs to the commodity element is greater than a first threshold probability, a second probability is calculated that a word in the clause, other than the element words the commodity element currently contains, belongs to the commodity element; and if the second probability is greater than a second threshold probability, the word is added to the commodity element dictionary as an element word of the commodity element. In this way the commodity element dictionary is expanded automatically, without large-scale manual labeling, which improves the efficiency of constructing the dictionary, avoids the consistency problems caused by manual labeling, and improves the quality of the dictionary.
Further effects of the above alternative embodiments will be described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a main flow of a dictionary construction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a main flow of a training method of a text classification model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the main modules of a dictionary construction apparatus according to an embodiment of the present invention;
FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 5 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of a main flow of a dictionary construction method according to an embodiment of the present invention, and as shown in fig. 1, the dictionary construction method may specifically include the following steps:
step S101, dividing the sentences in the target text into one or more clauses according to punctuation marks.
Specifically, take the sentence "the mobile phone adopts a full-screen, the screen occupation ratio is up to 90%, and the screen brightness is high." as an example. The punctuation marks "," and "." divide the sentence into the following three clauses:
clause 1: the mobile phone adopts a comprehensive screen
Clause 2: the screen accounts for up to 90 percent
Clause 3: high screen brightness
Step S102, a text classification model trained based on a semi-supervised learning algorithm is used for predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary.
Semi-supervised learning algorithms that may be employed include, but are not limited to, generative semi-supervised models, self-training, co-training, semi-supervised support vector machines (S3VMs), the BERT (Bidirectional Encoder Representations from Transformers) algorithm, and the like.
The commodity element dictionary comprises one or more commodity types, one or more commodity elements corresponding to each commodity type, and the element words corresponding to each commodity element. The pre-constructed commodity element dictionary is completed by manual labeling: one or more commodity elements, and one or more element words for each, can first be defined for a commodity type based on practical experience, and further element words are then added while labeling clauses in commodity description texts. The process need not be limited to the predefined commodity elements; new commodity elements can be added, or unreasonable existing ones deleted, according to the actual situation. For example, when labeling the two clauses "the mobile phone adopts a ceramic body" and "feels warm" from the same sentence of a commodity description text, it can be seen that "ceramic" can serve as an element word of the commodity element "material", while "warm" describes how the mobile phone feels to the touch; since no such element exists among the predefined commodity elements, a new commodity element "feel" can be added to the commodity element dictionary, with corresponding element words including "warm" and "feel". An example of a pre-constructed commodity element dictionary is shown in table 1 below.
TABLE 1 Commodity element dictionary example
[Table 1 is rendered as an image in the original publication and is not reproduced here.]
In an alternative embodiment, training the text classification model based on the semi-supervised learning algorithm includes: obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element; inputting the training data into the text classification model to calculate a current loss function of the text classification model according to a first probability that the clause belongs to the commodity element, and optimizing the text classification model according to the current loss function; predicting a third probability that a clause not containing the commodity element belongs to the commodity element by using the optimized text classification model; and under the condition that the third probability is greater than the first threshold probability, adding the clauses and the third probability corresponding to the clauses to the training data so as to continuously optimize the text classification model.
That is to say, in training the classification model, a small amount of manually labeled training data can be used first, and the probability that each clause belongs to the commodity element is used to weight that sample's contribution to the loss function, yielding a more accurate and realistic loss and thereby optimizing the text classification model. On this basis, the text classification model predicts the probability that clauses not containing the commodity element belong to it; whenever a predicted third probability exceeds the first threshold probability, the clause and its corresponding third probability are added to the training data, continuously expanding the training data and further optimizing the text classification model.
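The self-training loop described above can be sketched as follows. `KeywordModel` is a deliberately toy stand-in for the text classification model (a real implementation would be one of the semi-supervised models listed in step S102), and the threshold, seed data and class names are illustrative assumptions:

```python
class KeywordModel:
    """Toy stand-in for the text classification model: scores a clause by
    the fraction of its words that are known element-related keywords."""
    def __init__(self, keywords):
        self.keywords = set(keywords)
    def fit(self, labeled):
        # A real model would update parameters from all samples, weighted
        # by their probabilities; the toy model just absorbs words from
        # fully confident (probability 1.0) clauses.
        for clause, p in labeled:
            if p >= 1.0:
                self.keywords.update(clause.split())
    def predict_proba(self, clause):
        words = clause.split()
        hits = sum(w in self.keywords for w in words)
        return hits / len(words) if words else 0.0

def self_train(model, labeled, unlabeled, first_threshold=0.6, rounds=3):
    """Fit on labeled data, then promote confidently classified unlabeled
    clauses (third probability above the threshold) into the training set."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model.fit(labeled)
        promoted = []
        for clause in unlabeled:
            p = model.predict_proba(clause)   # the "third probability"
            if p > first_threshold:
                labeled.append((clause, p))   # clause + its probability
                promoted.append(clause)
        unlabeled = [c for c in unlabeled if c not in promoted]
        if not promoted:
            break                             # nothing new; stop early
    return labeled

model = KeywordModel({"screen"})
seed = [("full screen phone", 1.0)]          # manually labeled seed data
pool = ["full screen display", "battery life"]
result = self_train(model, seed, pool)
```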
Still further, the obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element, comprises: judging whether the clauses contain element words of the commodity elements contained in the commodity element dictionary; if the clause contains the element words of the commodity elements, the first probability is a first value; and if the clause does not contain the element words of the commodity elements and other clauses adjacent to the clause in the same sentence contain the element words of the commodity elements, the first probability is a second value. The first value and the second value are preset according to actual requirements, generally, the second value is smaller than the first value, for example, the first value is 1, and the second value is 0.8.
Specifically, referring to table 1, take the sentence "the mobile phone adopts a full-screen, the screen occupation ratio is up to 90%, and the screen brightness is high." as an example; it comprises the following three clauses:
clause 1: the mobile phone adopts a comprehensive screen
Clause 2: the screen accounts for up to 90 percent
Clause 3: high screen brightness
Since clause 1 contains the element word "full screen" of the commodity element "screen", the first probability that clause 1 belongs to the commodity element "screen" is determined to be the first value (for example, 1); clause 3 contains the element word "screen", so the first probability that clause 3 belongs to the commodity element "screen" is likewise the first value. Clause 2 does not contain any element word included in the commodity element dictionary, but since clauses 1 and 3, which are adjacent to clause 2 in the same sentence, both contain element words of the commodity element "screen", clause 2 is highly likely to be related to the commodity element "screen", and its first probability is the second value (for example, 0.8). In this way, clauses in a small number of commodity description texts can be labeled based on the commodity element dictionary, and the corresponding first probabilities determined, to obtain training data. It should be noted that, when acquiring the training data, only clauses that contain element words, or that have at least one adjacent clause in the same sentence containing element words, are considered; for other clauses it cannot be determined whether they relate to any commodity element, so they are not used in this step.
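The weak-labeling rule just described can be sketched as follows; the values 1.0 and 0.8 for the first and second values, the substring-based element-word test, and the example clauses are illustrative assumptions:

```python
FIRST_VALUE = 1.0    # clause itself contains an element word
SECOND_VALUE = 0.8   # only adjacent clauses contain element words

def label_clause(clauses, idx, element_words):
    """First probability of clauses[idx] for one commodity element,
    following the labeling rules in the embodiment above."""
    def contains(c):
        # Simplified substring test; real matching may need tokenization.
        return any(w in c for w in element_words)
    if contains(clauses[idx]):
        return FIRST_VALUE
    neighbors = clauses[max(0, idx - 1):idx] + clauses[idx + 1:idx + 2]
    if any(contains(n) for n in neighbors):
        return SECOND_VALUE
    return None  # cannot be determined; excluded from training data

# Middle clause carries no element word (as in the original-language example).
clauses = ["adopts a full-screen", "occupation ratio up to 90%",
           "high screen brightness"]
screen_words = {"full-screen", "screen"}
labels = [label_clause(clauses, i, screen_words) for i in range(3)]
```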
Step S103, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability, calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element.
The first threshold probability is any value in the interval 0-1, such as 0.8 or 0.7, set according to practical experience. Specifically, the text classification model may predict, for a clause, a first probability of belonging to each commodity element included in the commodity element dictionary, where the first probabilities of the same clause over the different commodity elements sum to 1. On this basis, the commodity element to which the clause belongs can be determined from the maximum of these first probabilities; it is then judged whether this first probability is greater than the first threshold probability. If so, the attribution of the clause to the commodity element is credible; if not, it is not credible. Only when the attribution is credible is the second probability that a word in the clause belongs to the commodity element calculated.
In an alternative embodiment, the calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element when a first probability that the clause belongs to the commodity element is greater than a first threshold probability includes: determining the probability that the word belongs to the commodity element in different clauses according to the probability that the clauses belong to the commodity element; calculating an average of the probabilities that the word belongs to the commodity element in the different clauses to obtain the second probability.
Specifically, take as an example a first probability of 0.8 that the clause A belongs to the commodity element a: the probability that any word X in the clause A, other than the element words currently included in the commodity element, belongs to the commodity element is likewise determined to be 0.8. In practice this may also be scaled by a preset weight (for example, 0.8) applied to the first probability, in which case the probability that the word X belongs to the commodity element a is 0.64. On this basis, after calculating the probability that the same word X belongs to the commodity element a in each of the different clauses, the average of these probabilities is calculated as the second probability that the word X belongs to the commodity element.
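The averaging step can be sketched as follows; the function and variable names are illustrative, and the optional `weight` parameter reflects the preset-weight variant mentioned above:

```python
from collections import defaultdict

def second_probabilities(clause_probs, known_element_words, weight=1.0):
    """clause_probs: (clause_tokens, first_probability) pairs that were all
    attributed to the same commodity element. Each candidate word inherits
    the clause probability (optionally scaled by `weight`); its second
    probability is the average over the clauses it appears in."""
    sums, counts = defaultdict(float), defaultdict(int)
    for tokens, p in clause_probs:
        for w in set(tokens):
            if w in known_element_words:
                continue  # already an element word of this element
            sums[w] += p * weight
            counts[w] += 1
    return {w: sums[w] / counts[w] for w in sums}

probs = second_probabilities(
    [(["curved", "screen", "design"], 0.8),
     (["curved", "edge"], 0.9)],
    known_element_words={"screen"})
# "curved" appears in both clauses, so its second probability is the
# average (0.8 + 0.9) / 2 = 0.85.
```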
In an optional embodiment, in a case that the first probability that the clause is attributed to the commodity element is greater than a first threshold probability, the method further includes: calculating the occurrence frequency of the words in all the clauses belonging to the same commodity element; and determining one or more words according to the sequence of the appearance frequency of the words from high to low so as to add the words serving as element words of the commodity elements to the commodity element dictionary to realize the expansion of the commodity element dictionary.
Specifically, suppose the first probability that the clause A belongs to the commodity element a is 0.8, that of clause B is 0.7, that of clause C is 0.75, and that of clause D is 0.9. Count the occurrence frequencies of all words (say, word 1, word 2 and word 3) other than the element words currently contained in the commodity element across the clauses A, B, C and D; if the occurrence frequencies of word 1, word 2 and word 3 are 5, 4 and 3 respectively, the two most frequent words, word 1 and word 2, can be added to the commodity element dictionary as element words of the commodity element a, further expanding the commodity element dictionary.
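The frequency-based selection can be sketched as follows (names and example data are illustrative):

```python
from collections import Counter

def top_frequency_words(clauses_tokens, known_element_words, k=2):
    """Count word frequencies across all clauses attributed to the same
    commodity element and return the k most frequent candidate words."""
    counts = Counter(w for tokens in clauses_tokens for w in tokens
                     if w not in known_element_words)
    return [w for w, _ in counts.most_common(k)]

clauses_tokens = [["word1", "word2", "word1"],
                  ["word1", "word3", "word2"]]
top_words = top_frequency_words(clauses_tokens, set(), k=2)
# word1 occurs 3 times and word2 twice, so they are selected.
```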
In an alternative embodiment, before calculating a second probability that a word other than the element word currently contained in the commodity element in the clause belongs to the commodity element, determining whether the word belongs to a forbidden element word list; in the case where the word does not belong to the list of forbidden element words, calculating a second probability that the word belongs to the commodity element.
That is, when a forbidden element word list has been constructed according to actual requirements, the words in a clause can be pre-screened against it, reducing the number of words for which the second probability must be calculated and thereby improving computational efficiency.
In an alternative embodiment, before calculating a second probability that a word other than the element word currently contained in the commodity element in the clause belongs to the commodity element, judging whether the word belongs to a stop word; in the event that the word does not belong to a stop word, calculating a second probability that the word belongs to the commodity element.
Stop words are words that are automatically filtered out before or after the processing of natural language data (or text) in information retrieval, in order to save storage space and improve search efficiency; examples include "this" and "that". Filtering out stop words therefore further reduces the number of words for which the second probability must be calculated, improving computational efficiency.
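The two pre-screening steps above (the forbidden element word list and the stop word list) amount to a simple filter applied before any second probability is computed. A minimal sketch, with example word lists that are purely illustrative:

```python
# Illustrative lists only; real deployments would load curated lists.
STOP_WORDS = {"this", "that", "the"}
FORBIDDEN_ELEMENT_WORDS = {"discount"}

def candidate_words(words, stop_words=STOP_WORDS,
                    forbidden=FORBIDDEN_ELEMENT_WORDS):
    """Drop stop words and forbidden element words, so the second
    probability is only computed for the remaining candidates."""
    return [w for w in words if w not in stop_words and w not in forbidden]

cands = candidate_words(["this", "screen", "discount", "bright"])
```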
In step S104, the word is added to the commodity element dictionary as an element word of the commodity element in the case where the second probability that the word belongs to the commodity element is greater than a second threshold probability. Like the first threshold probability, the second threshold probability is any value in the interval [0, 1], such as 0.75 or 0.8, set based on practical experience or requirements.
Based on the embodiment, the sentences in the target text are divided into one or more clauses according to punctuation marks; predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary by using a text classification model trained based on a semi-supervised learning algorithm; calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability; adding the word as an element word of the commodity element to the commodity element dictionary in a case where a second probability that the word belongs to the commodity element is greater than a second threshold probability. Based on the method, the automatic expansion of the commodity element dictionary is realized, a large amount of manual labeling is not needed, the efficiency of constructing the commodity element dictionary is improved, the problem of poor consistency caused by manual labeling is avoided, and the quality of the commodity element dictionary is improved.
Referring to fig. 2, on the basis of the foregoing embodiment, a method for training a text classification model is provided, which may specifically include the following steps:
step S201, obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element.
Wherein the obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element, comprises: judging whether the clauses contain element words of the commodity elements contained in the commodity element dictionary; if the clause contains the element words of the commodity elements, the first probability is a first value; and if the clause does not contain the element words of the commodity elements and other clauses adjacent to the clause in the same sentence contain the element words of the commodity elements, the first probability is a second value. The first value and the second value are set according to actual requirements, generally, the second value is smaller than the first value, for example, the first value is 1, and the second value is 0.8.
Specifically, referring to table 1, take as an example the sentence "This mobile phone uses a full screen, the screen occupation ratio is up to 90%, and the screen brightness is high.", which comprises the following three clauses:
clause 1: the mobile phone adopts a comprehensive screen
Clause 2: the screen accounts for up to 90 percent
Clause 3: high screen brightness
Since clause 1 contains the element word "full screen" of the commodity element "screen", the first probability that clause 1 belongs to the commodity element "screen" is determined to be the first value (1 in this example); likewise, clause 3 contains the element word "screen", so the first probability that clause 3 belongs to the commodity element "screen" is also the first value. Clause 2 contains no element word of any commodity element in the commodity element dictionary, but because both of its adjacent clauses in the same sentence, clauses 1 and 3, contain an element word of the commodity element "screen", clause 2 is judged highly likely to be related to that element, and the first probability that clause 2 belongs to the commodity element "screen" is set to the second value (0.8 in this example). In this way, clauses in a small amount of commodity description text can be labeled based on the commodity element dictionary and the corresponding first-probability values determined, so as to obtain training data. It should be noted that, in acquiring the training data, only clauses that contain an element word, or that have at least one adjacent clause in the same sentence containing an element word, are considered; for other clauses it cannot be determined whether they are related to any commodity element, so they are not labeled in this step.
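The labeling rule illustrated above can be sketched as follows (names, the token-list representation, and the default values are illustrative assumptions):

```python
def label_clauses(sentence_clauses, element_words,
                  first_value=1.0, second_value=0.8):
    """Assign each clause (a list of word tokens) a first probability:
    first_value if it contains an element word, second_value if an
    adjacent clause in the same sentence does, otherwise None
    (the clause is left unlabeled, as in the patent's example)."""
    contains = [bool(set(c) & element_words) for c in sentence_clauses]
    labels = []
    for i in range(len(sentence_clauses)):
        if contains[i]:
            labels.append(first_value)
        elif (i > 0 and contains[i - 1]) or \
             (i + 1 < len(contains) and contains[i + 1]):
            labels.append(second_value)
        else:
            labels.append(None)
    return labels

# Mirrors clauses 1-3: the middle clause has no element word but both
# neighbors do, so it receives the second value.
labels = label_clauses(
    [["phone", "full-screen"], ["ratio", "90%"], ["screen", "bright"]],
    {"full-screen", "screen"},
)
```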
Step S202, inputting the training data into the text classification model, so as to calculate a current loss function of the text classification model according to a first probability that the clause belongs to the commodity element, and optimizing the text classification model according to the current loss function.
Specifically, after the training data are input into the text classification model, the model predicts, for each training sample, a first probability that the sample belongs to each commodity element in the commodity element dictionary. A loss function is then calculated for each clause from the first probability predicted by the model and the first probability recorded for that clause in the training data. If the loss function corresponding to clause i in the training data is li, the current loss function of the text classification model can be calculated according to the following formula:
L=∑Pi*li
wherein, L is a loss function of the text classification model, li is a loss function corresponding to the clause i in the training data, and Pi is a first probability that the clause i in the training data belongs to the commodity element.
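A hedged sketch of this weighted loss: the patent fixes the form L = ∑ Pi*li but not the per-clause loss li, so binary cross-entropy is used here purely as an illustrative choice.

```python
import math

def weighted_loss(samples):
    """samples: list of (P_i, p_pred) pairs, where P_i is the clause's
    first probability in the training data and p_pred is the model's
    predicted probability.  l_i is taken as the cross-entropy between
    the label P_i and the prediction (an assumed choice)."""
    total = 0.0
    for p_label, p_pred in samples:
        li = -(p_label * math.log(p_pred)
               + (1 - p_label) * math.log(1 - p_pred))
        total += p_label * li  # each clause's loss weighted by P_i
    return total

# A clause labeled 1.0 with prediction 0.5 contributes 1.0 * ln(2).
loss = weighted_loss([(1.0, 0.5)])
```

Weighting by Pi means clauses labeled with the lower second value (e.g. 0.8) contribute less to the gradient than clauses that directly contain an element word.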
Step S203, using the optimized text classification model to predict a third probability that a clause not including a commodity element belongs to the commodity element.
Step S204, under the condition that the third probability is larger than the first threshold probability, the clause and the third probability corresponding to the clause are added to the training data so as to continuously optimize the text classification model.
That is, the text classification model is used to predict the probability that a clause not containing any commodity element belongs to a commodity element; when the predicted third probability is greater than the first threshold probability, the clause and its corresponding third probability are added to the training data. The training data are thus continuously expanded, the text classification model can be further optimized on the newly added samples, and by repeating this cycle the model's accuracy improves without requiring large-scale manual labeling.
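Steps S201–S204 form a standard self-training loop. A minimal sketch of one round, using a toy scoring model as a stand-in (the class, method names, and scoring rule are all assumptions, not the patent's model):

```python
class ToyModel:
    """Stand-in classifier (illustrative only): scores a clause by the
    fraction of its words that are known seed words."""
    def __init__(self, seed_words):
        self.seed_words = set(seed_words)

    def predict_proba(self, clause):
        words = clause.split()
        return sum(w in self.seed_words for w in words) / max(len(words), 1)

    def fit(self, data):
        pass  # a real model would retrain on the expanded data here

def self_training_round(model, labeled, unlabeled, threshold):
    """One round of the semi-supervised loop: score unlabeled clauses,
    promote confident ones (third probability > threshold) into the
    training data, and refit the model."""
    scored = [(c, model.predict_proba(c)) for c in unlabeled]
    promoted = [(c, p) for c, p in scored if p > threshold]
    labeled = labeled + promoted
    model.fit(labeled)
    promoted_set = {c for c, _ in promoted}
    remaining = [c for c in unlabeled if c not in promoted_set]
    return model, labeled, remaining

model = ToyModel({"screen"})
model, labeled, remaining = self_training_round(
    model, [("full screen phone", 1.0)],
    ["screen bright", "big battery"], threshold=0.4)
```

"screen bright" scores 0.5 and is promoted into the training data; "big battery" scores 0.0 and stays unlabeled for a later round.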
Referring to fig. 3, on the basis of the above embodiment, an embodiment of the present invention provides a dictionary construction apparatus 300, including: a clause acquisition module 301, a first probability prediction module 303, a second probability calculation module 304, and a dictionary expansion module 305; wherein:
the clause acquiring module 301 is configured to divide a sentence in a target text into one or more clauses according to punctuation marks;
the first probability prediction module 303 is configured to predict a first probability that a clause in the target text belongs to a commodity element included in a pre-constructed commodity element dictionary, using a text classification model trained based on a semi-supervised learning algorithm;
the second probability calculation module 304 is configured to calculate a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability;
the dictionary expansion module 305 is configured to add the word as the element word of the commodity element to the commodity element dictionary if a second probability that the word belongs to the commodity element is greater than a second threshold probability.
In an optional embodiment, the method further comprises: a classification model training module 302; wherein the classification model training module 302 is configured to,
obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element;
inputting the training data into the text classification model to calculate a current loss function of the text classification model according to a first probability that the clause belongs to the commodity element, and optimizing the text classification model according to the current loss function;
predicting a third probability that a clause not containing the commodity element belongs to the commodity element by using the optimized text classification model;
and under the condition that the third probability is greater than the first threshold probability, adding the clauses and the third probability corresponding to the clauses to the training data so as to continuously optimize the text classification model.
In an alternative embodiment, the obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element, includes:
judging whether the clauses contain element words of the commodity elements contained in the commodity element dictionary;
if the clause contains the element words of the commodity elements, the first probability is a first value;
and if the clause does not contain the element words of the commodity elements and other clauses adjacent to the clause in the same sentence contain the element words of the commodity elements, the first probability is a second value.
In an alternative embodiment, the calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element when a first probability that the clause belongs to the commodity element is greater than a first threshold probability includes:
determining the probability that the word belongs to the commodity element in different clauses according to the probability that the clauses belong to the commodity element;
calculating an average of the probabilities that the word belongs to the commodity element in the different clauses to obtain the second probability.
In an optional embodiment, in a case that the first probability that the clause is attributed to the commodity element is greater than a first threshold probability, the method further includes:
calculating the occurrence frequency of the words in all the clauses belonging to the same commodity element;
determining one or more words according to the sequence of the appearance frequency of the words from high to low so as to add the words as element words of the commodity elements to the commodity element dictionary.
In an alternative embodiment, the second probability calculation module 304 is further configured to,
before calculating a second probability that words except element words currently contained in the commodity element in the clause belong to the commodity element, judging whether the words belong to a forbidden element word list;
in the case where the word does not belong to the list of forbidden element words, calculating a second probability that the word belongs to the commodity element.
In an alternative embodiment, the second probability calculation module 304 is further configured to,
before calculating a second probability that words except element words currently contained in the commodity element in the clause belong to the commodity element, judging whether the words belong to stop words or not;
in the event that the word does not belong to a stop word, calculating a second probability that the word belongs to the commodity element.
Fig. 4 shows an exemplary system architecture 400 to which the dictionary construction method or the dictionary construction apparatus of the embodiment of the present invention can be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. Network 404 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 401, 402, 403 to interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have various communication client applications installed thereon, such as a shopping application, a web browser application, a search application, and the like.
The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 405 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 401, 402, and 403. The background management server can analyze and process the received data such as the product information query request and the like so as to obtain one or more clauses and the like in the commodity description text.
It should be noted that the dictionary construction method provided in the embodiment of the present invention is generally executed by the server 405, and accordingly, the dictionary construction device is generally provided in the server 405.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a module. The names of these modules do not in some cases constitute a limitation to the modules themselves, and for example, the sending unit may also be described as a "module that sends a picture acquisition request to a connected server".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: dividing sentences in the target text into one or more clauses according to punctuation marks; predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary by using a text classification model trained based on a semi-supervised learning algorithm; calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability; adding the word as an element word of the commodity element to the commodity element dictionary in a case where a second probability that the word belongs to the commodity element is greater than a second threshold probability.
According to the technical scheme of the embodiment of the invention, the sentences in the target text are divided into one or more clauses according to punctuation marks; predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary by using a text classification model trained based on a semi-supervised learning algorithm; calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability; adding the word as an element word of the commodity element to the commodity element dictionary in a case where a second probability that the word belongs to the commodity element is greater than a second threshold probability. Based on the method, the automatic expansion of the commodity element dictionary is realized, a large amount of manual labeling is not needed, the efficiency of constructing the commodity element dictionary is improved, the problem of poor consistency caused by manual labeling is avoided, and the quality of the commodity element dictionary is improved.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A dictionary construction method is characterized by comprising the following steps:
dividing sentences in the target text into one or more clauses according to punctuation marks;
predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary by using a text classification model trained based on a semi-supervised learning algorithm;
calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability;
adding the word as an element word of the commodity element to the commodity element dictionary in a case where a second probability that the word belongs to the commodity element is greater than a second threshold probability.
2. The dictionary construction method according to claim 1, wherein training the text classification model based on the semi-supervised learning algorithm comprises:
obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element;
inputting the training data into the text classification model to calculate a current loss function of the text classification model according to a first probability that the clause belongs to the commodity element, and optimizing the text classification model according to the current loss function;
predicting a third probability that a clause not containing the commodity element belongs to the commodity element by using the optimized text classification model;
and under the condition that the third probability is greater than the first threshold probability, adding the clauses and the third probability corresponding to the clauses to the training data so as to continuously optimize the text classification model.
3. The dictionary construction method according to claim 2, wherein the obtaining training data based on the commodity element dictionary, the training data indicating one or more clauses and a first probability that the clause belongs to the commodity element, includes:
judging whether the clauses contain element words of the commodity elements contained in the commodity element dictionary;
if the clause contains the element words of the commodity elements, the first probability is a first value;
and if the clause does not contain the element words of the commodity elements and other clauses adjacent to the clause in the same sentence contain the element words of the commodity elements, the first probability is a second value.
4. The dictionary construction method according to claim 1, wherein the calculating a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element in a case where a first probability that the clause belongs to the commodity element is larger than a first threshold probability includes:
determining the probability that the word belongs to the commodity element in different clauses according to the probability that the clauses belong to the commodity element;
calculating an average of the probabilities that the word belongs to the commodity element in the different clauses to obtain the second probability.
5. The dictionary construction method according to claim 1, further comprising, in a case where a first probability that the clause is attributed to the commodity element is larger than a first threshold probability:
calculating the occurrence frequency of the words in all the clauses belonging to the same commodity element;
determining one or more words according to the sequence of the appearance frequency of the words from high to low so as to add the words as element words of the commodity elements to the commodity element dictionary.
6. The dictionary construction method according to claim 1, further comprising:
before calculating a second probability that words except element words currently contained in the commodity element in the clause belong to the commodity element, judging whether the words belong to a forbidden element word list;
in the case where the word does not belong to the list of forbidden element words, calculating a second probability that the word belongs to the commodity element.
7. The dictionary construction method according to claim 1, further comprising:
before calculating a second probability that words except element words currently contained in the commodity element in the clause belong to the commodity element, judging whether the words belong to stop words or not;
in the event that the word does not belong to a stop word, calculating a second probability that the word belongs to the commodity element.
8. A dictionary construction apparatus, characterized by comprising: a clause acquisition module, a first probability prediction module, a second probability calculation module, and a dictionary expansion module; wherein:
the clause acquisition module is used for dividing sentences in the target text into one or more clauses according to punctuation marks;
the first probability prediction module is used for predicting a first probability that a clause in a target text belongs to a commodity element contained in a pre-constructed commodity element dictionary by using a text classification model trained based on a semi-supervised learning algorithm;
the second probability calculation module is configured to calculate a second probability that a word other than the element word currently included in the commodity element in the clause belongs to the commodity element, if a first probability that the clause belongs to the commodity element is greater than a first threshold probability;
the dictionary expansion module is configured to add the word as a component word of the commodity component to the commodity component dictionary when a second probability that the word belongs to the commodity component is greater than a second threshold probability.
9. A dictionary construction electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202010513089.7A 2020-06-08 2020-06-08 Dictionary construction method and device Pending CN113761882A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010513089.7A CN113761882A (en) 2020-06-08 2020-06-08 Dictionary construction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010513089.7A CN113761882A (en) 2020-06-08 2020-06-08 Dictionary construction method and device

Publications (1)

Publication Number Publication Date
CN113761882A true CN113761882A (en) 2021-12-07

Family

ID=78785404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010513089.7A Pending CN113761882A (en) 2020-06-08 2020-06-08 Dictionary construction method and device

Country Status (1)

Country Link
CN (1) CN113761882A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101968788A (en) * 2009-07-27 2011-02-09 Fujitsu Ltd Method and device for extracting product attribute information
US20160247214A1 (en) * 2015-02-24 2016-08-25 Naver Corporation System and method for providing response information on product by existing users through network
CN107301248A (en) * 2017-07-19 2017-10-27 Baidu Online Network Technology (Beijing) Co., Ltd. Text word vector construction method and device, computer equipment, and storage medium
KR20180080492A (en) * 2017-01-04 2018-07-12 (주)프람트테크놀로지 Rating system and method for goods using user's reviews
CN108763226A (en) * 2016-06-28 2018-11-06 Dalian Minzu University Extraction method for commodity review elements
CN108959259A (en) * 2018-07-05 2018-12-07 4Paradigm (Beijing) Technology Co., Ltd. New word discovery method and system
WO2019200806A1 (en) * 2018-04-20 2019-10-24 Ping An Technology (Shenzhen) Co., Ltd. Device for generating text classification model, method, and computer-readable storage medium
CN110442728A (en) * 2019-06-28 2019-11-12 Tianjin University Sentiment dictionary construction method based on word2vec for the automobile product field
CN110517121A (en) * 2019-09-23 2019-11-29 Chongqing University of Posts and Telecommunications Commodity recommendation method and device based on sentiment analysis of review texts
CN111090987A (en) * 2019-12-27 2020-05-01 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and apparatus for outputting information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CAO Yu; WANG Mingyang; HE Huixin: "Research on multi-class sentiment classification of microblog texts based on sentiment dictionary expansion", Journal of Intelligence, no. 10, 18 October 2016 (2016-10-18) *

Similar Documents

Publication Publication Date Title
US20190163742A1 (en) Method and apparatus for generating information
CN114329201B (en) Training method of deep learning model, content recommendation method and device
CN108628830A Method and apparatus for semantic recognition
CN107609192A Supplementary search method and device for a search engine
CN107392259B (en) Method and device for constructing unbalanced sample classification model
CN116028618B (en) Text processing method, text searching method, text processing device, text searching device, electronic equipment and storage medium
CN114861889A (en) Deep learning model training method, target object detection method and device
CN111861596A (en) Text classification method and device
CN113052262A (en) Form generation method and device, computer equipment and storage medium
CN110705271B (en) System and method for providing natural language processing service
CN110807097A (en) Method and device for analyzing data
US20230076471A1 (en) Training method, text translation method, electronic device, and storage medium
CN110895655A (en) Method and device for extracting text core phrase
CN110852078A (en) Method and device for generating title
CN113869042A (en) Text title generation method and device, electronic equipment and storage medium
CN113761882A (en) Dictionary construction method and device
CN112487765B (en) Method and device for generating notification text
CN113076395B (en) Semantic model training and search display method, device, equipment and storage medium
CN112528644B (en) Entity mounting method, device, equipment and storage medium
CN114445179A (en) Service recommendation method and device, electronic equipment and computer readable medium
CN112887426A (en) Information flow pushing method and device, electronic equipment and storage medium
CN107256244A (en) Data processing method and system
CN113762992A (en) Method and device for processing data
CN112016017A (en) Method and device for determining characteristic data
CN114492456B (en) Text generation method, model training method, device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination