CN112818693A - Automatic extraction method and system for electronic component model words

Info

Publication number: CN112818693A
Application number: CN202110177411.8A
Authority: CN
Inventors: 樊芳华
Original assignee: Shenzhen Sekorm Component Network Co Ltd
Current assignee: Shenzhen Sekorm Component Network Co Ltd
Priority date: 2021-02-07
Filing date: 2021-02-07
Publication date: 2021-05-18

Abstract

The invention discloses a method and system for automatically extracting electronic component model words. The method comprises: constructing a model column name dictionary and training a model word inference model from training documents; and obtaining a document to be extracted, then extracting model words from its tables by matching against the model column name dictionary, and/or extracting model words from its text by inference with the model word inference model. The method allows component model words to be extracted automatically from the massive electronic component data of electronics manufacturers, which reduces manual labor, improves extraction accuracy, and improves the user experience of the e-commerce system.

Description

Automatic extraction method and system for electronic component model words

Technical Field

The invention relates to the technical field of computer application, in particular to an automatic extraction method and system for electronic component model words.

Background

With the continued industrialization of society, the electronics industry has developed rapidly. A wide variety of electronic components have been produced to meet industrial demand, and with them a large amount of electronic component data. This data records a large number of component models and specifications, which need to be extracted and used as keywords so that users of e-commerce systems can search for the corresponding components. At present, the industry has no effective method for automatically extracting models from massive numbers of articles and instead relies on manual identification, marking and extraction. This is time-consuming and labor-intensive, and because the skill of the personnel varies, it produces a large number of model extraction errors. These errors degrade the accuracy of user search in the e-commerce system, the inference of user search intent, and the product recommendation results, leading to a poor user experience.

Disclosure of Invention

The invention aims to solve the technical problem of providing an automatic extraction method and system of electronic component model words aiming at the defects of the prior art.

The technical solution adopted by the invention to solve this technical problem is an automatic extraction method for electronic component model words, comprising the following steps:

S1: constructing a model column name dictionary and training a model word inference model according to training documents;

S2: obtaining a document to be extracted, and performing matching extraction of model words in the table according to the model column name dictionary, and/or performing inference extraction of model words in the text according to the model word inference model.

Preferably, in the method for automatically extracting a model word of an electronic component according to the present invention, the step S1 includes:

S11: extracting text data and/or table data from at least one training document, the training documents being documents in which model words have been marked;

S12: judging whether table data exists; if so, executing step S13, otherwise executing step S14;

S13: constructing the model column name dictionary through empirical conjecture according to the header data of the table;

S14: segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting the model words into the model word inference model for recognition training.

Preferably, in the method for automatically extracting a model word of an electronic component according to the present invention, the step S2 includes:

S21: extracting text data and/or table data from at least one document to be extracted;

S22: judging whether table data exists; if so, executing step S23, otherwise executing step S24;

S23: matching the model words under the header in the table data according to the model column name dictionary, and extracting the model words in the table;

S24: segmenting the text data with a word segmenter, and inferring with the model word inference model whether each segmented word is a model word, so as to extract the model words in the text.

Preferably, in the method for automatically extracting model words of electronic components according to the present invention, the model word inference model includes at least one single-group manufacturer model word inference model for inferring model words of a single group of manufacturers and/or all manufacturer model word inference models for inferring model words of all manufacturers;

the step S14 includes: acquiring corresponding text data according to manufacturer attributes to which training documents belong, segmenting the text data by using a word segmentation device, acquiring model words with marks after segmentation, and inputting the model words into the single-group manufacturer model word presumption model corresponding to the manufacturer attributes for recognition training;

and/or performing word segmentation on the text data by using a word segmentation device, acquiring the model words with the marks after word segmentation, and inputting the model words into the model word inference models of all manufacturers for recognition training;

the step S24 includes:

acquiring corresponding text data according to manufacturer attributes to which the documents to be extracted belong, segmenting the text data by using a word segmentation device, and performing inference on whether segmented words are model words of the manufacturer according to the single-group manufacturer model word inference model corresponding to the manufacturer attributes to extract model words of the manufacturer in the text;

and/or performing word segmentation on the text data by using a word segmentation device, and performing inference on whether the word after word segmentation is a model word according to the model word inference model of all manufacturers, so as to extract the model word in the text.

Preferably, in the method for automatically extracting model words of electronic components according to the present invention, the method further includes: discarding picture data and/or garbled data during the extraction process.

Preferably, in the method for automatically extracting model words of electronic components according to the present invention, the method further includes:

S3: storing the extracted model words into a model word library, and marking the model words in the training documents according to the model word library.

The invention also provides an automatic extraction system for electronic component model words, which comprises:

the training module is used for constructing a model column name dictionary and training a model word presumption model in advance according to a training document;

and the extraction module is used for obtaining the document to be extracted, matching and extracting model words in the table according to the model column name dictionary, and/or speculating and extracting the model words in the text according to the model word speculation model.

Preferably, in the automatic extraction system of electronic component type words according to the present invention, the training module includes:

the training data module is used for extracting text data and/or table data from at least one training document, the training documents being documents in which model words have been marked;

the training judgment module is used for judging whether table data exist or not, if so, executing the dictionary construction module, and if not, executing the model training module;

the dictionary construction module is used for constructing the model column name dictionary through empirical conjecture according to the header data of the table;

and the model training module is used for segmenting the text data by using the word segmenter, acquiring the model words with the marks after the word segmentation, and inputting the model words into the model word presumption model for recognition training.

Preferably, in the automatic extraction system of electronic component type words according to the present invention, the extraction module includes:

the data extraction module is used for extracting text data and/or table data from at least one document to be extracted;

the extraction judging module is used for judging whether table data exists or not, if so, the table extraction module is executed, and if not, the text extraction module is executed;

the table extraction module is used for matching model words under the table head in the table data according to the model column name dictionary and extracting the model words in the table;

and the text extraction module is used for segmenting the text data by using a word segmentation device, and extracting the model words in the text according to the model word presumption model for presuming whether the segmented words are model words.

Preferably, in the automatic extraction system of electronic component model words described in the present invention, the model word inference model includes at least one single-group manufacturer model word inference model for inferring model words of a single group of manufacturers and/or all manufacturer model word inference models for inferring model words of all manufacturers;

the model training module comprises:

the single-group manufacturer model training module is used for acquiring corresponding text data according to manufacturer attributes to which training documents belong, segmenting the text data by using a word segmentation device, acquiring model words with marks after segmentation, and inputting the model words into the single-group manufacturer model word presumption model corresponding to the manufacturer attributes for recognition training;

and/or all manufacturer model training modules are used for segmenting the text data by utilizing a word segmentation device, acquiring the model words with the marks after the word segmentation, and inputting the model words into all manufacturer model word presumption models for recognition training;

the text extraction module comprises:

the single-group manufacturer text extraction module is used for acquiring corresponding text data according to manufacturer attributes to which the documents to be extracted belong, segmenting the text data by using a word segmentation device, and performing inference on whether segmented words are model words of the manufacturer according to the single-group manufacturer model word inference model corresponding to the manufacturer attributes to extract model words of the manufacturer from the text;

and/or all manufacturer text extraction modules are used for utilizing word segmenters to segment words of text data, and extracting model words in the text according to the inference of whether the segmented words are model words or not by the aid of all manufacturer model word inference models.

By implementing the invention, the following beneficial effects are achieved:

according to the method, the model column name dictionary is built according to the training document, the model word presumption model is trained, then the document to be extracted is obtained, the model words in the table are matched and extracted according to the model column name dictionary, and/or the model words in the text are presumed and extracted according to the model word presumption model, so that the model words of the components can be automatically extracted from mass electronic component data of electronic manufacturers, the labor input is reduced, the extraction accuracy is improved, and the e-commerce system experience is improved.

Drawings

The invention will be further described with reference to the accompanying drawings and examples, in which:

FIG. 1 is a flow chart of a method for automatically extracting model words of electronic components according to the present invention;

FIG. 2 is a block diagram of an automatic extraction system for electronic component type words according to the present invention;

FIG. 3 is a general computation flow diagram for the Attention Model.

Detailed Description

For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

It should be noted that the flow charts shown in the drawings are only exemplary and do not necessarily include all the contents and operations/steps, nor do they necessarily have to be executed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.

A first embodiment, as shown in FIG. 1, discloses an automatic extraction method for electronic component model words, which includes the following steps:

step S1: constructing a model column name dictionary and training a model word presumption model according to the training documents;

step S2: and obtaining a document to be extracted, and performing matching extraction of model words in the table according to the model column name dictionary, and/or performing inference extraction of model words in the text according to the model word inference model.

Specifically, in this embodiment, the step S1 includes:

Step S11: extracting text data and/or table data from at least one training document. The training documents are documents in which model words have been marked, that is, documents from which model words have already been extracted. In some embodiments, the training document is a PDF document of electronic component content obtained from the CMS of the e-commerce system. To remove junk data, picture data and/or garbled data caused by PDF format problems are filtered out while the text data and/or table data are extracted, and/or data with disordered formatting is corrected, so that the correctness of the system is not affected and the data remains as consistent as possible with what a human reader sees;

step S12: judging whether table data exists, if so, executing step S13, otherwise, executing step S14;

Step S13: constructing a model column name dictionary through empirical conjecture according to the header data of the table. In some embodiments, since most model data appears under table headers, the model words present in a table can be summarized from the table's header data based on human experience, that is, by empirical conjecture, so as to form the model column name dictionary;

Step S14: segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting the model words into the model word inference model for recognition training. In some embodiments, the jieba tokenizer may be used to segment Chinese text data and the Stanford tokenizer may be used to segment English text data, yielding segmented Chinese and English text data.
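
Steps S12 to S14 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the regex tokenizer standing in for jieba or the Stanford tokenizer, the `B-MODEL`/`O` tag names, and the ratio threshold are all hypothetical. In particular, the threshold is only one possible way to assist the human judgement that step S13 relies on.

```python
import re

def normalize_header(header):
    """Lower-case a header and collapse whitespace so variants compare equal."""
    return re.sub(r"\s+", " ", header).strip().lower()

def build_column_name_dictionary(tables, marked_words, min_ratio=0.8):
    """Collect headers whose columns consist mostly of marked model words.

    Each table is a list of rows and the first row is the header. The ratio
    threshold is an illustrative stand-in for the empirical (human) step S13.
    """
    names = set()
    for table in tables:
        header, rows = table[0], table[1:]
        for col, title in enumerate(header):
            cells = [row[col] for row in rows if col < len(row) and row[col]]
            if not cells:
                continue
            hits = sum(1 for cell in cells if cell in marked_words)
            if hits / len(cells) >= min_ratio:
                names.add(normalize_header(title))
    return names

def tokenize(text):
    """Hypothetical stand-in tokenizer that keeps part-number-like tokens whole."""
    return re.findall(r"[A-Za-z0-9][A-Za-z0-9\-_/.]*[A-Za-z0-9]|[A-Za-z0-9]", text)

def label_tokens(text, marked_words):
    """Step S14 input: pair each token with 'B-MODEL' if marked, else 'O'."""
    return [(tok, "B-MODEL" if tok in marked_words else "O")
            for tok in tokenize(text)]

# Made-up sample data for illustration only.
tables = [[["Part Number", "Voltage"],
           ["STM32F103C8T6", "3.3V"],
           ["LM358DR", "5V"]]]
marked = {"STM32F103C8T6", "LM358DR"}
column_names = build_column_name_dictionary(tables, marked)
training_pairs = label_tokens("The STM32F103C8T6 MCU", marked)
```

The labeled pairs are the kind of input a sequence labeling model would be trained on; the training itself is outside the scope of this sketch.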

It should be noted that, for the intelligent extraction of model words from text, the invention adopts the Named Entity Recognition (NER) approach from natural language processing: from a large number of training documents whose model words have been extracted and marked, it extracts table headers, segments the text, constructs the model column name dictionary and builds the model word inference model, and then performs matching extraction and inference extraction of model words on new data (documents to be extracted). Therefore, step S1 further includes: constructing the model column name dictionary and training the model word inference model according to the training documents by using a specific named entity recognition method, specifically BiLSTM-CRF + Attention.

BiLSTM-CRF combines a bidirectional long short-term memory neural network with a conditional random field. LSTM stands for Long Short-Term Memory and is a kind of RNN (Recurrent Neural Network). Owing to its design, LSTM is well suited to modeling sequential data such as text.

BiLSTM is an abbreviation of Bi-directional Long Short-Term Memory and is formed by combining a forward LSTM and a backward LSTM. Both LSTM and BiLSTM are often used to model context information in natural language processing tasks.

CRF is a sequence labeling algorithm that receives an input sequence $X = (x_1, x_2, \ldots, x_n)$ and outputs a target sequence $Y = (y_1, y_2, \ldots, y_n)$; it can also be regarded as a seq2seq model. The capital letters X and Y denote sequences here. For example, in the part-of-speech tagging task, the input sequence is a string of words, and the output sequence is the corresponding parts of speech.
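
To make the sequence labeling step concrete, the following sketch shows Viterbi decoding, the standard way to find the highest-scoring tag sequence Y for an input X in a linear-chain CRF. The patent does not describe a decoding procedure, so this is an assumption about how such a layer is typically used. The tag set and all scores are made-up values; in a BiLSTM-CRF the emission scores would come from the BiLSTM layer.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence for a linear-chain CRF.

    emissions:   list of dicts, emissions[t][tag] = score of `tag` at step t
    transitions: dict, transitions[(prev, cur)] = score of moving prev -> cur
    """
    # best[t][tag] = best score of any path that ends in `tag` at step t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Illustrative scores for a three-token sentence (hypothetical values).
tags = ["O", "MODEL"]
emissions = [{"O": 2.0, "MODEL": 0.0},
             {"O": 0.5, "MODEL": 3.0},
             {"O": 1.5, "MODEL": 0.2}]
transitions = {("O", "O"): 0.5, ("O", "MODEL"): 0.0,
               ("MODEL", "O"): 0.0, ("MODEL", "MODEL"): -1.0}
decoded = viterbi(emissions, transitions, tags)  # ["O", "MODEL", "O"]
```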

Attention refers to the Attention Model, which imitates the attention mechanism of the human brain. For example, when viewing a picture, a person can see the whole picture, but when observing it closely the eyes focus on only a small region, and the brain concentrates mainly on that region. In other words, the brain's attention over the whole picture is not uniform but is distributed with different weights; this is the core idea of the Attention Model in deep learning. The overall calculation flow is shown in FIG. 3.
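
The idea of distributing attention with different weights can be sketched as a weighted sum. The version below uses dot-product similarity followed by a softmax, which is one common formulation and an assumption here; it is not a reproduction of FIG. 3, and the vectors are made-up values.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value by how similar its key is to the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# The query matches the first key more closely, so the first value dominates.
weights, context = attend(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

In a tagging model, the values would be per-token hidden states, so that each token's representation draws more heavily on the tokens most relevant to it.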

In this embodiment, the step S2 includes:

Step S21: extracting text data and/or table data from at least one document to be extracted. The document to be extracted is a new document in which model words have not been marked, that is, a document from which model words have not yet been extracted.

Step S22: judging whether table data exists, if so, executing step S23, otherwise, executing step S24;

step S23: matching model words under a header in table data according to the model column name dictionary, and extracting the model words in the table;

Step S24: segmenting the text data with a word segmenter, and inferring with the model word inference model whether each segmented word is a model word, so as to extract the model words in the text. After the model word inference model has been trained in step S1, it captures the general attributes and judgment rules of model words; with continued training, these attributes and rules are further refined and the inference accuracy of the model improves.
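
The branching in steps S22 to S24 can be sketched as follows. The function names and sample data are hypothetical, the regex tokenizer is a stand-in for a real word segmenter, and `looks_like_model_word` is a toy rule that merely takes the place of the trained model word inference model.

```python
import re

def extract_from_table(table, column_names):
    """Step S23: return the cells under headers found in the dictionary."""
    header, rows = table[0], table[1:]
    wanted = [i for i, title in enumerate(header)
              if title.strip().lower() in column_names]
    return [row[i].strip() for row in rows for i in wanted
            if i < len(row) and row[i].strip()]

def extract_from_text(text, is_model_word):
    """Step S24: tokenize, then keep the tokens the inference model accepts."""
    tokens = re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", text)
    return [tok for tok in tokens if is_model_word(tok)]

def looks_like_model_word(token):
    """Toy stand-in for the trained model: mixed letters and digits, length >= 5."""
    return (len(token) >= 5
            and any(c.isdigit() for c in token)
            and any(c.isalpha() for c in token))

def extract_model_words(tables, text, column_names, is_model_word):
    """Step S22: match in tables when they exist, otherwise infer from text."""
    if tables:
        return [word for table in tables
                for word in extract_from_table(table, column_names)]
    return extract_from_text(text, is_model_word)

# Made-up inputs for illustration only.
column_names = {"part number", "型号"}
tables = [[["Part Number", "Package"],
           ["LM358DR", "SOIC-8"],
           ["NE555P", "DIP-8"]]]
from_table = extract_model_words(tables, "", column_names, looks_like_model_word)
from_text = extract_model_words(
    [], "The TPS5430DDA regulator delivers 3 A.", column_names, looks_like_model_word)
```

Since the method allows matching extraction and inference extraction to be combined ("and/or"), a variant could run both branches on the same document and merge the results.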

In some embodiments, the training documents and the documents to be extracted generally include documents from multiple manufacturers, each of which has corresponding manufacturer attributes such as an identification number, and it is sometimes necessary to accurately infer the model words of one manufacturer or of several similar manufacturers. The model word inference model therefore includes at least one single-group manufacturer model word inference model for inferring the model words of a single group of manufacturers and/or an all-manufacturer model word inference model for inferring the model words of all manufacturers. Each single-group manufacturer model word inference model corresponds to one manufacturer or to several similar manufacturers. The all-manufacturer model word inference model is trained on and infers from the documents of all manufacturers, without distinguishing between manufacturers.

Accordingly, the step S14 includes: obtaining corresponding text data according to manufacturer attributes to which the training documents belong, utilizing a word segmentation device to segment the text data, obtaining model words with marks after word segmentation, and inputting the model words into a single-group manufacturer model word presumption model corresponding to the manufacturer attributes for recognition training.

accordingly, the step S24 includes:

acquiring corresponding text data according to manufacturer attributes to which a document to be extracted belongs, segmenting the text data by using a word segmentation device, and performing inference on whether segmented words are model words of a manufacturer according to a single-group manufacturer model word inference model corresponding to the manufacturer attributes to extract model words of the manufacturer from the text;

and/or performing word segmentation on the text data by using a word segmentation device, and performing inference on whether the word after word segmentation is a model word according to all manufacturer model word inference models, so as to extract the model word in the text.
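
Selecting between a single-group manufacturer model and the all-manufacturer model can be sketched as a lookup keyed by the manufacturer attribute. The identifiers, the dictionary-based registry and the lambda predicates below are hypothetical; real models would be trained classifiers rather than string rules.

```python
def pick_models(vendor_id, vendor_models, all_vendor_model=None):
    """Return the inference models that apply to a document from `vendor_id`."""
    chosen = []
    if vendor_id in vendor_models:
        chosen.append(vendor_models[vendor_id])
    if all_vendor_model is not None:
        chosen.append(all_vendor_model)
    return chosen

def infer_model_words(tokens, vendor_id, vendor_models, all_vendor_model=None):
    """Keep tokens accepted by any applicable model, in first-seen order."""
    models = pick_models(vendor_id, vendor_models, all_vendor_model)
    found = []
    for tok in tokens:
        if tok not in found and any(model(tok) for model in models):
            found.append(tok)
    return found

# Hypothetical registry: several similar manufacturers could share one entry
# by mapping each of their identifiers to the same model object.
vendor_models = {"vendor-a": lambda t: t.startswith("STM32")}
general_model = lambda t: len(t) >= 6 and any(c.isdigit() for c in t)

tokens = ["STM32F1", "board", "LM358DR", "STM32F1"]
both = infer_model_words(tokens, "vendor-a", vendor_models, general_model)
vendor_only = infer_model_words(tokens, "vendor-a", vendor_models)
unknown_vendor = infer_model_words(tokens, "vendor-b", vendor_models)
```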

In this embodiment, the method for automatically extracting the type words of the electronic component further includes:

Step S3: storing the extracted model words into a model word library, and marking the model words in the training documents according to the model word library. By continuously extracting model words from new documents to be extracted, the invention obtains new model words and marks them in the training documents, thereby continuously improving the model column name dictionary and the training of the model word inference model, so that subsequent matching extraction and inference extraction become more accurate. In some embodiments, before step S3, the method further includes aggregating the model words extracted with the model column name dictionary and with the model word inference model, and storing the aggregated model words into the model word library.

In some embodiments, when the model words extracted by a single-group manufacturer model word inference model and by the all-manufacturer model word inference model are the same, they may be merged before being stored in the model word library. In other embodiments, words that were extracted by mistake and are not model words can be filtered out during aggregation before storage in the model word library; the filtering may be automatic or manual.
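
The aggregation, merging and filtering before storage can be sketched as below. The function names are hypothetical, the reject list stands in for whatever automatic or manual filter is used, and treating words that differ only in letter case as duplicates is an assumption of this sketch.

```python
def aggregate_model_words(table_words, vendor_words, all_vendor_words, reject=()):
    """Merge the three result lists, dropping duplicates and rejected words."""
    rejected = {word.upper() for word in reject}
    merged, seen = [], set()
    for word in [*table_words, *vendor_words, *all_vendor_words]:
        key = word.strip().upper()
        if key and key not in seen and key not in rejected:
            seen.add(key)
            merged.append(word.strip())
    return merged

def update_lexicon(lexicon, words):
    """Add new words to the model word library and return the ones added."""
    added = [word for word in words if word not in lexicon]
    lexicon.update(added)
    return added

# Made-up extraction results for illustration only.
merged = aggregate_model_words(
    ["LM358DR"], ["lm358dr", "NE555P"], ["NE555P", "ROHS"], reject=["RoHS"])
lexicon = {"LM358DR"}
newly_added = update_lexicon(lexicon, merged)
```

The newly added words are the ones that would then be marked in the training documents for the next round of training.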

A second embodiment, as shown in FIG. 2, discloses an automatic extraction system for electronic component model words, including:

and the extraction module is used for obtaining the document to be extracted, matching and extracting the model words in the table according to the model column name dictionary, and/or speculating and extracting the model words in the text according to the model word speculation model.

Specifically, in this embodiment, the training module includes:

the training data module is used for extracting text data and/or table data from at least one training document. The training documents are documents in which model words have been marked, that is, documents from which model words have already been extracted. In some embodiments, the training document is a PDF document of electronic component content obtained from the CMS of the e-commerce system. Preferably, to remove junk data, the system further comprises a cleaning module, which filters out picture data and/or garbled data caused by PDF format problems while the text data and/or table data are extracted, and/or corrects data with disordered formatting, so that the correctness of the system is not affected and the data remains as consistent as possible with what a human reader sees;

the dictionary construction module is used for constructing a model column name dictionary through empirical conjecture according to the header data of the table. In some embodiments, since most model data appears under table headers, the model words present in a table can be summarized from the table's header data based on human experience, that is, by empirical conjecture, so as to form the model column name dictionary;

and the model training module is used for segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting the model words into the model word inference model for recognition training. In some embodiments, the jieba tokenizer may be used to segment Chinese text data and the Stanford tokenizer may be used to segment English text data, yielding segmented Chinese and English text data.
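
The cleaning module mentioned above can be sketched as a character filter applied to text extracted from a PDF. The patent does not say how garbled data is detected, so the rules below (dropping the Unicode replacement character, control characters and private-use code points, then tidying whitespace) are assumptions chosen for illustration, and the function name is hypothetical.

```python
import re
import unicodedata

def clean_extracted_text(text):
    """Drop undecodable or invisible characters and tidy whitespace."""
    # NFKC folds full-width letters and similar variants into plain forms.
    text = unicodedata.normalize("NFKC", text)
    kept = []
    for ch in text:
        if ch == "\ufffd":
            continue  # replacement character left behind by failed decoding
        if unicodedata.category(ch) in ("Cc", "Cf", "Co", "Cn") and ch not in "\n\t":
            continue  # control, format, private-use and unassigned code points
        kept.append(ch)
    lines = [re.sub(r"[ \t]+", " ", line).strip()
             for line in "".join(kept).split("\n")]
    return "\n".join(line for line in lines if line)

# Made-up sample with a NUL byte, a replacement character, a private-use
# bullet glyph and full-width letters.
raw = "LM358\x00DR \ufffd  dual  op-amp\n\n\uf0b7 Ｐａｒｔ"
cleaned = clean_extracted_text(raw)
```

Picture data would normally be skipped earlier, at the PDF parsing stage, which this sketch does not cover.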

It should be noted that, for the intelligent extraction of model words from text, the invention adopts the Named Entity Recognition (NER) approach from natural language processing: from a large number of training documents whose model words have been extracted and marked, it extracts table headers, segments the text, constructs the model column name dictionary and builds the model word inference model, and then performs matching extraction and inference extraction of model words on new data (documents to be extracted). Therefore, the training module is further used for constructing the model column name dictionary and training the model word inference model according to the training documents by using a specific named entity recognition method, specifically BiLSTM-CRF + Attention.

In this embodiment, the extracting module includes:

the data extraction module is used for extracting text data and/or table data from at least one document to be extracted. The document to be extracted is a new document in which model words have not been marked, that is, a document from which model words have not yet been extracted.

and the text extraction module is used for segmenting the text data by using the word segmenter, and extracting the model words in the text according to the model word presumption model for presuming whether the segmented words are model words. After the model word presumption model is trained by the training module, the general attributes and the judgment rules of the model words can be obtained, the general attributes and the judgment rules can be further improved through continuous training, the presumption accuracy of the model word presumption model is improved, and when the word after word segmentation is obtained, whether the word belongs to the model words or not is presumed according to the obtained general attributes and the judgment rules of the model words, so that the model words in the text are extracted.

Accordingly, the model training module comprises:

the single-group manufacturer model training module is used for acquiring corresponding text data according to manufacturer attributes to which training documents belong, segmenting the text data by using a word segmentation device, acquiring model words with marks after segmentation, and inputting the model words into a single-group manufacturer model word presumption model corresponding to the manufacturer attributes for recognition training;

the text extraction module comprises:

the single-group manufacturer text extraction module is used for acquiring corresponding text data according to manufacturer attributes to which the documents to be extracted belong, segmenting the text data by using a word segmentation device, and performing inference on whether segmented words are model words of the manufacturer according to a single-group manufacturer model word inference model corresponding to the manufacturer attributes to extract model words of the manufacturer in the text;

and/or all manufacturer text extraction modules are used for utilizing the word segmentation device to segment words of the text data, and extracting model words in the text according to the inference of whether the segmented words are model words or not by all manufacturer model word inference models.

In this embodiment, the automatic extraction system for electronic component type words further includes:

and the marking module is used for storing the extracted model words into a model word library and marking the model words in the training documents according to the model word library. By continuously extracting model words from new documents to be extracted, the invention obtains new model words and marks them in the training documents, thereby continuously improving the model column name dictionary and the training of the model word inference model, so that subsequent matching extraction and inference extraction become more accurate. In some embodiments, the marking module is further configured to aggregate the model words extracted with the model column name dictionary and with the model word inference model, and to store the aggregated model words into the model word library.

In some embodiments, when the model words extracted by a single-group manufacturer model word inference model and by the all-manufacturer model word inference model are the same, they may be merged before being stored in the model word library. In some other embodiments, the system further includes a filtering module, configured to filter out, during aggregation and before storage in the model word library, words that were extracted by mistake and are not model words; the filtering may be automatic or manual.

It is to be understood that the foregoing examples, while indicating the preferred embodiments of the invention, are given by way of illustration and description, and are not to be construed as limiting the scope of the invention; it should be noted that, for those skilled in the art, the above technical features can be freely combined, and several changes and modifications can be made without departing from the concept of the present invention, which all belong to the protection scope of the present invention; therefore, all equivalent changes and modifications made within the scope of the claims of the present invention should be covered by the claims of the present invention.

Claims

1. An automatic extraction method of electronic component model words is characterized by comprising the following steps:

2. The method for automatically extracting words of electronic component types according to claim 1, wherein the step S1 includes:

3. The method for automatically extracting words of electronic component types according to claim 2, wherein the step S2 includes:

4. The automatic extraction method of the electronic component model words as claimed in claim 3, wherein the model word inference model includes at least one single-set manufacturer model word inference model for inferring model words of a single set of manufacturers and/or all manufacturer model word inference models for inferring model words of all manufacturers;

the step S24 includes:

5. The method for automatically extracting the model words of electronic components as claimed in claim 3, further comprising: discarding picture data and/or garbled data during the extraction process.

6. The method for automatically extracting the type words of the electronic components as claimed in claim 1, further comprising:

7. An automatic extraction system of electronic component model words is characterized by comprising:

8. The system for automatically extracting words of electronic component types according to claim 7, wherein the training module comprises:

9. The system for automatically extracting words of electronic component types according to claim 8, wherein the extraction module comprises:

10. The system for automatically extracting model words of electronic components according to claim 9, wherein the model word inference model includes at least one single-set manufacturer model word inference model for inferring model words of a single set of manufacturers and/or all manufacturer model word inference models for inferring model words of all manufacturers;

the model training module comprises:

the text extraction module comprises: