CN112818693A - Automatic extraction method and system for electronic component model words - Google Patents
- Publication number
- CN112818693A (application CN202110177411.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a method and system for automatically extracting electronic component model words. The method comprises: constructing a model column name dictionary and training a model word inference model from training documents; and obtaining a document to be extracted, then extracting model words from its tables by matching against the model column name dictionary, and/or extracting model words from its text by inference with the model word inference model. The method allows component model words to be extracted automatically from the large volumes of electronic component data published by electronics manufacturers, which reduces manual effort, improves extraction accuracy, and improves the user experience of the e-commerce system.
Description
Technical Field
The invention relates to the technical field of computer applications, and in particular to a method and system for automatically extracting electronic component model words.
Background
With continuing industrialization, the electronics industry has grown rapidly. A wide variety of electronic components has been produced to meet industrial demand, along with a large amount of electronic component data. This data records many component models and specifications, which need to be extracted and used as keywords with which users of e-commerce systems search for the corresponding components. At present, the industry has no effective method for automatically extracting models from large numbers of documents and instead relies on people to identify, mark, and extract them by eye. This is time-consuming and labor-intensive, and differences in skill among the people doing the extraction cause many extraction errors. These errors degrade the accuracy of user searches in the e-commerce system, the inference of user search intent, and product recommendations, resulting in a poor user experience.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method and system for automatically extracting electronic component model words that address the defects of the prior art.
The technical solution adopted by the invention to solve this problem is a method for automatically extracting electronic component model words, comprising the following steps:
S1: constructing a model column name dictionary and training a model word inference model from training documents;
S2: obtaining a document to be extracted, and extracting model words from tables by matching against the model column name dictionary, and/or extracting model words from text by inference with the model word inference model.
Preferably, in the method for automatically extracting electronic component model words according to the invention, step S1 comprises:
S11: extracting text data and/or table data from at least one training document, the training documents being documents in which the model words have been marked;
S12: determining whether table data exists; if so, executing step S13, otherwise executing step S14;
S13: constructing the model column name dictionary from the header data of the tables through empirical conjecture;
S14: segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the model word inference model for recognition training.
Preferably, in the method for automatically extracting electronic component model words according to the invention, step S2 comprises:
S21: extracting text data and/or table data from at least one document to be extracted;
S22: determining whether table data exists; if so, executing step S23, otherwise executing step S24;
S23: matching the model words under the headers in the table data according to the model column name dictionary, and extracting the model words from the tables;
S24: segmenting the text data with a word segmenter, and inferring with the model word inference model whether each segmented word is a model word, so as to extract the model words from the text.
Preferably, in the method for automatically extracting electronic component model words according to the invention, the model word inference model comprises at least one single-group manufacturer model word inference model for inferring the model words of a single group of manufacturers and/or an all-manufacturer model word inference model for inferring the model words of all manufacturers;
step S14 comprises: obtaining the corresponding text data according to the manufacturer attribute of the training document, segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the single-group manufacturer model word inference model corresponding to the manufacturer attribute for recognition training;
and/or segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the all-manufacturer model word inference model for recognition training;
step S24 comprises:
obtaining the corresponding text data according to the manufacturer attribute of the document to be extracted, segmenting the text data with a word segmenter, and inferring with the single-group manufacturer model word inference model corresponding to the manufacturer attribute whether each segmented word is a model word of that manufacturer, so as to extract that manufacturer's model words from the text;
and/or segmenting the text data with a word segmenter, and inferring with the all-manufacturer model word inference model whether each segmented word is a model word, so as to extract the model words from the text.
Preferably, in the method for automatically extracting electronic component model words according to the invention, the method further comprises: discarding picture data and/or garbled data during extraction.
Preferably, in the method for automatically extracting electronic component model words according to the invention, the method further comprises:
S3: storing the extracted model words in a model word library, and marking the model words in the training documents according to the model word library.
The invention also provides a system for automatically extracting electronic component model words, comprising:
a training module, configured to construct a model column name dictionary and train a model word inference model in advance from training documents;
an extraction module, configured to obtain a document to be extracted, and to extract model words from tables by matching against the model column name dictionary, and/or to extract model words from text by inference with the model word inference model.
Preferably, in the system for automatically extracting electronic component model words according to the invention, the training module comprises:
a training data module, configured to extract text data and/or table data from at least one training document, the training documents being documents in which the model words have been marked;
a training judgment module, configured to determine whether table data exists and, if so, to invoke the dictionary construction module, otherwise to invoke the model training module;
a dictionary construction module, configured to construct the model column name dictionary from the header data of the tables through empirical conjecture;
a model training module, configured to segment the text data with a word segmenter, obtain the marked model words after segmentation, and input them into the model word inference model for recognition training.
Preferably, in the system for automatically extracting electronic component model words according to the invention, the extraction module comprises:
a data extraction module, configured to extract text data and/or table data from at least one document to be extracted;
an extraction judgment module, configured to determine whether table data exists and, if so, to invoke the table extraction module, otherwise to invoke the text extraction module;
a table extraction module, configured to match the model words under the headers in the table data according to the model column name dictionary and to extract the model words from the tables;
a text extraction module, configured to segment the text data with a word segmenter and to infer with the model word inference model whether each segmented word is a model word, so as to extract the model words from the text.
Preferably, in the system for automatically extracting electronic component model words according to the invention, the model word inference model comprises at least one single-group manufacturer model word inference model for inferring the model words of a single group of manufacturers and/or an all-manufacturer model word inference model for inferring the model words of all manufacturers;
the model training module comprises:
a single-group manufacturer model training module, configured to obtain the corresponding text data according to the manufacturer attribute of the training document, segment the text data with a word segmenter, obtain the marked model words after segmentation, and input them into the single-group manufacturer model word inference model corresponding to the manufacturer attribute for recognition training;
and/or an all-manufacturer model training module, configured to segment the text data with a word segmenter, obtain the marked model words after segmentation, and input them into the all-manufacturer model word inference model for recognition training;
the text extraction module comprises:
a single-group manufacturer text extraction module, configured to obtain the corresponding text data according to the manufacturer attribute of the document to be extracted, segment the text data with a word segmenter, and infer with the single-group manufacturer model word inference model corresponding to the manufacturer attribute whether each segmented word is a model word of that manufacturer, so as to extract that manufacturer's model words from the text;
and/or an all-manufacturer text extraction module, configured to segment the text data with a word segmenter and to infer with the all-manufacturer model word inference model whether each segmented word is a model word, so as to extract the model words from the text.
Implementing the invention has the following beneficial effects:
The invention constructs a model column name dictionary and trains a model word inference model from training documents, then obtains a document to be extracted and extracts model words from its tables by matching against the model column name dictionary, and/or extracts model words from its text by inference with the model word inference model. Component model words can thus be extracted automatically from the large volumes of electronic component data published by electronics manufacturers, which reduces manual effort, improves extraction accuracy, and improves the user experience of the e-commerce system.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a method for automatically extracting model words of electronic components according to the present invention;
FIG. 2 is a block diagram of an automatic extraction system for electronic component model words according to the present invention;
FIG. 3 is a general computation flow diagram for the Attention Model.
Detailed Description
For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
It should be noted that the flow charts shown in the drawings are only exemplary and do not necessarily include all the contents and operations/steps, nor do they necessarily have to be executed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The block diagrams shown in the figures represent functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks, processor devices, and/or microcontroller devices.
A first embodiment, as shown in FIG. 1, discloses a method for automatically extracting electronic component model words, which includes the following steps:
Step S1: constructing a model column name dictionary and training a model word inference model from training documents;
Step S2: obtaining a document to be extracted, and extracting model words from tables by matching against the model column name dictionary, and/or extracting model words from text by inference with the model word inference model.
Specifically, in this embodiment, step S1 includes:
Step S11: extracting text data and/or table data from at least one training document. The training documents are documents in which the model words have been marked, that is, documents whose model words have already been extracted. In some embodiments, a training document is a PDF document with electronic component content obtained from the CMS of the e-commerce system. To clear junk data, picture data and/or garbled data caused by PDF format problems are filtered out while the text data and/or table data are extracted, and/or data with a disordered format is corrected, so that the correctness of the system is not affected and the data stays as consistent as possible with what a reader sees on the page;
Step S12: determining whether table data exists; if so, executing step S13, otherwise executing step S14;
Step S13: constructing the model column name dictionary from the header data of the tables through empirical conjecture. In some embodiments, because the kind of data in each column is usually indicated by the table header, the header names under which model words appear can be summarized from the header data based on human experience (empirical conjecture) to form the model column name dictionary;
Step S14: segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the model word inference model for recognition training. In some embodiments, the jieba tokenizer may be used to segment Chinese text data and the Stanford tokenizer may be used to segment English text data, yielding segmented Chinese and English text.
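Step S13 can be illustrated with a minimal sketch. The patent describes the dictionary as being built from human experience; the code below substitutes a simple heuristic for that judgment, namely keeping every header whose column contains at least one already-marked model word. The function names, the table layout (a list of rows with the header in row 0), and the sample data are assumptions made for illustration.

```python
from collections import Counter


def normalize_header(name):
    """Lower-case a header cell and collapse whitespace so variants compare equal."""
    return " ".join(name.lower().split())


def build_column_name_dictionary(tables, marked_model_words, min_count=1):
    """Collect header names whose columns contain marked model words.

    tables: list of tables; each table is a list of rows and row 0 is the header.
    marked_model_words: set of model words already marked in the training documents.
    min_count: number of tables a header must qualify in before it is kept.
    """
    counts = Counter()
    for table in tables:
        if not table:
            continue
        header, rows = table[0], table[1:]
        for col, name in enumerate(header):
            cells = [row[col] for row in rows if col < len(row)]
            if any(cell.strip() in marked_model_words for cell in cells):
                counts[normalize_header(name)] += 1
    return {name for name, n in counts.items() if n >= min_count}


# Illustrative training tables (hypothetical data).
training_tables = [
    [["Part Number", "Package", "Voltage"],
     ["LM358DR", "SOIC-8", "3-32 V"],
     ["NE555P", "PDIP-8", "4.5-16 V"]],
    [["Model No.", "Package"],
     ["STM32F103C8T6", "LQFP-48"]],
]
marked = {"LM358DR", "NE555P", "STM32F103C8T6"}

dictionary = build_column_name_dictionary(training_tables, marked)
print(sorted(dictionary))
```

In practice the resulting set would still be reviewed by a person, which is closer to the empirical conjecture the patent describes.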
It should be noted that, for the intelligent extraction of model words from text, the invention adopts the Named Entity Recognition (NER) approach from natural language processing. From a large number of training documents whose model words have been extracted and marked, it extracts table headers, segments the text, constructs the model column name dictionary, and builds the model word inference model; it then performs matching extraction and inference extraction of model words on new data (documents to be extracted). Step S1 therefore further includes constructing the model column name dictionary and training the model word inference model from the training documents using a specific named entity recognition method, namely BiLSTM-CRF + Attention.
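To train an NER model on marked documents, the segmented text is usually converted into per-token labels. The patent does not name a labeling scheme; the sketch below assumes the common BIO convention (B for the first token of a model word, I for a continuation, O for everything else). The `MODEL` tag name and the sample tokens are hypothetical.

```python
def bio_tags(tokens, model_words):
    """Tag each token B-MODEL, I-MODEL, or O.

    tokens: list of segmented words.
    model_words: iterable of marked model words, each given as a tuple of tokens.
    """
    tags = ["O"] * len(tokens)
    # Try longer model words first so a long match is not split by a shorter one.
    for words in sorted(model_words, key=len, reverse=True):
        n = len(words)
        for i in range(len(tokens) - n + 1):
            window_free = all(tag == "O" for tag in tags[i:i + n])
            if window_free and tuple(tokens[i:i + n]) == tuple(words):
                tags[i] = "B-MODEL"
                for j in range(i + 1, i + n):
                    tags[j] = "I-MODEL"
    return tags


tokens = ["The", "LM358", "DR", "and", "NE555P", "are", "in", "stock", "."]
marked = [("LM358", "DR"), ("NE555P",)]
tags = bio_tags(tokens, marked)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The resulting token and tag pairs are the kind of input a sequence labeling model is trained on.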
BiLSTM-CRF combines a bidirectional Long Short-Term Memory (LSTM) neural network with a conditional random field (CRF). LSTM is a type of recurrent neural network (RNN). Owing to its design, LSTM is well suited to modeling sequential data such as text.
BiLSTM is an abbreviation of Bi-directional Long Short-Term Memory and is formed by combining a forward LSTM with a backward LSTM. Both LSTM and BiLSTM are often used to model context information in natural language processing tasks.
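The combination of a forward and a backward LSTM can be sketched in plain Python with a scalar LSTM cell. The gate equations follow the standard LSTM formulation; the parameter names, the toy parameter values, and the scalar state size are assumptions made for illustration and are not taken from the patent.

```python
import math

GATES = ("i", "f", "o", "g")


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_params(w, u, b):
    """Build a parameter dict that reuses the same w, u, b for every gate (toy values)."""
    params = {}
    for gate in GATES:
        params["w" + gate] = w  # weight on the current input
        params["u" + gate] = u  # weight on the previous hidden state
        params["b" + gate] = b  # bias
    return params


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with scalar input, hidden state, and cell state."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate cell value
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run_lstm(xs, p):
    """Run the cell over a sequence and return the hidden state at each position."""
    h, c, outputs = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs


def run_bilstm(xs, p_fwd, p_bwd):
    """Pair each position's forward state with its backward state."""
    forward = run_lstm(xs, p_fwd)
    backward = run_lstm(xs[::-1], p_bwd)[::-1]
    return list(zip(forward, backward))


p_fwd = make_params(0.5, 0.1, 0.0)
p_bwd = make_params(-0.3, 0.2, 0.1)
states = run_bilstm([0.5, -1.0, 2.0], p_fwd, p_bwd)
for position, (h_f, h_b) in enumerate(states):
    print(position, round(h_f, 4), round(h_b, 4))
```

At each position the forward state summarizes the tokens to the left and the backward state summarizes the tokens to the right, which is why the pair carries context from both directions.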
CRF is a sequence labeling algorithm that receives an input sequence $X = (x_1, x_2, \ldots, x_n)$ and outputs a target sequence $Y = (y_1, y_2, \ldots, y_n)$; it can also be regarded as a seq2seq model. The capital letters X and Y denote sequences here. For example, in a part-of-speech tagging task, the input sequence is a string of words and the output sequence is the corresponding parts of speech.
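The patent does not describe how the CRF layer selects its output sequence. A common choice for linear-chain CRFs is Viterbi decoding, sketched below under that assumption. The tag names, the emission scores (which a BiLSTM would normally supply), and the transition scores are hypothetical hand-set values.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence for a linear-chain model.

    emissions: list of dicts; emissions[t][tag] is the score of tag at position t.
    transitions: dict; transitions[(prev_tag, tag)] is the score of prev_tag -> tag.
    """
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, tag)])
            scores[tag] = best[-1][prev] + transitions[(prev, tag)] + emissions[t][tag]
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B-MODEL", "I-MODEL"]
transitions = {(a, b): 0.0 for a in tags for b in tags}
transitions[("B-MODEL", "I-MODEL")] = 1.0   # a continuation often follows a beginning
transitions[("I-MODEL", "I-MODEL")] = 0.5
transitions[("O", "I-MODEL")] = -10.0       # a continuation cannot follow "outside"

# Scores for the tokens ["use", "LM358", "DR"].
emissions = [
    {"O": 2.0, "B-MODEL": 0.0, "I-MODEL": 0.0},
    {"O": 0.5, "B-MODEL": 2.0, "I-MODEL": 0.0},
    {"O": 1.0, "B-MODEL": 0.0, "I-MODEL": 0.9},
]
path = viterbi(emissions, transitions, tags)
print(path)
```

Picking the best tag at each position independently would label the last token O; the transition scores make the decoder prefer the consistent sequence in which I-MODEL follows B-MODEL.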
Attention refers to the attention model, which imitates the way attention works in the human brain. For example, when looking at a picture, a person can see the whole picture, but on closer inspection the eyes focus on only a small region, and the brain concentrates mainly on that region. In other words, the brain's attention is not spread evenly over the whole picture; different regions receive different weights. This is the core idea of the attention model in deep learning. The overall calculation flow is shown in FIG. 3.
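The idea of unequal weights can be shown with a small sketch. The patent does not state which scoring function its attention model uses; the code below assumes dot-product scoring followed by a softmax, and the query, key, and value vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Dot-product attention: weight each value by softmax(query . key)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
weights, context = attend(query, keys, values)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The key that best matches the query receives the largest weight, so its value dominates the weighted sum, which mirrors the focus on one small region described above.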
In this embodiment, step S2 includes:
Step S21: extracting text data and/or table data from at least one document to be extracted. The document to be extracted is a new document whose model words have not been marked, that is, a document from which no model words have yet been extracted.
Step S22: determining whether table data exists; if so, executing step S23, otherwise executing step S24;
Step S23: matching the model words under the headers in the table data according to the model column name dictionary, and extracting the model words from the tables;
Step S24: segmenting the text data with a word segmenter, and inferring with the model word inference model whether each segmented word is a model word, so as to extract the model words from the text. Once the model word inference model has been trained in step S1, it has learned the general attributes and decision rules of model words; continued training further refines these attributes and rules and improves the inference accuracy of the model.
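Steps S22 to S24 can be sketched as a small dispatch routine. The document layout (a dict with `tables` and `text` keys), the whitespace tokenizer, and the regular-expression check that stands in for the trained model word inference model are all assumptions made for illustration; a real system would call the trained model instead.

```python
import re


def normalize_header(name):
    """Lower-case a header cell and collapse whitespace so variants compare equal."""
    return " ".join(name.lower().split())


def extract_from_table(table, column_name_dictionary):
    """Step S23: return the cells found under headers listed in the dictionary."""
    if not table:
        return []
    header, rows = table[0], table[1:]
    model_cols = [i for i, name in enumerate(header)
                  if normalize_header(name) in column_name_dictionary]
    found = []
    for row in rows:
        for i in model_cols:
            if i < len(row) and row[i].strip():
                found.append(row[i].strip())
    return found


def extract_from_text(text, tokenize, is_model_word):
    """Step S24: keep the tokens that the inference model judges to be model words."""
    return [token for token in tokenize(text) if is_model_word(token)]


def extract_model_words(document, column_name_dictionary, tokenize, is_model_word):
    """Step S22: use table matching when table data exists, otherwise text inference."""
    if document.get("tables"):
        found = []
        for table in document["tables"]:
            found.extend(extract_from_table(table, column_name_dictionary))
        return found
    return extract_from_text(document.get("text", ""), tokenize, is_model_word)


def looks_like_model_word(token):
    """Stand-in for the trained model: letters and digits mixed, length >= 4."""
    return bool(re.fullmatch(r"(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9-]{4,}", token))


dictionary = {"part number", "model no."}
table_doc = {"tables": [[["Part Number", "Package"], ["LM358DR", "SOIC-8"]]], "text": ""}
text_doc = {"tables": [], "text": "The NE555P timer is available now"}

from_table = extract_model_words(table_doc, dictionary, str.split, looks_like_model_word)
from_text = extract_model_words(text_doc, dictionary, str.split, looks_like_model_word)
print(from_table, from_text)
```

The sketch follows the either/or branch of step S22; a document containing both tables and running text could be passed through both extractors, in line with the "and/or" wording of step S2.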
In some embodiments, the training documents and the documents to be extracted generally include documents from multiple manufacturers, each of which may have corresponding manufacturer attributes such as an identification number. It is sometimes necessary to infer accurately the model words of one particular manufacturer or of several similar manufacturers. The model word inference model therefore includes at least one single-group manufacturer model word inference model for inferring the model words of a single group of manufacturers and/or an all-manufacturer model word inference model for inferring the model words of all manufacturers. Each single-group manufacturer model word inference model corresponds to one manufacturer or to several similar manufacturers. The all-manufacturer model word inference model is trained on, and performs inference over, the documents of all manufacturers without distinguishing between manufacturers.
Accordingly, step S14 includes: obtaining the corresponding text data according to the manufacturer attribute of the training document, segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the single-group manufacturer model word inference model corresponding to the manufacturer attribute for recognition training;
and/or segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the all-manufacturer model word inference model for recognition training.
Accordingly, step S24 includes:
obtaining the corresponding text data according to the manufacturer attribute of the document to be extracted, segmenting the text data with a word segmenter, and inferring with the single-group manufacturer model word inference model corresponding to the manufacturer attribute whether each segmented word is a model word of that manufacturer, so as to extract that manufacturer's model words from the text;
and/or segmenting the text data with a word segmenter, and inferring with the all-manufacturer model word inference model whether each segmented word is a model word, so as to extract the model words from the text.
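The routing between a single-group manufacturer model and the all-manufacturer model might look like the sketch below. The manufacturer identifiers, the mapping from manufacturer to group, and the two lambda functions that stand in for trained models are hypothetical.

```python
def select_models(manufacturer_id, manufacturer_to_group, group_models, all_model):
    """Return the inference models to apply to a document from this manufacturer."""
    models = []
    group = manufacturer_to_group.get(manufacturer_id)
    if group in group_models:
        models.append(group_models[group])   # single-group manufacturer model
    if all_model is not None:
        models.append(all_model)             # all-manufacturer model
    return models


def infer_model_words(tokens, manufacturer_id, manufacturer_to_group,
                      group_models, all_model):
    """Apply every selected model and keep each accepted token once, in order."""
    found = []
    for model in select_models(manufacturer_id, manufacturer_to_group,
                               group_models, all_model):
        for token in tokens:
            if model(token) and token not in found:
                found.append(token)
    return found


# Two similar manufacturers share one group model (hypothetical identifiers).
manufacturer_to_group = {"M001": "group_a", "M002": "group_a"}
group_models = {"group_a": lambda token: token.startswith("LM")}
all_model = lambda token: any(ch.isdigit() for ch in token)

tokens = ["LM358DR", "NE555P", "timer"]
both = infer_model_words(tokens, "M001", manufacturer_to_group, group_models, all_model)
group_only = infer_model_words(tokens, "M001", manufacturer_to_group, group_models, None)
print(both, group_only)
```

A document from a manufacturer with no group model falls back to the all-manufacturer model alone.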
In this embodiment, the method for automatically extracting electronic component model words further includes:
Step S3: storing the extracted model words in a model word library, and marking the model words in the training documents according to the model word library. By continually extracting model words from new documents to be extracted, the invention obtains new model words and marks them in the training documents, which continually improves the model column name dictionary and the training of the model word inference model and makes subsequent matching extraction and inference extraction more accurate. In some embodiments, before step S3, the method further includes aggregating the model words extracted with the model column name dictionary and with the model word inference model, and storing the aggregated model words in the model word library.
In some embodiments, when a single-group manufacturer model word inference model and the all-manufacturer model word inference model extract the same model words, the duplicates may be merged before being stored in the model word library. In other embodiments, words that were extracted by mistake and are not model words can be filtered out during aggregation before storage; the filtering may be automatic or manual.
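The aggregation, merging, and filtering described above can be sketched as follows. The model word library is represented here as a plain Python set and the filter as a caller-supplied predicate; both are assumptions, as are the sample words.

```python
def aggregate(*sources):
    """Merge extraction results, keeping first-seen order and dropping duplicates."""
    seen, merged = set(), []
    for words in sources:
        for word in words:
            word = word.strip()
            if word and word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


def update_library(library, candidates, reject=lambda word: False):
    """Add candidates that pass the filter; return the words that were newly added."""
    added = []
    for word in candidates:
        if reject(word) or word in library:
            continue
        library.add(word)
        added.append(word)
    return added


from_tables = ["LM358DR", "NE555P"]
from_group_model = ["NE555P", "SOIC-8"]
from_all_model = ["NE555P", "STM32F103C8T6"]

merged = aggregate(from_tables, from_group_model, from_all_model)

# "SOIC-8" is a package code extracted by mistake; reject it before storage.
library = {"LM358DR"}
added = update_library(library, merged, reject=lambda word: word in {"SOIC-8"})
print(merged, added)
```

The newly added words are the ones that would then be marked in the training documents for the next round of training.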
A second embodiment, as shown in FIG. 2, discloses a system for automatically extracting electronic component model words, including:
a training module, configured to construct a model column name dictionary and train a model word inference model in advance from training documents;
an extraction module, configured to obtain a document to be extracted, and to extract model words from tables by matching against the model column name dictionary, and/or to extract model words from text by inference with the model word inference model.
Specifically, in this embodiment, the training module includes:
a training data module, configured to extract text data and/or table data from at least one training document. The training documents are documents in which the model words have been marked, that is, documents whose model words have already been extracted. In some embodiments, a training document is a PDF document with electronic component content obtained from the CMS of the e-commerce system. Preferably, to clear junk data, the system further includes a cleaning module, configured to filter out picture data and/or garbled data caused by PDF format problems while the text data and/or table data are extracted, and/or to correct data with a disordered format, so that the correctness of the system is not affected and the data stays as consistent as possible with what a reader sees on the page;
a training judgment module, configured to determine whether table data exists and, if so, to invoke the dictionary construction module, otherwise to invoke the model training module;
a dictionary construction module, configured to construct the model column name dictionary from the header data of the tables through empirical conjecture. In some embodiments, because the kind of data in each column is usually indicated by the table header, the header names under which model words appear can be summarized from the header data based on human experience (empirical conjecture) to form the model column name dictionary;
a model training module, configured to segment the text data with a word segmenter, obtain the marked model words after segmentation, and input them into the model word inference model for recognition training. In some embodiments, the jieba tokenizer may be used to segment Chinese text data and the Stanford tokenizer may be used to segment English text data, yielding segmented Chinese and English text.
It should be noted that, for the intelligent extraction of model words from text, the invention adopts the Named Entity Recognition (NER) approach from natural language processing. From a large number of training documents whose model words have been extracted and marked, it extracts table headers, segments the text, constructs the model column name dictionary, and builds the model word inference model; it then performs matching extraction and inference extraction of model words on new data (documents to be extracted). The training module is therefore further configured to construct the model column name dictionary and train the model word inference model from the training documents using a specific named entity recognition method, namely BiLSTM-CRF + Attention.
BiLSTM-CRF combines a bidirectional Long Short-Term Memory (LSTM) neural network with a conditional random field (CRF). LSTM is a type of recurrent neural network (RNN). Owing to its design, LSTM is well suited to modeling sequential data such as text.
BiLSTM is an abbreviation of Bi-directional Long Short-Term Memory and is formed by combining a forward LSTM with a backward LSTM. Both LSTM and BiLSTM are often used to model context information in natural language processing tasks.
CRF is a sequence labeling algorithm that receives an input sequence $X = (x_1, x_2, \ldots, x_n)$ and outputs a target sequence $Y = (y_1, y_2, \ldots, y_n)$; it can also be regarded as a seq2seq model. The capital letters X and Y denote sequences here. For example, in a part-of-speech tagging task, the input sequence is a string of words and the output sequence is the corresponding parts of speech.
Attention refers to the attention model, which imitates the way attention works in the human brain. For example, when looking at a picture, a person can see the whole picture, but on closer inspection the eyes focus on only a small region, and the brain concentrates mainly on that region. In other words, the brain's attention is not spread evenly over the whole picture; different regions receive different weights. This is the core idea of the attention model in deep learning. The overall calculation flow is shown in FIG. 3.
In this embodiment, the extraction module includes:
the data extraction module is used for extracting text data and/or table data from at least one document to be extracted; the document to be extracted is a new document in which model words have not yet been marked, that is, a document from which model words have not yet been extracted;
the extraction judging module is used for judging whether table data exists; if so, the table extraction module is executed, and if not, the text extraction module is executed;
the table extraction module is used for matching model words under the table head in the table data according to the model column name dictionary and extracting the model words in the table;
and the text extraction module is used for segmenting the text data with a word segmenter and using the model word inference model to infer whether each segmented word is a model word, thereby extracting the model words in the text. Once the training module has trained the model word inference model, the model has learned the general attributes and judgment rules of model words; continued training refines these further and improves the inference accuracy. When segmented words are obtained, the model infers from the learned general attributes and judgment rules whether each word is a model word, and the model words in the text are extracted accordingly.
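The control flow of the extraction module described above can be sketched as follows. This is a minimal sketch under stated assumptions: the document layout (a dict with `tables` and `text`), the dictionary entries, the naive regex segmenter and the `looks_like_model_word` heuristic are all hypothetical stand-ins. In particular, the heuristic only takes the place of the trained model word inference model so that the example runs on its own.

```python
import re
from typing import Callable, Dict, List

# Hypothetical model column name dictionary (lower-cased header names).
COLUMN_NAME_DICT = {"model", "model number", "part number"}


def looks_like_model_word(token: str) -> bool:
    """Stand-in for the trained inference model: mixed letters and digits."""
    pattern = r"(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9\-]{4,}"
    return re.fullmatch(pattern, token) is not None


def extract_from_tables(tables: List[Dict[str, List[str]]]) -> List[str]:
    """Table extraction: take the cells under headers found in the dictionary."""
    found: List[str] = []
    for table in tables:
        for header, cells in table.items():
            if header.strip().lower() in COLUMN_NAME_DICT:
                found.extend(cell for cell in cells if cell)
    return found


def extract_from_text(text: str,
                      is_model_word: Callable[[str], bool] = looks_like_model_word
                      ) -> List[str]:
    """Text extraction: segment the text, then ask the model about each word."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", text)  # naive segmenter (assumption)
    return [token for token in tokens if is_model_word(token)]


def extract(document: dict) -> List[str]:
    """Extraction judging step: tables if present, otherwise text."""
    tables = document.get("tables") or []
    if tables:
        return extract_from_tables(tables)
    return extract_from_text(document.get("text", ""))


DOC_WITH_TABLE = {"tables": [{"Model": ["AB1234", "CD5678"],
                              "Package": ["SOT-23", "QFN"]}]}
DOC_TEXT_ONLY = {"text": "The AB1234 regulator replaces the older XY-9000 part."}

print(extract(DOC_WITH_TABLE))
print(extract(DOC_TEXT_ONLY))
```

Passing the predictor as a parameter keeps the dispatch logic independent of how the inference model is implemented, so a trained model can replace the heuristic without changing `extract`.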
In some embodiments, the training documents and the documents to be extracted typically include documents from multiple manufacturers, each of which may have corresponding manufacturer attributes such as an identification number. Sometimes the model words of one manufacturer, or of several similar manufacturers, need to be inferred accurately. The model word inference model therefore includes at least one single-group manufacturer model word inference model, used to infer the model words of a single group of manufacturers, and/or an all-manufacturer model word inference model, used to infer the model words of all manufacturers. Each single-group manufacturer model word inference model corresponds to one manufacturer or to several similar manufacturers. The all-manufacturer model word inference model is trained on, and infers over, the documents of all manufacturers without distinguishing between manufacturers.
Accordingly, the model training module comprises:
the single-group manufacturer model training module is used for acquiring the corresponding text data according to the manufacturer attribute to which a training document belongs, segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the single-group manufacturer model word inference model corresponding to that manufacturer attribute for recognition training;
and/or the all-manufacturer model training module is used for segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the all-manufacturer model word inference model for recognition training;
the text extraction module comprises:
the single-group manufacturer text extraction module is used for acquiring the corresponding text data according to the manufacturer attribute to which a document to be extracted belongs, segmenting the text data with a word segmenter, and using the single-group manufacturer model word inference model corresponding to that manufacturer attribute to infer whether each segmented word is a model word of that manufacturer, thereby extracting that manufacturer's model words from the text;
and/or the all-manufacturer text extraction module is used for segmenting the text data with a word segmenter and using the all-manufacturer model word inference model to infer whether each segmented word is a model word, thereby extracting the model words in the text.
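One way to picture the relationship between the single-group models and the all-manufacturer model is a small router keyed by manufacturer attribute. The class name, the use of a string identifier as the manufacturer attribute, and the fallback to the all-manufacturer model when no group matches are assumptions made for this sketch; the text above only says the two kinds of model may be used together ("and/or"). The lambdas stand in for trained models.

```python
from typing import Callable, Dict, List, Optional

Predictor = Callable[[str], bool]  # returns True if the word is a model word


class ModelRouter:
    """Selects an inference model from the manufacturer attribute of a document."""

    def __init__(self, all_manufacturer_model: Predictor) -> None:
        self._by_manufacturer: Dict[str, Predictor] = {}
        self._all = all_manufacturer_model

    def register_group(self, manufacturer_ids: List[str], model: Predictor) -> None:
        """Attach one single-group model to one or several similar manufacturers."""
        for manufacturer_id in manufacturer_ids:
            self._by_manufacturer[manufacturer_id] = model

    def select(self, manufacturer_id: Optional[str]) -> Predictor:
        # Assumed policy: fall back to the all-manufacturer model.
        return self._by_manufacturer.get(manufacturer_id, self._all)

    def extract(self, tokens: List[str],
                manufacturer_id: Optional[str] = None) -> List[str]:
        model = self.select(manufacturer_id)
        return [token for token in tokens if model(token)]


# Hypothetical stand-ins for trained models.
router = ModelRouter(lambda word: any(ch.isdigit() for ch in word))
router.register_group(["m1", "m2"], lambda word: word.startswith("AX"))

TOKENS = ["AX100", "BZ200", "chip"]
print(router.extract(TOKENS, "m1"))  # uses the single-group model
print(router.extract(TOKENS, "m9"))  # unknown manufacturer, uses the fallback
```

Registering several identifiers against the same model mirrors the statement that one single-group model may correspond to a plurality of similar manufacturers.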
In this embodiment, the automatic extraction system for electronic component model words further includes:
the marking module, which is used for storing the extracted model words in a model word library and marking the model words in the training documents according to the model word library. The invention continuously extracts model words from new documents to be extracted, obtains new model words and marks them in the training documents, thereby continuously improving the construction of the model column name dictionary and the training of the model word inference model, so that subsequent matching extraction and inference extraction become more accurate. In some embodiments, the model words extracted via the model column name dictionary and those extracted via the model word inference model are aggregated before being stored in the model word library.
In some embodiments, when the model words extracted by a single-group manufacturer model word inference model and by the all-manufacturer model word inference model are the same, they may be merged before being stored in the model word library. In some other embodiments, the system further includes a filtering module, which aggregates and filters the extracted words before they are stored in the model word library so as to remove words that were extracted by mistake and are not model words; the filtering may be automatic or manual.
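The merge, filter and mark steps can be sketched as below. The function names, the use of a Python set as the model word library, the `reject` callback standing in for the filtering module, and the `MODEL`/`O` label names are all assumptions for illustration; the patent does not specify a storage format or a labeling scheme.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple


def merge_into_library(library: Set[str],
                       *extractions: Iterable[str],
                       reject: Optional[Callable[[str], bool]] = None
                       ) -> List[str]:
    """Merge several extraction results into the library.

    Duplicates are stored once, rejected words are skipped, and the newly
    added words are returned in first-seen order.
    """
    added: List[str] = []
    for words in extractions:
        for raw in words:
            word = raw.strip()
            if not word or word in library:
                continue
            if reject is not None and reject(word):
                continue
            library.add(word)
            added.append(word)
    return added


def mark_tokens(tokens: List[str], library: Set[str]) -> List[Tuple[str, str]]:
    """Mark training tokens using the library (hypothetical label names)."""
    return [(token, "MODEL" if token in library else "O") for token in tokens]


LIBRARY = {"AB1234"}
FROM_TABLES = ["AB1234", "CD5678"]
FROM_TEXT = ["CD5678", "page", " EF9012 "]

# Automatic filter (assumption): purely alphabetic words are not model words.
ADDED = merge_into_library(LIBRARY, FROM_TABLES, FROM_TEXT,
                           reject=lambda word: word.isalpha())
MARKED = mark_tokens(["use", "EF9012", "here"], LIBRARY)
print(ADDED, MARKED)
```

Marking the training documents from the updated library is what closes the loop described above: each round of extraction produces more marked data for the next round of training.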
By implementing the invention, the following beneficial effects are achieved:
According to the invention, a model column name dictionary is constructed and a model word inference model is trained from the training documents; a document to be extracted is then obtained, model words in tables are extracted by matching against the model column name dictionary, and/or model words in text are extracted by inference with the model word inference model. Component model words can thus be extracted automatically from the massive electronic component data of electronics manufacturers, which reduces manual effort, improves extraction accuracy and improves the experience of the e-commerce system.
It is to be understood that the foregoing examples, while indicating the preferred embodiments of the invention, are given by way of illustration and description, and are not to be construed as limiting the scope of the invention; it should be noted that, for those skilled in the art, the above technical features can be freely combined, and several changes and modifications can be made without departing from the concept of the present invention, which all belong to the protection scope of the present invention; therefore, all equivalent changes and modifications made within the scope of the claims of the present invention should be covered by the claims of the present invention.
Claims (10)
1. An automatic extraction method for electronic component model words, characterized by comprising the following steps:
S1: constructing a model column name dictionary and training a model word inference model according to training documents;
S2: obtaining a document to be extracted, and performing matching extraction of model words in tables according to the model column name dictionary, and/or performing inference extraction of model words in text according to the model word inference model.
2. The automatic extraction method for electronic component model words according to claim 1, wherein step S1 includes:
S11: extracting text data and/or table data from at least one training document, the training document being a document in which model words have been marked;
S12: judging whether table data exists; if so, executing step S13, otherwise executing step S14;
S13: constructing the model column name dictionary through empirical conjecture according to the header data of the table;
S14: segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the model word inference model for recognition training.
3. The automatic extraction method for electronic component model words according to claim 2, wherein step S2 includes:
S21: extracting text data and/or table data from at least one document to be extracted;
S22: judging whether table data exists; if so, executing step S23, otherwise executing step S24;
S23: matching the model words under the header in the table data according to the model column name dictionary, and extracting the model words in the table;
S24: segmenting the text data with a word segmenter, and inferring according to the model word inference model whether each segmented word is a model word, so as to extract the model words in the text.
4. The automatic extraction method for electronic component model words according to claim 3, wherein the model word inference model includes at least one single-group manufacturer model word inference model for inferring model words of a single group of manufacturers and/or an all-manufacturer model word inference model for inferring model words of all manufacturers;
step S14 includes: acquiring the corresponding text data according to the manufacturer attribute to which a training document belongs, segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the single-group manufacturer model word inference model corresponding to that manufacturer attribute for recognition training;
and/or segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the all-manufacturer model word inference model for recognition training;
step S24 includes:
acquiring the corresponding text data according to the manufacturer attribute to which a document to be extracted belongs, segmenting the text data with a word segmenter, and inferring according to the single-group manufacturer model word inference model corresponding to that manufacturer attribute whether each segmented word is a model word of that manufacturer, so as to extract that manufacturer's model words from the text;
and/or segmenting the text data with a word segmenter, and inferring according to the all-manufacturer model word inference model whether each segmented word is a model word, so as to extract the model words in the text.
5. The automatic extraction method for electronic component model words according to claim 3, further comprising: discarding picture data and/or garbled data during extraction.
6. The automatic extraction method for electronic component model words according to claim 1, further comprising:
S3: storing the extracted model words in a model word library, and marking the model words in the training documents according to the model word library.
7. An automatic extraction system for electronic component model words, characterized by comprising:
the training module is used for constructing a model column name dictionary and training a model word inference model in advance according to training documents;
and the extraction module is used for obtaining a document to be extracted, and performing matching extraction of model words in tables according to the model column name dictionary, and/or inference extraction of model words in text according to the model word inference model.
8. The automatic extraction system for electronic component model words according to claim 7, wherein the training module comprises:
the training data module is used for extracting text data and/or table data from at least one training document, the training document being a document in which model words have been marked;
the training judgment module is used for judging whether table data exists; if so, the dictionary construction module is executed, and if not, the model training module is executed;
the dictionary construction module is used for constructing the model column name dictionary through empirical conjecture according to the header data of the table;
and the model training module is used for segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the model word inference model for recognition training.
9. The automatic extraction system for electronic component model words according to claim 8, wherein the extraction module comprises:
the data extraction module is used for extracting text data and/or table data from at least one document to be extracted;
the extraction judging module is used for judging whether table data exists; if so, the table extraction module is executed, and if not, the text extraction module is executed;
the table extraction module is used for matching the model words under the header in the table data according to the model column name dictionary, and extracting the model words in the table;
and the text extraction module is used for segmenting the text data with a word segmenter, and inferring according to the model word inference model whether each segmented word is a model word, so as to extract the model words in the text.
10. The automatic extraction system for electronic component model words according to claim 9, wherein the model word inference model includes at least one single-group manufacturer model word inference model for inferring model words of a single group of manufacturers and/or an all-manufacturer model word inference model for inferring model words of all manufacturers;
the model training module comprises:
the single-group manufacturer model training module is used for acquiring the corresponding text data according to the manufacturer attribute to which a training document belongs, segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the single-group manufacturer model word inference model corresponding to that manufacturer attribute for recognition training;
and/or the all-manufacturer model training module is used for segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the all-manufacturer model word inference model for recognition training;
the text extraction module comprises:
the single-group manufacturer text extraction module is used for acquiring the corresponding text data according to the manufacturer attribute to which a document to be extracted belongs, segmenting the text data with a word segmenter, and inferring according to the single-group manufacturer model word inference model corresponding to that manufacturer attribute whether each segmented word is a model word of that manufacturer, so as to extract that manufacturer's model words from the text;
and/or the all-manufacturer text extraction module is used for segmenting the text data with a word segmenter, and inferring according to the all-manufacturer model word inference model whether each segmented word is a model word, so as to extract the model words in the text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110177411.8A CN112818693A (en) | 2021-02-07 | 2021-02-07 | Automatic extraction method and system for electronic component model words |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112818693A (en) | 2021-05-18 |
Family
ID=75864680
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110177411.8A Pending CN112818693A (en) | 2021-02-07 | 2021-02-07 | Automatic extraction method and system for electronic component model words |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112818693A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101101564A (en) * | 2006-07-07 | 2008-01-09 | 上海晨兴电子科技有限公司 | Automatic identification method for flash memory type of product |
CN101576910A (en) * | 2009-05-31 | 2009-11-11 | 北京学之途网络科技有限公司 | Method and device for identifying product naming entity automatically |
CN102033950A (en) * | 2010-12-23 | 2011-04-27 | 哈尔滨工业大学 | Construction method and identification method of automatic electronic product named entity identification system |
CN104408173A (en) * | 2014-12-11 | 2015-03-11 | 焦点科技股份有限公司 | Method for automatically extracting kernel keyword based on B2B platform |
CN110688855A (en) * | 2019-09-29 | 2020-01-14 | 山东师范大学 | Chinese medical entity identification method and system based on machine learning |
Non-Patent Citations (2)
Title |
---|
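Zhang Chaosheng; Guo Jianyi; Xian Yantuan; Yu Zhengtao; Lei Chunya; Wang Haixiong: "English product named entity recognition based on conditional random fields", Computer Engineering & Science, no. 06, pages 115 - 117 * |
Gu Chuan; Zhou Hongyu; Yu Jiangde: "Chinese product named entity recognition fusing multiple features", Science Technology and Engineering, no. 31, pages 9417 - 9421 * |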
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113609279A (en) * | 2021-08-05 | 2021-11-05 | 湖南特能博世科技有限公司 | Material model extraction method and device and computer equipment |
CN113609279B (en) * | 2021-08-05 | 2023-12-08 | 湖南特能博世科技有限公司 | Material model extraction method and device and computer equipment |
CN113626561A (en) * | 2021-08-16 | 2021-11-09 | 深圳市云采网络科技有限公司 | Component model identification method, device, medium and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||