CN112818693A - Automatic extraction method and system for electronic component model words - Google Patents

Automatic extraction method and system for electronic component model words

Info

Publication number
CN112818693A
CN112818693A CN202110177411.8A CN202110177411A CN112818693A CN 112818693 A CN112818693 A CN 112818693A CN 202110177411 A CN202110177411 A CN 202110177411A CN 112818693 A CN112818693 A CN 112818693A
Authority
CN
China
Prior art keywords
model
words
word
training
manufacturer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110177411.8A
Other languages
Chinese (zh)
Inventor
樊芳华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sekorm Component Network Co Ltd
Original Assignee
Shenzhen Sekorm Component Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sekorm Component Network Co Ltd filed Critical Shenzhen Sekorm Component Network Co Ltd
Priority to CN202110177411.8A priority Critical patent/CN112818693A/en
Publication of CN112818693A publication Critical patent/CN112818693A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and system for automatically extracting electronic component model words. The method comprises: constructing a model column name dictionary and training a model word inference model from training documents; and obtaining a document to be extracted, then extracting model words from tables by matching against the model column name dictionary, and/or extracting model words from text by inference with the model word inference model. With this method, component model words can be extracted automatically from the large volume of electronic component data published by electronics manufacturers, which reduces manual effort, improves extraction accuracy, and improves the user experience of the e-commerce system.

Description

Automatic extraction method and system for electronic component model words
Technical Field
The invention relates to the technical field of computer application, in particular to an automatic extraction method and system for electronic component model words.
Background
With ongoing industrialization, the electronics industry has grown rapidly. A wide variety of electronic components has been produced to meet industrial demand, along with a large volume of electronic component data. This data records many component models and specifications, which need to be extracted and used as keywords so that users of e-commerce systems can search for the corresponding components. At present, the industry has no effective method for automatically extracting models from large numbers of documents and instead relies on manual identification, marking and extraction. This is time-consuming and labor-intensive, and differences in skill among the personnel involved lead to many extraction errors, which degrade the accuracy of user search in the e-commerce system, the inference of user search intent and the quality of product recommendations, resulting in a poor user experience.
Disclosure of Invention
The invention aims to solve the technical problem of providing an automatic extraction method and system of electronic component model words aiming at the defects of the prior art.
The technical scheme adopted by the invention to solve this technical problem is as follows: an automatic extraction method for electronic component model words is constructed, comprising the following steps:
S1: constructing a model column name dictionary and training a model word inference model according to the training documents;
S2: obtaining a document to be extracted, and performing matching extraction of model words in the table according to the model column name dictionary, and/or performing inference extraction of model words in the text according to the model word inference model.
Preferably, in the method for automatically extracting a model word of an electronic component according to the present invention, the step S1 includes:
S11: extracting text data and/or table data from at least one training document, the training documents being documents in which the model words have been marked;
S12: judging whether table data exists; if so, executing step S13, otherwise executing step S14;
S13: constructing the model column name dictionary through empirical conjecture according to the header data of the table;
S14: segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the model word inference model for recognition training.
Preferably, in the method for automatically extracting a model word of an electronic component according to the present invention, the step S2 includes:
S21: extracting text data and/or table data from at least one document to be extracted;
S22: judging whether table data exists; if so, executing step S23, otherwise executing step S24;
S23: matching model words under the header in the table data according to the model column name dictionary, and extracting the model words in the table;
S24: segmenting the text data with a word segmenter, and inferring, according to the model word inference model, whether each segmented word is a model word, so as to extract the model words in the text.
Preferably, in the method for automatically extracting model words of electronic components according to the present invention, the model word inference model includes at least one single-group manufacturer model word inference model for inferring model words of a single group of manufacturers and/or all manufacturer model word inference models for inferring model words of all manufacturers;
the step S14 includes: acquiring corresponding text data according to manufacturer attributes to which training documents belong, segmenting the text data by using a word segmentation device, acquiring model words with marks after segmentation, and inputting the model words into the single-group manufacturer model word presumption model corresponding to the manufacturer attributes for recognition training;
and/or performing word segmentation on the text data by using a word segmentation device, acquiring the model words with the marks after word segmentation, and inputting the model words into the model word inference models of all manufacturers for recognition training;
the step S24 includes:
acquiring corresponding text data according to manufacturer attributes to which the documents to be extracted belong, segmenting the text data by using a word segmentation device, and performing inference on whether segmented words are model words of the manufacturer according to the single-group manufacturer model word inference model corresponding to the manufacturer attributes to extract model words of the manufacturer in the text;
and/or performing word segmentation on the text data by using a word segmentation device, and performing inference on whether the word after word segmentation is a model word according to the model word inference model of all manufacturers, so as to extract the model word in the text.
Preferably, in the method for automatically extracting model words of electronic components according to the present invention, the method further includes: discarding image data and/or garbled-character data during the extraction process.
Preferably, in the method for automatically extracting model words of electronic components according to the present invention, the method further includes:
S3: storing the extracted model words into a model word library, and marking the model words in the training documents according to the model word library.
The invention also constructs an automatic extraction system for electronic component model words, which comprises:
the training module is used for constructing a model column name dictionary and training a model word presumption model in advance according to a training document;
and the extraction module is used for obtaining the document to be extracted, matching and extracting model words in the table according to the model column name dictionary, and/or speculating and extracting the model words in the text according to the model word speculation model.
Preferably, in the automatic extraction system of electronic component type words according to the present invention, the training module includes:
the training data module is used for extracting text data and/or table data from at least one training document, the training documents being documents in which the model words have been marked;
the training judgment module is used for judging whether table data exist or not, if so, executing the dictionary construction module, and if not, executing the model training module;
the dictionary construction module is used for constructing the model column name dictionary through empirical conjecture according to the header data of the table;
and the model training module is used for segmenting the text data by using the word segmenter, acquiring the model words with the marks after the word segmentation, and inputting the model words into the model word presumption model for recognition training.
Preferably, in the automatic extraction system of electronic component type words according to the present invention, the extraction module includes:
the data extraction module is used for extracting text data and/or table data from at least one document to be extracted;
the extraction judging module is used for judging whether table data exists or not, if so, the table extraction module is executed, and if not, the text extraction module is executed;
the table extraction module is used for matching model words under the header in the table data according to the model column name dictionary and extracting the model words in the table;
and the text extraction module is used for segmenting the text data with a word segmenter, and inferring, according to the model word inference model, whether each segmented word is a model word, so as to extract the model words in the text.
Preferably, in the automatic extraction system of electronic component model words described in the present invention, the model word inference model includes at least one single-group manufacturer model word inference model for inferring model words of a single group of manufacturers and/or all manufacturer model word inference models for inferring model words of all manufacturers;
the model training module comprises:
the single-group manufacturer model training module is used for acquiring corresponding text data according to manufacturer attributes to which training documents belong, segmenting the text data by using a word segmentation device, acquiring model words with marks after segmentation, and inputting the model words into the single-group manufacturer model word presumption model corresponding to the manufacturer attributes for recognition training;
and/or an all-manufacturer model training module, used for segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the all-manufacturer model word inference model for recognition training;
the text extraction module comprises:
the single-group manufacturer text extraction module is used for acquiring corresponding text data according to manufacturer attributes to which the documents to be extracted belong, segmenting the text data by using a word segmentation device, and performing inference on whether segmented words are model words of the manufacturer according to the single-group manufacturer model word inference model corresponding to the manufacturer attributes to extract model words of the manufacturer from the text;
and/or an all-manufacturer text extraction module, used for segmenting the text data with a word segmenter, and inferring, according to the all-manufacturer model word inference model, whether each segmented word is a model word, so as to extract the model words in the text.
By implementing the invention, the following beneficial effects are achieved:
according to the method, the model column name dictionary is built according to the training document, the model word presumption model is trained, then the document to be extracted is obtained, the model words in the table are matched and extracted according to the model column name dictionary, and/or the model words in the text are presumed and extracted according to the model word presumption model, so that the model words of the components can be automatically extracted from mass electronic component data of electronic manufacturers, the labor input is reduced, the extraction accuracy is improved, and the e-commerce system experience is improved.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a method for automatically extracting model words of electronic components according to the present invention;
FIG. 2 is a block diagram of an automatic extraction system for electronic component type words according to the present invention;
FIG. 3 is a general computation flow diagram for the Attention Model.
Detailed Description
For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
It should be noted that the flow charts shown in the drawings are only exemplary and do not necessarily include all the contents and operations/steps, nor do they necessarily have to be executed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
A first embodiment, as shown in fig. 1, discloses an automatic extraction method of electronic component type words, which includes the following steps:
step S1: constructing a model column name dictionary and training a model word presumption model according to the training documents;
step S2: and obtaining a document to be extracted, and performing matching extraction of model words in the table according to the model column name dictionary, and/or performing inference extraction of model words in the text according to the model word inference model.
Specifically, in this embodiment, the step S1 includes:
step S11: text data and/or table data is extracted from at least one training document. The training documents are documents in which the model words have been marked, that is, documents from which model words have already been extracted. In some embodiments, the training document is a PDF document of electronic component content obtained from the CMS of the e-commerce system. To clear junk data, image data and/or garbled-character data caused by PDF format problems are filtered out during extraction of the text data and/or table data, and/or data with a disordered format is corrected, so that the correctness of the system is not affected and the extracted data stays as consistent as possible with what a human reader sees;
step S12: judging whether table data exists, if so, executing step S13, otherwise, executing step S14;
step S13: constructing a model column name dictionary through empirical conjecture according to the header data of the table. In some embodiments, since the type of the data in a column is in most cases indicated by the table header, the model words present in the table can be identified from the header data based on human experience, that is, by empirical conjecture, so as to form the model column name dictionary;
step S14: segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the model word inference model for recognition training. In some embodiments, the jieba tokenizer may be used to segment Chinese text data and the Stanford tokenizer may be used to segment English text data, yielding segmented Chinese and English text.
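Steps S13 and S14 can be illustrated with a minimal Python sketch. It is not the implementation described by the invention: the table layout, the helper names `build_column_name_dictionary` and `bio_tags`, the rule "a header enters the dictionary if its column contains a marked model word", and the B/O tag scheme are all assumptions made for illustration, and a plain whitespace split stands in for the jieba or Stanford tokenizer.

```python
from collections import Counter

def build_column_name_dictionary(tables, marked_model_words, min_hits=1):
    """Collect header names whose columns contain marked model words.

    `tables` is a list of (header, rows) pairs; this layout is an assumption.
    """
    hits = Counter()
    for header, rows in tables:
        for col, name in enumerate(header):
            for row in rows:
                if col < len(row) and row[col] in marked_model_words:
                    hits[name.strip().lower()] += 1
    return {name for name, count in hits.items() if count >= min_hits}

def bio_tags(tokens, marked_model_words):
    """Pair each token with B-MODEL or O (single-token model words only)."""
    return [(tok, "B-MODEL" if tok in marked_model_words else "O")
            for tok in tokens]

# Hypothetical training data: marked model words plus one table and one sentence.
marked = {"STM32F103C8T6", "LM358"}
tables = [(["Part Number", "Package", "Voltage"],
           [["STM32F103C8T6", "LQFP48", "3.3V"],
            ["LM358", "SOIC8", "5V"]])]

column_names = build_column_name_dictionary(tables, marked)
tokens = "The LM358 is a dual op amp".split()  # stand-in for a real segmenter
training_pairs = bio_tags(tokens, marked)
```

The tagged pairs are the kind of input a sequence labeling model could be trained on; in practice the empirical conjecture of step S13 may also involve manual review of the collected header names.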
It should be noted that, for the intelligent extraction of model words from text, the invention adopts a Named Entity Recognition (NER) method from natural language processing: from a large number of training documents in which model words have been extracted and marked, it extracts table headers, segments the text, constructs the model column name dictionary and builds the model word inference model, and then performs matching extraction and inference extraction of model words on new data (documents to be extracted). Therefore, step S1 further includes: constructing the model column name dictionary and training the model word inference model from the training documents by using a specific named entity recognition method, namely BiLSTM-CRF + Attention.
BiLSTM-CRF combines a bidirectional long short-term memory network with a conditional random field. LSTM stands for Long Short-Term Memory and is a kind of RNN (Recurrent Neural Network). Owing to its design, LSTM is well suited to modeling sequential data such as text.
BiLSTM is an abbreviation of Bi-directional Long Short-Term Memory and is formed by combining a forward LSTM and a backward LSTM. Both LSTM and BiLSTM are often used to model context information in natural language processing tasks.
CRF (conditional random field) is a sequence labeling algorithm that receives an input sequence X = (x_1, x_2, …, x_n) and outputs a target sequence Y = (y_1, y_2, …, y_n); in this sense it can also be regarded as a seq2seq model. Capital X and Y denote sequences here. For example, in the part-of-speech tagging task, the input sequence is a string of words and the output sequence is the corresponding parts of speech.
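To make the mapping from X to Y concrete, the following sketch shows Viterbi decoding, the standard way a linear-chain CRF picks the best label sequence once per-position scores and label-transition scores are available. The label set, the numeric scores and the function name `viterbi` are hypothetical; in a BiLSTM-CRF the emission scores would come from the BiLSTM and the transition scores would be learned.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][y]   : score of label y at position t
    transitions[p][y] : score of moving from label p to label y
    """
    # best[y] = best score of any path that ends in label y at the current step
    best = {y: emissions[0][y] for y in labels}
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p] + transitions[p][y])
            new_best[y] = best[prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Toy scores for three tokens, e.g. "the", "LM358", "amplifier" (assumed values).
labels = ["O", "B-MODEL"]
emissions = [{"O": 2.0, "B-MODEL": 0.1},
             {"O": 0.2, "B-MODEL": 1.5},
             {"O": 1.0, "B-MODEL": 0.3}]
transitions = {"O": {"O": 0.5, "B-MODEL": 0.2},
               "B-MODEL": {"O": 0.4, "B-MODEL": -1.0}}
decoded = viterbi(emissions, transitions, labels)
```

The transition scores are what lets a CRF discourage implausible label sequences, such as two consecutive B-MODEL tags in this toy setup.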
Attention refers to the Attention Model, which imitates the attention mechanism of the human brain. For example, when viewing a picture, a person can see the whole picture, but on close inspection the eyes focus on only a small region, and the brain attends mainly to that region. In other words, the brain's attention is not distributed evenly over the whole picture but is weighted toward certain parts, which is the core idea of the Attention Model in deep learning. The overall calculation flow is shown in fig. 3.
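The idea of unevenly distributed weights can be sketched in a few lines of Python. The text does not state which attention variant is used, so the dot-product scoring, the softmax normalization and the function name `attention` below are assumptions chosen as a common minimal form.

```python
import math

def attention(query, keys, values):
    """Dot-product attention: softmax(query . key_i) weighted sum of values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(dim)]
    return weights, context

# Three hypothetical token vectors; the third matches the query most strongly.
keys = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
weights, context = attention([1.0, 0.0], keys, keys)
```

The weights sum to one, and the position whose key best matches the query receives most of the weight, which corresponds to the small region of the picture that the brain focuses on.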
In this embodiment, the step S2 includes:
step S21: extracting text data and/or table data from at least one document to be extracted. The document to be extracted is a new document in which model words have not been marked, that is, a document from which model words have not yet been extracted.
Step S22: judging whether table data exists, if so, executing step S23, otherwise, executing step S24;
step S23: matching model words under a header in table data according to the model column name dictionary, and extracting the model words in the table;
step S24: segmenting the text data with a word segmenter, and inferring, according to the model word inference model, whether each segmented word is a model word, so as to extract the model words in the text. After the model word inference model has been trained in step S1, it captures the general attributes and judgment rules of model words; continued training can refine these further and improve the inference accuracy of the model.
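The branch in steps S22 to S24 can be sketched as follows. This is an illustration under stated assumptions rather than the described system: the document layout (a dict with optional `tables` and `text` keys), the helper names, and the `looks_like_model` predicate, which is only a placeholder for a trained model word inference model, are all hypothetical.

```python
def extract_from_table(header, rows, column_name_dictionary):
    """Return the cells found under any header listed in the dictionary."""
    wanted = [i for i, name in enumerate(header)
              if name.strip().lower() in column_name_dictionary]
    found = []
    for row in rows:
        for i in wanted:
            if i < len(row) and row[i].strip():
                found.append(row[i].strip())
    return found

def extract_model_words(document, column_name_dictionary, infer_is_model_word):
    """S22: use table matching if table data exists, otherwise text inference."""
    if document.get("tables"):
        words = []
        for header, rows in document["tables"]:
            words.extend(extract_from_table(header, rows, column_name_dictionary))
        return words
    tokens = document.get("text", "").split()  # stand-in for a word segmenter
    return [tok for tok in tokens if infer_is_model_word(tok)]

def looks_like_model(tok):
    """Placeholder for the trained model: accepts tokens mixing letters and digits."""
    return any(c.isdigit() for c in tok) and any(c.isalpha() for c in tok)

dictionary = {"part number", "model"}
doc_with_table = {"tables": [(["Part Number", "Qty"],
                              [["LM358", "10"], ["NE555", "5"]])]}
doc_with_text = {"text": "Use the NE555 timer here"}

table_words = extract_model_words(doc_with_table, dictionary, looks_like_model)
text_words = extract_model_words(doc_with_text, dictionary, looks_like_model)
```

Because the method allows matching extraction and/or inference extraction, a variant that runs both branches on the same document and merges the results would also be consistent with the description.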
In some embodiments, the training documents and the documents to be extracted generally include documents from multiple manufacturers, each of which may have a corresponding manufacturer attribute such as an identification number. It is sometimes necessary to infer accurately the model words of one particular manufacturer or of several similar manufacturers. The model word inference model therefore includes at least one single-group manufacturer model word inference model for inferring the model words of a single group of manufacturers and/or an all-manufacturer model word inference model for inferring the model words of all manufacturers. Each single-group manufacturer model word inference model corresponds to one manufacturer or to several similar manufacturers. The all-manufacturer model word inference model is trained on, and infers from, the documents of all manufacturers without distinguishing between manufacturers.
Accordingly, the step S14 includes: obtaining corresponding text data according to manufacturer attributes to which the training documents belong, utilizing a word segmentation device to segment the text data, obtaining model words with marks after word segmentation, and inputting the model words into a single-group manufacturer model word presumption model corresponding to the manufacturer attributes for recognition training.
And/or performing word segmentation on the text data by using a word segmentation device, acquiring the model words with the marks after word segmentation, and inputting the model words into the model word inference models of all manufacturers for recognition training;
accordingly, the step S24 includes:
acquiring corresponding text data according to manufacturer attributes to which a document to be extracted belongs, segmenting the text data by using a word segmentation device, and performing inference on whether segmented words are model words of a manufacturer according to a single-group manufacturer model word inference model corresponding to the manufacturer attributes to extract model words of the manufacturer from the text;
and/or performing word segmentation on the text data by using a word segmentation device, and performing inference on whether the word after word segmentation is a model word according to all manufacturer model word inference models, so as to extract the model word in the text.
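The selection between single-group and all-manufacturer models can be sketched as a simple lookup keyed by the manufacturer attribute. Everything named here is hypothetical: the manufacturer identifier `vendor_a`, the helper names, and the two lambda predicates, which merely stand in for trained inference models.

```python
def pick_models(manufacturer_id, single_group_models, all_manufacturer_model=None):
    """Return the predicates to apply to a document of `manufacturer_id`."""
    models = []
    if manufacturer_id in single_group_models:
        models.append(single_group_models[manufacturer_id])
    if all_manufacturer_model is not None:
        models.append(all_manufacturer_model)
    return models

def infer_model_words(tokens, models):
    """Keep tokens accepted by at least one model, preserving first-seen order."""
    seen, result = set(), []
    for tok in tokens:
        if tok not in seen and any(model(tok) for model in models):
            seen.add(tok)
            result.append(tok)
    return result

# Placeholder predicates standing in for trained models (assumptions).
single_group = {"vendor_a": lambda t: t.startswith("STM32")}
all_manufacturers = lambda t: t.isupper() and any(c.isdigit() for c in t)

tokens = "The STM32F103 and LM358 parts".split()
words_a = infer_model_words(
    tokens, pick_models("vendor_a", single_group, all_manufacturers))
words_b = infer_model_words(tokens, pick_models("vendor_b", single_group))
```

With no matching single-group model and no all-manufacturer model, nothing is inferred, which mirrors the "and/or" wording: at least one of the two kinds of model has to be available for text inference.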
In this embodiment, the method for automatically extracting the type words of the electronic component further includes:
step S3: storing the extracted model words into a model word library, and marking the model words in the training documents according to the model word library. By continually extracting model words from new documents to be extracted and marking the newly obtained model words in the training documents, the invention keeps improving the model column name dictionary and the training of the model word inference model, so that subsequent matching extraction and inference extraction become more accurate. In some embodiments, before step S3, the method further includes aggregating the model words extracted via the model column name dictionary and via the model word inference model, and storing the aggregated model words into the model word library.
In some embodiments, when the model words extracted by a single-group manufacturer model word inference model and by the all-manufacturer model word inference model are the same, they may be merged before being stored in the model word library. In other embodiments, words that were extracted by mistake and are not model words can be filtered out during aggregation before storage in the model word library; the filtering may be automatic or manual.
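The aggregation, merging and filtering described above can be sketched as an order-preserving union with a reject list. The function name `aggregate`, the `reject` parameter and the sample words are assumptions for illustration; how rejected words are identified (automatically or manually) is left open, as in the text.

```python
def aggregate(table_words, single_group_words, all_manufacturer_words, reject=()):
    """Merge results, drop duplicates and rejected words, keep first-seen order."""
    rejected = set(reject)
    seen = set()
    library = []
    for word in [*table_words, *single_group_words, *all_manufacturer_words]:
        if word in seen or word in rejected:
            continue
        seen.add(word)
        library.append(word)
    return library

def mark_document(tokens, library):
    """Mark each token of a training document against the model word library."""
    known = set(library)
    return [(tok, tok in known) for tok in tokens]

library = aggregate(["LM358"],
                    ["STM32F103", "LM358"],
                    ["LM358", "RoHS2"],
                    reject={"RoHS2"})  # hypothetical false positive
marked = mark_document("LM358 is RoHS2 compliant".split(), library)
```

The marked tokens can then be fed back as training data, which is the feedback loop that step S3 relies on.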
A second embodiment, as shown in fig. 2, discloses an automatic extraction system for electronic component type words, including:
the training module is used for constructing a model column name dictionary and training a model word presumption model in advance according to a training document;
and the extraction module is used for obtaining the document to be extracted, matching and extracting the model words in the table according to the model column name dictionary, and/or speculating and extracting the model words in the text according to the model word speculation model.
Specifically, in this embodiment, the training module includes:
the training data module is used for extracting text data and/or table data from at least one training document. The training documents are documents in which the model words have been marked, that is, documents from which model words have already been extracted. In some embodiments, the training document is a PDF document of electronic component content obtained from the CMS of the e-commerce system. Preferably, to clear junk data, the system further comprises a cleaning module, used for filtering out image data and/or garbled-character data caused by PDF format problems during extraction of the text data and/or table data, and/or for correcting data with a disordered format, so that the correctness of the system is not affected and the extracted data stays as consistent as possible with what a human reader sees;
the training judgment module is used for judging whether table data exist or not, if so, executing the dictionary construction module, and if not, executing the model training module;
the dictionary construction module is used for constructing a model column name dictionary through empirical conjecture according to the header data of the table. In some embodiments, since the type of the data in a column is in most cases indicated by the table header, the model words present in the table can be identified from the header data based on human experience, that is, by empirical conjecture, so as to form the model column name dictionary;
and the model training module is used for segmenting the text data with a word segmenter, obtaining the marked model words after segmentation, and inputting them into the model word inference model for recognition training. In some embodiments, the jieba tokenizer may be used to segment Chinese text data and the Stanford tokenizer may be used to segment English text data, yielding segmented Chinese and English text.
It should be noted that, for the intelligent extraction of model words from text, the invention adopts a Named Entity Recognition (NER) method from natural language processing: from a large number of training documents in which model words have been extracted and marked, it extracts table headers, segments the text, constructs the model column name dictionary and builds the model word inference model, and then performs matching extraction and inference extraction of model words on new data (documents to be extracted). Therefore, the training module is further used for constructing the model column name dictionary and training the model word inference model from the training documents by using a specific named entity recognition method, namely BiLSTM-CRF + Attention.
BiLSTM-CRF combines a bidirectional long short-term memory network with a conditional random field. LSTM stands for Long Short-Term Memory and is a kind of RNN (Recurrent Neural Network). Owing to its design, LSTM is well suited to modeling sequential data such as text.
BiLSTM is an abbreviation of Bi-directional Long Short-Term Memory and is formed by combining a forward LSTM and a backward LSTM. Both LSTM and BiLSTM are often used to model context information in natural language processing tasks.
CRF (conditional random field) is a sequence labeling algorithm that receives an input sequence X = (x_1, x_2, …, x_n) and outputs a target sequence Y = (y_1, y_2, …, y_n); in this sense it can also be regarded as a seq2seq model. Capital X and Y denote sequences here. For example, in the part-of-speech tagging task, the input sequence is a string of words and the output sequence is the corresponding parts of speech.
Attention refers to the Attention Model, which imitates the attention mechanism of the human brain. For example, when viewing a picture, a person can see the whole picture, but on close inspection the eyes focus on only a small region, and the brain attends mainly to that region. In other words, the brain's attention is not distributed evenly over the whole picture but is weighted toward certain parts, which is the core idea of the Attention Model in deep learning. The overall calculation flow is shown in fig. 3.
In this embodiment, the extracting module includes:
the data extraction module is used for extracting text data and/or table data from at least one document to be extracted. The document to be extracted is a new document in which model words have not been marked, that is, a document from which model words have not yet been extracted.
The extraction judging module is used for judging whether table data exists or not, if so, the table extraction module is executed, and if not, the text extraction module is executed;
the table extraction module is used for matching model words under the table head in the table data according to the model column name dictionary and extracting the model words in the table;
and the text extraction module is used for segmenting the text data by using the word segmenter, and extracting the model words in the text according to the model word presumption model for presuming whether the segmented words are model words. After the model word presumption model is trained by the training module, the general attributes and the judgment rules of the model words can be obtained, the general attributes and the judgment rules can be further improved through continuous training, the presumption accuracy of the model word presumption model is improved, and when the word after word segmentation is obtained, whether the word belongs to the model words or not is presumed according to the obtained general attributes and the judgment rules of the model words, so that the model words in the text are extracted.
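The module flow described above (check for table data, then match by header or fall back to text inference) can be sketched as follows. This is only an illustration under assumptions: the dictionary entries, the document layout (`tables`, `text`), the regular-expression tokenizer, and the rule-based `looks_like_model_word` stand-in for the trained model are all hypothetical.

```python
import re

# Hypothetical entries of the model column name dictionary.
MODEL_COLUMN_NAMES = {"model", "part number", "part no.", "mpn"}

def extract_from_table(table):
    """Collect cell values from columns whose header appears in the dictionary."""
    header, *rows = table
    model_cols = [i for i, name in enumerate(header)
                  if name.strip().lower() in MODEL_COLUMN_NAMES]
    return [row[i].strip() for row in rows for i in model_cols
            if i < len(row) and row[i].strip()]

def looks_like_model_word(token):
    """Stand-in for the trained inference model: mixed letters and digits, length >= 4."""
    return (len(token) >= 4
            and re.search(r"[A-Za-z]", token) is not None
            and re.search(r"\d", token) is not None)

def extract_from_text(text):
    """Segment the text with a simple tokenizer and keep tokens the stand-in model accepts."""
    tokens = re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", text)
    return [t for t in tokens if looks_like_model_word(t)]

def extract_model_words(document):
    """Use table extraction when table data exists, otherwise text extraction."""
    tables = document.get("tables") or []
    if tables:
        return [w for table in tables for w in extract_from_table(table)]
    return extract_from_text(document.get("text", ""))

doc_with_table = {"tables": [[["Part Number", "Package"],
                              ["STM32F103C8T6", "LQFP48"],
                              ["LM317T", "TO-220"]]]}
doc_with_text = {"text": "The LM317T regulator replaces the older 7805 part."}
```

Calling `extract_model_words(doc_with_table)` reads only the column named in the dictionary, while `extract_model_words(doc_with_text)` relies on the per-token decision. In a real system the rule in `looks_like_model_word` would be replaced by the trained model.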
In some embodiments, the training documents and the documents to be extracted generally include documents from multiple manufacturers, each of which may have corresponding manufacturer attributes such as an identification number. Sometimes the model words of one manufacturer, or of several similar manufacturers, need to be inferred accurately. The model word inference model therefore includes at least one single-group manufacturer model word inference model, used for inferring the model words of a single group of manufacturers, and/or an all-manufacturer model word inference model, used for inferring the model words of all manufacturers. Each single-group manufacturer model word inference model corresponds to one manufacturer or to several similar manufacturers. The all-manufacturer model word inference model is trained on, and infers from, the documents of all manufacturers without distinguishing between manufacturers.
Accordingly, the model training module comprises:
the single-group manufacturer model training module is used for acquiring the corresponding text data according to the manufacturer attribute to which a training document belongs, segmenting the text data with a word segmenter, acquiring the marked model words after segmentation, and inputting them into the single-group manufacturer model word inference model corresponding to that manufacturer attribute for recognition training;
and/or the all-manufacturer model training module is used for segmenting the text data with a word segmenter, acquiring the marked model words after segmentation, and inputting them into the all-manufacturer model word inference model for recognition training.
The text extraction module comprises:
the single-group manufacturer text extraction module is used for acquiring the corresponding text data according to the manufacturer attribute to which a document to be extracted belongs, segmenting the text data with a word segmenter, and using the single-group manufacturer model word inference model corresponding to that manufacturer attribute to infer whether each segmented word is a model word of the manufacturer, so as to extract the manufacturer's model words from the text;
and/or the all-manufacturer text extraction module is used for segmenting the text data with a word segmenter and using the all-manufacturer model word inference model to infer whether each segmented word is a model word, so as to extract the model words from the text.
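The routing between single-group and all-manufacturer models might look like the sketch below. The models are represented as plain predicate functions, and the choice to accept a token when any selected model accepts it is an assumption made for illustration; the manufacturer key `vendor_a` and both predicates are hypothetical.

```python
def pick_models(manufacturer_id, single_group_models, all_manufacturer_model=None):
    """Select the models to apply to a document with the given manufacturer attribute."""
    models = []
    if manufacturer_id in single_group_models:
        models.append(single_group_models[manufacturer_id])
    if all_manufacturer_model is not None:
        models.append(all_manufacturer_model)
    return models

def extract_with_models(tokens, models):
    """Keep tokens that at least one model accepts, in order and without repeats.

    Assumption: results of several models are combined with a logical OR.
    """
    seen, result = set(), []
    for tok in tokens:
        if tok not in seen and any(model(tok) for model in models):
            seen.add(tok)
            result.append(tok)
    return result

# Hypothetical stand-ins for trained models.
single_group = {"vendor_a": lambda t: t.startswith("STM32")}
generic = lambda t: any(c.isdigit() for c in t) and any(c.isalpha() for c in t)

tokens = ["STM32F103", "uses", "LM317", "STM32F103"]
both = extract_with_models(tokens, pick_models("vendor_a", single_group, generic))
only_single = extract_with_models(tokens, pick_models("vendor_a", single_group))
```

With both models active the generic model also picks up `LM317`; with only the single-group model, the result is restricted to that manufacturer's naming pattern.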
In this embodiment, the automatic extraction system for electronic component model words further includes:
the marking module is used for storing the extracted model words into a model word library and marking the model words in the training documents according to the model word library. By continuously extracting model words from new documents to be extracted, the system obtains new model words and marks them in the training documents, which continuously improves the model column name dictionary and the training of the model word inference model, so that subsequent matching extraction and inference extraction become more accurate. In some embodiments, the model words extracted with the model column name dictionary and with the model word inference model are aggregated before being stored in the model word library.
In some embodiments, when a single-group manufacturer model word inference model and the all-manufacturer model word inference model extract the same model words, the results may be merged before being stored in the model word library. In some other embodiments, the system further includes a filtering module which, before storage in the model word library, aggregates the extracted words and filters out words that were extracted by mistake and are not model words; the filtering may be automatic or manual.
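The merge, filter, and marking steps can be sketched as simple set operations. The function names, the `reject` filter rule, and the sample words are hypothetical assumptions; the patent leaves the filtering criteria open (automatic or manual).

```python
def merge_into_library(library, *extracted_lists, reject=lambda word: False):
    """Add extracted words to the library set, skipping words the filter rejects.

    Using a set means identical words from different models are merged automatically.
    """
    for words in extracted_lists:
        for word in words:
            if not reject(word):
                library.add(word)
    return library

def mark_document(tokens, library):
    """Label each token of a training document as a model word (True) or not (False)."""
    return [(tok, tok in library) for tok in tokens]

library = set()
merge_into_library(
    library,
    ["LM317T", "TO-220"],        # e.g. words from one model
    ["LM317T", "STM32F103"],     # e.g. words from another model
    reject=lambda w: w.startswith("TO-"),  # assumed rule: package codes are not model words
)
marked = mark_document(["The", "LM317T", "works"], library)
```

The marked tokens can then be fed back as labeled training data, which corresponds to the feedback loop between extraction and training described above.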
By implementing the invention, the following beneficial effects are achieved:
According to the invention, a model column name dictionary is constructed and a model word inference model is trained from training documents; a document to be extracted is then obtained, and the model words in tables are matched and extracted according to the model column name dictionary, and/or the model words in text are inferred and extracted according to the model word inference model. The model words of components can thus be extracted automatically from the large volume of electronic component data published by electronics manufacturers, which reduces manual effort, improves extraction accuracy, and improves the experience of the e-commerce system.
It is to be understood that the foregoing examples, while indicating the preferred embodiments of the invention, are given by way of illustration and description, and are not to be construed as limiting the scope of the invention; it should be noted that, for those skilled in the art, the above technical features can be freely combined, and several changes and modifications can be made without departing from the concept of the present invention, which all belong to the protection scope of the present invention; therefore, all equivalent changes and modifications made within the scope of the claims of the present invention should be covered by the claims of the present invention.

Claims (10)

1. An automatic extraction method for electronic component model words, characterized by comprising the following steps:
S1: constructing a model column name dictionary and training a model word inference model according to training documents;
S2: obtaining a document to be extracted, and performing matching extraction of the model words in tables according to the model column name dictionary, and/or performing inference extraction of the model words in text according to the model word inference model.
2. The automatic extraction method for electronic component model words according to claim 1, wherein the step S1 includes:
S11: extracting text data and/or table data from at least one training document, the training document being a document in which the model words have been marked;
S12: judging whether table data exists; if so, executing step S13, otherwise executing step S14;
S13: constructing the model column name dictionary through empirical inference from the header data of the table;
S14: segmenting the text data with a word segmenter, acquiring the marked model words after segmentation, and inputting them into the model word inference model for recognition training.
3. The automatic extraction method for electronic component model words according to claim 2, wherein the step S2 includes:
S21: extracting text data and/or table data from at least one document to be extracted;
S22: judging whether table data exists; if so, executing step S23, otherwise executing step S24;
S23: matching the model words under the header in the table data according to the model column name dictionary, and extracting the model words in the table;
S24: segmenting the text data with a word segmenter, and using the model word inference model to infer whether each segmented word is a model word, so as to extract the model words in the text.
4. The automatic extraction method of the electronic component model words as claimed in claim 3, wherein the model word inference model includes at least one single-set manufacturer model word inference model for inferring model words of a single set of manufacturers and/or all manufacturer model word inference models for inferring model words of all manufacturers;
the step S14 includes: acquiring corresponding text data according to manufacturer attributes to which training documents belong, segmenting the text data by using a word segmentation device, acquiring model words with marks after segmentation, and inputting the model words into the single-group manufacturer model word presumption model corresponding to the manufacturer attributes for recognition training;
and/or performing word segmentation on the text data by using a word segmentation device, acquiring the model words with the marks after word segmentation, and inputting the model words into the model word inference models of all manufacturers for recognition training;
the step S24 includes:
acquiring corresponding text data according to manufacturer attributes to which the documents to be extracted belong, segmenting the text data by using a word segmentation device, and performing inference on whether segmented words are model words of the manufacturer according to the single-group manufacturer model word inference model corresponding to the manufacturer attributes to extract model words of the manufacturer in the text;
and/or performing word segmentation on the text data by using a word segmentation device, and performing inference on whether the word after word segmentation is a model word according to the model word inference model of all manufacturers, so as to extract the model word in the text.
5. The automatic extraction method for electronic component model words according to claim 3, further comprising: discarding picture data and/or garbled data during the extraction process.
6. The automatic extraction method for electronic component model words according to claim 1, further comprising:
S3: storing the extracted model words into a model word library, and marking the model words in the training documents according to the model word library.
7. An automatic extraction system of electronic component model words is characterized by comprising:
the training module is used for constructing a model column name dictionary and training a model word presumption model in advance according to a training document;
and the extraction module is used for obtaining the document to be extracted, matching and extracting model words in the table according to the model column name dictionary, and/or speculating and extracting the model words in the text according to the model word speculation model.
8. The automatic extraction system for electronic component model words according to claim 7, wherein the training module comprises:
the training data module is used for extracting text data and/or table data from at least one training document; the training document is a document in which the model words have been marked;
the training judgment module is used for judging whether table data exist or not, if so, executing the dictionary construction module, and if not, executing the model training module;
the dictionary construction module is used for constructing the model column name dictionary through empirical conjecture according to the header data of the table;
and the model training module is used for segmenting the text data by using the word segmenter, acquiring the model words with the marks after the word segmentation, and inputting the model words into the model word presumption model for recognition training.
9. The system for automatically extracting words of electronic component types according to claim 8, wherein the extraction module comprises:
the data extraction module is used for extracting text data and/or table data from at least one document to be extracted;
the extraction judging module is used for judging whether table data exists or not, if so, the table extraction module is executed, and if not, the text extraction module is executed;
the table extraction module is used for matching model words under the table head in the table data according to the model column name dictionary and extracting the model words in the table;
and the text extraction module is used for segmenting the text data by using a word segmentation device, and extracting the model words in the text according to the model word presumption model for presuming whether the segmented words are model words.
10. The system for automatically extracting model words of electronic components according to claim 9, wherein the model word inference model includes at least one single-set manufacturer model word inference model for inferring model words of a single set of manufacturers and/or all manufacturer model word inference models for inferring model words of all manufacturers;
the model training module comprises:
the single-group manufacturer model training module is used for acquiring corresponding text data according to manufacturer attributes to which training documents belong, segmenting the text data by using a word segmentation device, acquiring model words with marks after segmentation, and inputting the model words into the single-group manufacturer model word presumption model corresponding to the manufacturer attributes for recognition training;
and/or all manufacturer model training modules are used for segmenting the text data by utilizing a word segmentation device, acquiring the model words with the marks after the word segmentation, and inputting the model words into all manufacturer model word presumption models for recognition training;
the text extraction module comprises:
the single-group manufacturer text extraction module is used for acquiring corresponding text data according to manufacturer attributes to which the documents to be extracted belong, segmenting the text data by using a word segmentation device, and performing inference on whether segmented words are model words of the manufacturer according to the single-group manufacturer model word inference model corresponding to the manufacturer attributes to extract model words of the manufacturer from the text;
and/or all manufacturer text extraction modules are used for utilizing word segmenters to segment words of text data, and extracting model words in the text according to the inference of whether the segmented words are model words or not by the aid of all manufacturer model word inference models.
CN202110177411.8A 2021-02-07 2021-02-07 Automatic extraction method and system for electronic component model words Pending CN112818693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110177411.8A CN112818693A (en) 2021-02-07 2021-02-07 Automatic extraction method and system for electronic component model words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110177411.8A CN112818693A (en) 2021-02-07 2021-02-07 Automatic extraction method and system for electronic component model words

Publications (1)

Publication Number Publication Date
CN112818693A true CN112818693A (en) 2021-05-18

Family

ID=75864680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110177411.8A Pending CN112818693A (en) 2021-02-07 2021-02-07 Automatic extraction method and system for electronic component model words

Country Status (1)

Country Link
CN (1) CN112818693A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101564A (en) * 2006-07-07 2008-01-09 上海晨兴电子科技有限公司 Automatic identification method for flash memory type of product
CN101576910A (en) * 2009-05-31 2009-11-11 北京学之途网络科技有限公司 Method and device for identifying product naming entity automatically
CN102033950A (en) * 2010-12-23 2011-04-27 哈尔滨工业大学 Construction method and identification method of automatic electronic product named entity identification system
CN104408173A (en) * 2014-12-11 2015-03-11 焦点科技股份有限公司 Method for automatically extracting kernel keyword based on B2B platform
CN110688855A (en) * 2019-09-29 2020-01-14 山东师范大学 Chinese medical entity identification method and system based on machine learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
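张朝胜;郭剑毅;线岩团;余正涛;雷春雅;王海雄: "English product named entity recognition based on conditional random fields", Computer Engineering & Science, no. 06, pages 115 - 117 *
谷川;周宏宇;于江德: "Chinese product named entity recognition fusing multiple features", Science Technology and Engineering, no. 31, pages 9417 - 9421 *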

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609279A (en) * 2021-08-05 2021-11-05 湖南特能博世科技有限公司 Material model extraction method and device and computer equipment
CN113609279B (en) * 2021-08-05 2023-12-08 湖南特能博世科技有限公司 Material model extraction method and device and computer equipment
CN113626561A (en) * 2021-08-16 2021-11-09 深圳市云采网络科技有限公司 Component model identification method, device, medium and equipment

Similar Documents

Publication Publication Date Title
CN110852087B (en) Chinese error correction method and device, storage medium and electronic device
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN113486189B (en) Open knowledge graph mining method and system
CN107688630B (en) Semantic-based weakly supervised microbo multi-emotion dictionary expansion method
CN109472022B (en) New word recognition method based on machine learning and terminal equipment
CN111767725A (en) Data processing method and device based on emotion polarity analysis model
CN110348012B (en) Method, device, storage medium and electronic device for determining target character
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN112613315B (en) Text knowledge automatic extraction method, device, equipment and storage medium
CN111144079A (en) Method and device for intelligently acquiring learning resources, printer and storage medium
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN112818693A (en) Automatic extraction method and system for electronic component model words
CN110298041A (en) Rubbish text filter method, device, electronic equipment and storage medium
CN114970502B (en) Text error correction method applied to digital government
CN107783958B (en) Target statement identification method and device
Sagcan et al. Toponym recognition in social media for estimating the location of events
CN111681731A (en) Method for automatically marking colors of inspection report
CN109657207B (en) Formatting processing method and processing device for clauses
CN116306506A (en) Intelligent mail template method based on content identification
Bladier et al. German and French neural supertagging experiments for LTAG parsing
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN111341404B (en) Electronic medical record data set analysis method and system based on ernie model
CN111046657B (en) Method, device and equipment for realizing text information standardization
Wong et al. iSentenizer: An incremental sentence boundary classifier
CN107590163A (en) The methods, devices and systems of text feature selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination