CN117558019A

CN117558019A - Method for automatically extracting symbol map parameters from PDF format component manual

Info

Publication number: CN117558019A
Application number: CN202410038434.4A
Authority: CN
Inventors: 吴绿; 李宁
Original assignee: Wuhan University of Technology WUT
Current assignee: Wuhan University of Technology WUT
Priority date: 2024-01-11
Filing date: 2024-01-11
Publication date: 2024-02-13
Anticipated expiration: 2044-01-11
Also published as: CN117558019B

Abstract

The invention relates to the technical field of electronic design automation and discloses a method for automatically extracting symbol map parameters from a PDF format component manual. The method comprises a PDF format component manual preprocessing module, an editable component symbol map parameter extraction module, a table type BGA image component symbol map parameter identification and extraction module, a non-table type BGA image component symbol map parameter identification and extraction module, and a symbol map parameter format conversion module. The symbol map parameter format conversion module converts the symbol map parameters obtained by the editable component symbol map parameter extraction module, the table type BGA image component symbol map parameter identification and extraction module and the non-table type BGA image component symbol map parameter identification and extraction module into the symbol map file format of a PCB EDA tool. The method achieves a high degree of automation in component model library construction, has high accuracy, and greatly improves the efficiency of library construction work.

Description

Method for automatically extracting symbol map parameters from PDF format component manual

Technical Field

The invention relates to the technical field of electronic design automation, in particular to a method for automatically extracting symbol map parameters from a PDF format component manual.

Background

A component schematic symbol database is indispensable to the schematic design tools and component selection tools of PCB EDA. Components come in many categories, and each category contains many models, so the total number of models can reach tens of millions. Besides PCB EDA tool vendors, many large and medium-sized electronics enterprises, electronic component sales platforms and PCB and chip factories build their own component libraries, and these libraries or the corresponding product information must be built and kept up to date.

At present, building a symbol map model database relies essentially on manual work: an operator reads the component data manual, finds each component symbol diagram, and enters the relevant parameters into the database one by one. These manuals are almost always in PDF format and range from tens to thousands of pages; the component symbol diagrams are highly varied, and drawing conventions differ between manufacturers. Library construction therefore consumes a large amount of manpower on simple, repetitive work, is inefficient, and produces errors that are difficult to detect.

Although a small number of semi-automatic symbol map library building tools have appeared on the market, the component data manual must still be paged through manually, the category of each symbol diagram judged and the diagram frame-selected by hand, a general-purpose OCR tool then used for recognition, and the result finally corrected and entered manually. Such tools ignore the fact that the symbol diagrams in most current component manuals are editable and machine-readable; instead they take a frame-selected screenshot of the symbol diagram and call a general-purpose OCR tool on it. Because component parameters contain many combined symbols such as overlines, underlines and slashes, the recognition accuracy of general-purpose OCR tools is low, and many false detections and missed detections occur, so a large amount of manpower is still needed to correct the results one by one.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art and provides a method for automatically extracting symbol map parameters from a PDF format component manual, which achieves a high degree of automation in component model library construction, has high accuracy, and greatly improves the efficiency of library construction work.

In order to achieve the above purpose, the method for automatically extracting the symbol map parameters from the PDF format component manual comprises a PDF format component manual preprocessing module, an editable component symbol map parameter extracting module, a table type BGA image type component symbol map parameter identifying and extracting module, a non-table type BGA image type component symbol map parameter identifying and extracting module and a symbol map parameter format converting module;

the PDF format component manual preprocessing module adopts keyword positioning and electronic component symbol diagram classification recognition methods to process all component symbol diagrams in the PDF format component manual, recognize the types of the component symbol diagrams, judge whether the component symbol diagrams can be edited or not and position page numbers and coordinate ranges of the component symbol diagrams;

the editable component symbol map parameter extraction module, for an editable component symbol diagram, uses the PDF content extraction tool pdfplumber to extract symbol map parameters from the page numbers and coordinate ranges located by the PDF format component manual preprocessing module, then performs semantic alignment and sorting on the symbol map parameters using the symbol map semantic alignment method, and finally performs semantic logic repair;

the table type BGA image component symbol map parameter identification and extraction module, for a table type BGA image component symbol diagram, uses image processing techniques to first remove color, then enhance the image, then divide it into rectangular blocks, performs OCR recognition on each rectangular block to extract symbol map parameters, then performs semantic alignment and sorting on the symbol map parameters using the symbol map semantic alignment method, and finally performs semantic logic repair;

the non-table type BGA image component symbol map parameter identification and extraction module, for a non-table type BGA image component symbol diagram, first performs OCR recognition to extract symbol map parameters, then performs semantic alignment and sorting on the symbol map parameters using the symbol map semantic alignment method, and finally performs semantic logic repair;

the symbol map parameter format conversion module converts the symbol map parameters obtained by the editable component symbol map parameter extraction module, the table type BGA image component symbol map parameter identification and extraction module and the non-table type BGA image component symbol map parameter identification and extraction module into the symbol map file format of a PCB EDA tool;

the electronic component symbol diagram classification and recognition method uses a ResNet50 network, which is trained on a data set built from component pictures with various outline characteristics selected from the PDF format component manuals of different manufacturers, so as to recognize the type of each component symbol diagram;

the symbol map semantic alignment method takes the symbol map parameters extracted from an editable component symbol diagram using pdfplumber, or extracted from an image symbol diagram using OCR, semantically aligns the pin numbers with the pin names according to the characteristics of the symbol diagram type, and then sorts them by pin number.

Preferably, the electronic component symbol diagram classification and recognition method comprises: determining the number of component symbol diagram types, and selecting double-row type, quadrilateral type, coil type BGA and table type BGA component pictures from PDF format component manuals as a data set; and using ResNet50 as the backbone of the classifier, building a classification network, and training the network with the data set to obtain a classification model containing weight parameters.

Preferably, in the symbol map semantic alignment method:

if the diagram is a double-row component symbol diagram, semantic alignment is performed according to the arrangement rule of character strings and numbers (character string, number and character string), and the results are finally sorted by pin number;

if the diagram is a quadrilateral component symbol diagram, semantic alignment is performed according to the arrangement rule of character strings and numbers (character string, number and character string), the quadrilateral is then rotated 90 degrees clockwise and semantic alignment is performed again according to the same arrangement rule, and the results are finally sorted by pin number;

if the diagram is a coil type BGA component symbol diagram, it is judged whether number and character string labels are present around the periphery of the coil type BGA; if so, the row and column pin numbers are aligned, the pin name in each coil is then read, and the results are sorted by pin number; otherwise, the combinations of character strings and numbers in the coils are read in row order as pin numbers, the corresponding pin names are read on the following line, the pin numbers and names are matched according to their adjacency within the same coil, and the results are sorted;

if the diagram is a table type BGA component symbol diagram, it is judged whether ordinal numbers and character strings are present on the rows and columns around the periphery of the table type BGA; if so, the character strings are matched with the numbers by row and column within the coordinate range of the diagram to obtain the pin numbers; the pin names of the table type BGA are then read and aligned with the pin numbers; if the table type BGA spans several pages, the pin number characters along the rightmost column and the bottom row are detected, each page is processed in the same way as the previous page, and finally the data of all pages are combined and sorted by pin number.
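The double-row case can be illustrated with a minimal sketch. It assumes that each text row read from the diagram yields four tokens in the order left pin name, left pin number, right pin number, right pin name; this token order, the function name `align_dual_row` and the sample pin names are illustrative assumptions and are not specified in the source.

```python
from typing import List, Tuple


def align_dual_row(rows: List[List[str]]) -> List[Tuple[int, str]]:
    """Pair pin numbers with pin names for a double-row symbol diagram.

    Assumed row layout (hypothetical): [left_name, left_number,
    right_number, right_name]. Rows that do not fit are skipped so that
    a later repair step can deal with them.
    """
    pins = []
    for tokens in rows:
        if len(tokens) != 4:
            continue
        left_name, left_num, right_num, right_name = tokens
        if not (left_num.isdigit() and right_num.isdigit()):
            continue
        pins.append((int(left_num), left_name))
        pins.append((int(right_num), right_name))
    # Final step described in the text: sort by pin number.
    return sorted(pins)


# Hypothetical 6-pin example, read row by row from top to bottom.
rows = [
    ["VCC", "1", "8", "OUT"],
    ["IN+", "2", "7", "NC"],
    ["IN-", "3", "6", "GND"],
]
aligned = align_dual_row(rows)
```

Skipping malformed rows rather than guessing keeps the alignment step conservative; whether the actual method does this is not stated in the source.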

Preferably, for OCR recognition, the various characters that appear in component symbol diagrams in PDF format component manuals, including numerals, English letters, Greek characters, symbols describing electrical characteristics and the characters marked on those symbols, are selected as the prepared data; ResNet50 is used as the backbone and combined with DBnet to build a text detection network; and the detected visual sequence features are extracted with a Transformer and the characters are predicted using an attention mechanism.

Preferably, when the PDF format component manual preprocessing module performs preprocessing: the PDF format component manual is input and its validity is checked; if it opens normally the process continues, and if it cannot be opened a prompt is shown and the process exits; it is judged whether the manual carries a watermark, and if so the watermark is removed; all plain text pages in the manual are deleted, the remaining pages are searched by keyword to judge whether editable table type BGA component symbol diagrams exist, and the start and end page numbers of all editable table type BGA symbol diagrams in the manual are obtained; a JSON file "symbols_BasicInfo.json" is created; all table pages in the manual are then deleted, the remaining pages are converted into JPG pictures, and the electronic component symbol diagram classification and recognition method is used for searching and matching to obtain all component symbol diagrams in the manual other than the editable table type BGA component symbol diagrams, determine their types, and locate their page numbers and coordinate ranges; the module returns to the page where each component symbol diagram is located and judges whether the diagram is editable; finally, the basic information of every component symbol diagram in the manual (component name, type, whether editable, start and end page numbers, and coordinate range) is written into the "symbols_BasicInfo.json" file.

Preferably, when the editable component symbol map parameter extraction module performs parameter extraction, the basic information of the editable component symbol diagram is read from the "symbols_BasicInfo.json" file;

if the component is a table type BGA component, the module locates the first page of the table in the PDF format component manual, first uses the PDF content extraction tool pdfplumber to read all characters in the table, then performs semantic alignment and sorting using the symbol map semantic alignment method, then processes the subsequent pages of the table type BGA page by page, sorts all symbol map parameters by pin number, and finally writes the symbol map parameters (component name, pin number and pin name) into the "symbols_parameters.json" file;

if the component is a coil type BGA component, a quadrilateral component or a double-row type component, the module first uses pdfplumber to read all symbol map parameters within the coordinate range, then performs semantic alignment and sorting using the symbol map semantic alignment method according to the characteristics of that component type, then performs semantic logic repair, and finally writes the symbol map parameters (component name, pin number, pin name and pin direction) into the "symbols_parameters.json" file.

Preferably, when the table type BGA image symbol map parameter identification and extraction module performs identification and extraction, the basic information of a table type BGA image symbol diagram is read from the "symbols_BasicInfo.json" file; the picture within the coordinate range is first converted to grayscale so that the color picture becomes a binary picture, and the characters in the picture are then enhanced by interpolation; the table diagram is divided into fine-grained rectangular blocks, and OCR character recognition is performed on each fine-grained rectangular block; it is judged whether the next page is a continuation of the component symbol diagram, and if so the above steps are repeated until the whole diagram has been recognized; the recognition results are semantically aligned and sorted using the symbol map semantic alignment method, semantic logic repair is then performed, and finally the symbol map parameters (component name, pin number and pin name) are written into the "symbols_parameters.json" file.
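The color removal at the start of this module can be sketched in plain Python as a grayscale conversion followed by thresholding. The BT.601 luminance weights, the threshold of 128 and the function names are assumptions chosen for illustration; the source does not state which conversion or threshold is used, and the interpolation-based enhancement is omitted here.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0..255


def to_gray(image: List[List[Pixel]]) -> List[List[int]]:
    """Convert RGB pixels to 8-bit luminance using BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in image
    ]


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Mark dark pixels (likely ink) as 1 and light pixels as 0."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


# Hypothetical 2x2 picture: white, black, dark red, bright yellow.
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(200, 30, 30), (250, 250, 0)],
]
binary = binarize(to_gray(image))
```

In this sketch colored cell shading such as the yellow pixel falls on the background side of the threshold, while dark colored text such as the red pixel is kept as ink.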

Preferably, when the non-table type BGA image symbol map parameter identification and extraction module performs identification and extraction, the basic information of a non-table type BGA image symbol diagram is read from the "symbols_BasicInfo.json" file; OCR recognition is performed to recognize all characters in the image symbol diagram; the recognition results are semantically aligned and sorted using the symbol map semantic alignment method, semantic logic repair is then performed, and finally the symbol map parameters (component name, pin number, pin name and pin direction) are written into the "symbols_parameters.json" file.

Compared with the prior art, the invention has the following advantages:

1. the related parameters of all components in the PDF format component manual can be automatically extracted by utilizing the image processing, natural language understanding and deep learning technology, including various parameters required by component symbol diagram library establishment such as component names, types, pin numbers, pin names and the like, and the accuracy is high, so that the efficiency of the electronic component symbol diagram library establishment work can be greatly improved;

2. all component symbol diagrams in a PDF format component manual can be accurately positioned, and the name, the category, the page number and the coordinate range of the component can be extracted;

3. the editable component symbol map parameter extraction module provides a symbol map semantic alignment method for semantically aligning and sorting the characters extracted by pdfplumber, can accurately extract the parameters of editable symbol diagrams, and is suitable for newer component manuals in which editable symbol diagrams account for a high proportion;

4. in the image symbol map parameter identification and extraction modules, an OCR method dedicated to component parameters is provided, which can accurately recognize characters with overlines, underlines and slashes in the symbol map parameters, greatly improving the accuracy of image symbol map parameter recognition;

5. In the image class symbol chart parameter identification and extraction module, a method combining a special OCR method with a symbol chart semantic alignment method is provided, so that parameters in the image class symbol chart can be accurately identified;

6. by using the semantic logic repair method, data that may have been misread or missed are repaired according to the pin arrangement rules and pin naming rules of electronic components, further improving the accuracy of the extracted parameters.
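The source does not spell out the repair rules, so the following is only a sketch of one check that could feed such a repair. It assumes, purely for illustration, that pins are numbered consecutively from 1 up to a known pin count; the function name and this numbering rule are hypothetical.

```python
from typing import List, Tuple


def find_pin_number_issues(
    pin_numbers: List[int], expected_count: int
) -> Tuple[List[int], List[int]]:
    """Return (missing, duplicated) pin numbers.

    Assumes consecutive numbering 1..expected_count, which is one possible
    pin arrangement rule and not necessarily the one used by the method.
    """
    seen = set()
    duplicated = set()
    for number in pin_numbers:
        if number in seen:
            duplicated.add(number)
        seen.add(number)
    missing = [n for n in range(1, expected_count + 1) if n not in seen]
    return missing, sorted(duplicated)


# Hypothetical OCR result for an 8-pin part: pin 3 was read as 2, pin 6 was missed.
missing, duplicated = find_pin_number_issues([1, 2, 2, 4, 5, 7, 8], 8)
```

A duplicated number next to a missing neighbour suggests a misread digit, while a missing number with no duplicate suggests a missed detection.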

Drawings

FIG. 1 is a schematic diagram of a method for automatically extracting symbol map parameters from a PDF format component manual according to the present invention;

FIG. 2 is a flow chart of the method for automatically extracting symbol map parameters from a PDF format component manual according to the present invention;

FIG. 3 is a flow chart of the preprocessing performed by the PDF format component manual preprocessing module;

FIG. 4 is a flow chart of the parameter extraction performed by the editable component symbol map parameter extraction module;

FIG. 5 is a flow chart of the identification and extraction performed by the table type BGA image symbol map parameter identification and extraction module;

FIG. 6 is a flow chart of the identification and extraction performed by the non-table type BGA image symbol map parameter identification and extraction module.

Detailed Description

The invention will now be described in further detail with reference to the drawings and to specific examples.

A method for automatically extracting symbol map parameters from a PDF format component manual is shown in FIG. 1, and comprises a PDF format component manual preprocessing module, an editable component symbol map parameter extracting module, a tabular BGA image type component symbol map parameter identification and extracting module, a non-tabular BGA image type component symbol map parameter identification and extracting module and a symbol map parameter format conversion module;

the PDF format component manual preprocessing module is used for processing all component symbol diagrams in the PDF format component manual by adopting keyword positioning and electronic component symbol diagram classification recognition methods, recognizing the types of the component symbol diagrams, judging whether the component symbol diagrams can be edited or not, and positioning page numbers and coordinate ranges of the component symbol diagrams;

the editable component symbol map parameter extraction module is used, for an editable component symbol diagram, to extract symbol map parameters with the PDF content extraction tool pdfplumber from the page numbers and coordinate ranges located by the PDF format component manual preprocessing module, then to perform semantic alignment and sorting on the symbol map parameters using the symbol map semantic alignment method, and finally to perform semantic logic repair;

the table type BGA image component symbol map parameter identification and extraction module is used, for a table type BGA image component symbol diagram, to apply image processing techniques that first remove color, then enhance the image and then divide it into rectangular blocks, to perform OCR recognition on each rectangular block and extract symbol map parameters, then to perform semantic alignment and sorting on the symbol map parameters using the symbol map semantic alignment method, and finally to perform semantic logic repair;

the non-table type BGA image component symbol map parameter identification and extraction module is used, for a non-table type BGA image component symbol diagram, to first perform OCR recognition and extract symbol map parameters, then to perform semantic alignment and sorting on the symbol map parameters using the symbol map semantic alignment method, and finally to perform semantic logic repair;

the symbol map parameter format conversion module is used to convert the symbol map parameters obtained by the editable component symbol map parameter extraction module, the table type BGA image component symbol map parameter identification and extraction module and the non-table type BGA image component symbol map parameter identification and extraction module into the symbol map file format of a PCB EDA tool;

in addition, the electronic component symbol diagram classification and recognition method uses a ResNet50 network, which is trained on a data set built from component pictures with various outline characteristics selected from the PDF format component manuals of different manufacturers, so that the type of each component symbol diagram can be recognized;

the symbol map semantic alignment method takes the symbol map parameters extracted from an editable component symbol diagram using pdfplumber, or extracted from an image symbol diagram using OCR, semantically aligns the pin numbers with the pin names according to the characteristics of the symbol diagram type, and then sorts them by pin number.

In use, as shown in FIG. 2, this embodiment comprises the following steps:

step S101: preprocessing the input PDF format component manual, finding the symbol diagrams of all components in the manual, writing the name, type, editability, page numbers and coordinate range of each symbol diagram into the file "symbols_BasicInfo.json", and then going to step S102;

step S102: reading the basic information of the next component symbol diagram from the file "symbols_BasicInfo.json" and locating the corresponding page of the PDF format component manual; if the component symbol diagram is editable, going to step S103, and if it is not editable, going to step S104;

step S103: extracting the symbol map parameters of the symbol diagram using the editable component symbol map parameter extraction module, writing them into the file "symbols_parameters.json", and then going to step S107;

step S104: judging whether the component is a table type BGA component; if so, going to step S105, otherwise going to step S106;

step S105: extracting the symbol map parameters of the component symbol diagram using the table type BGA image symbol map parameter identification and extraction module, writing them into the file "symbols_parameters.json", and then going to step S107;

step S106: extracting the symbol map parameters of the component symbol diagram using the non-table type BGA image symbol map parameter identification and extraction module, writing them into the file "symbols_parameters.json", and then going to step S107;

step S107: reading the file "symbols_BasicInfo.json" and judging whether all component symbol diagrams in the PDF format component manual have been processed; if so, going to step S108, and if not, going to step S102;

step S108: using the symbol map parameter format conversion module to convert the symbol map parameters of each component in the file "symbols_parameters.json" into the component symbol map file format of the PCB EDA tool in use, according to user requirements.
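The branching in steps S102 to S107 amounts to choosing one of three extraction modules per symbol diagram. The sketch below shows that routing only; the record fields `name`, `type` and `editable`, the type label `table_bga` and the module labels are hypothetical names introduced for illustration.

```python
from typing import Dict, List, Tuple


def choose_module(record: Dict) -> str:
    """Pick the extraction module for one symbol diagram record."""
    if record["editable"]:
        return "editable_extraction"            # step S103
    if record["type"] == "table_bga":
        return "table_bga_image_extraction"     # step S105
    return "non_table_bga_image_extraction"     # step S106


def route_all(records: List[Dict]) -> List[Tuple[str, str]]:
    """Loop over every record, as steps S102 and S107 do together."""
    return [(record["name"], choose_module(record)) for record in records]


# Hypothetical records as they might be read from symbols_BasicInfo.json.
records = [
    {"name": "U1", "type": "table_bga", "editable": True},
    {"name": "U2", "type": "table_bga", "editable": False},
    {"name": "U3", "type": "quadrilateral", "editable": False},
]
routes = route_all(records)
```

Editability is tested before the diagram type, matching the order of steps S102 and S104: an editable table type BGA diagram goes to the editable module, not to the image module.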

In the above embodiment, the electronic component symbol diagram classification and recognition method comprises determining the number of component symbol diagram types and selecting double-row type, quadrilateral type, coil type BGA and table type BGA component pictures from PDF format component manuals as a data set; ResNet50 is used as the backbone of the classifier to build a classification network, and the network is trained with the data set to obtain a classification model containing weight parameters. Specifically, the method comprises the following steps:

step S201: determining the number of component symbol diagram types, and selecting 500 pictures of double-row type, quadrilateral type, coil type BGA and table type BGA components from PDF format component manuals as a data set; specifically, the data are divided into a training data set and a test data set according to the 8:2 training-to-test ratio commonly used with the ResNet50 network;

step S202: using ResNet50 as the backbone of the classifier, building a classification network, and training the network with the data set obtained in step S201 to obtain a classification model containing weight parameters. In this embodiment, a spatial graph convolution network is used to extract deep features of the pictures from the component schematic symbol gallery, and these features are fed into the classification network for learning, with the learning rate set to 0.005 and the number of iterations set to 500; after the network converges, a classification model containing weight parameters is obtained. Finally, the test picture data are fed into the classification model for inference, a nonlinear sigmoid function is used for classification at the output layer, and each test picture is assigned to a category according to its probability value.
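The 8:2 division in step S201 can be sketched as a shuffled split. The sketch assumes 500 pictures in total, which is one reading of the text; the file names, the fixed seed and the function name are illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[str], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the items reproducibly and split them into train and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical picture names standing in for the collected symbol diagrams.
pictures = [f"symbol_{i:03d}.jpg" for i in range(500)]
train_set, test_set = split_dataset(pictures)
```

A plain random split does not guarantee that the four diagram types stay balanced across both sets; a per-type split would be needed for that, and the source does not say which was used.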

The symbol graph semantic alignment method of the embodiment comprises the following steps:

step S301: if the diagram is a double-row component symbol diagram, the contents, which comprise pin names and pin numbers, are read in row order within the upper-left and lower-right coordinate range of the component symbol diagram, semantic alignment is performed according to the arrangement rule of character strings and numbers (character string, number and character string), and the results are finally sorted by pin number;

step S302: if the diagram is a quadrilateral component symbol diagram, then within the upper-left and lower-right coordinate range of the component symbol diagram, the Y-axis coordinate of the second row of characters and the X-axis coordinate of the first group of characters in the third row are taken as the upper-left corner of the central pad frame of the quadrilateral; in the same way, the Y-axis coordinate of the last row of characters in the quadrilateral and the X-axis coordinate of the first character to the right of the last row of characters are taken as the lower-right corner of the central pad frame; a rectangle is established from these 2 coordinates and all contents inside it are deleted; semantic alignment is then performed according to the arrangement rule of character strings and numbers (character string, number and character string), the quadrilateral is rotated 90 degrees clockwise and semantic alignment is performed again according to the same arrangement rule, and the results are finally sorted by pin number;

step S303: if the diagram is a coil type BGA component symbol diagram, it is judged whether number and character string labels are present around the periphery of the coil type BGA; if so, the row and column pin numbers are aligned, the pin name in each coil is then read, and the results are sorted by pin number; otherwise, the combinations of character strings and numbers in the coils are read in row order as pin numbers, the corresponding pin names are read on the following line, the pin numbers and names are matched according to their adjacency within the same coil, and the results are sorted;

step S304: if the diagram is a table type BGA component symbol diagram, it is judged whether ordinal numbers and character strings are present on the rows and columns around the periphery of the table type BGA; if so, the character strings are matched with the numbers by row and column within the coordinate range of the diagram to obtain the pin numbers; the pin names of the table type BGA are then read and aligned with the pin numbers; if the table type BGA spans several pages, the pin number characters along the rightmost column and the bottom row are detected, each page is processed in the same way as the previous page, and finally the data of all pages are combined and sorted by pin number.
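For the table type BGA case in step S304, the sketch below combines row labels and column labels into pin numbers, attaches the pin name found in each cell, and merges several pages. It assumes that row labels are letters and column labels are numbers (for example row "A" and column "10" giving pin "A10"); this labelling, the function names and the sample data are illustrative assumptions.

```python
import re
from typing import Dict, List, Tuple


def table_bga_pins(
    row_labels: List[str], col_labels: List[str], names: List[List[str]]
) -> Dict[str, str]:
    """Map pin number (row label + column label) to pin name for one page."""
    pins = {}
    for r, row_label in enumerate(row_labels):
        for c, col_label in enumerate(col_labels):
            name = names[r][c]
            if name:  # empty cells carry no ball
                pins[f"{row_label}{col_label}"] = name
    return pins


def pin_sort_key(pin_number: str) -> Tuple[int, str, int]:
    """Sort by row letters first, then by numeric column (A2 before A10)."""
    match = re.fullmatch(r"([A-Z]+)(\d+)", pin_number)
    if match is None:
        raise ValueError(f"unexpected pin number: {pin_number}")
    letters, digits = match.group(1), int(match.group(2))
    return (len(letters), letters, digits)


def merge_pages(pages: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Combine the pins of all pages and sort them by pin number."""
    merged: Dict[str, str] = {}
    for page in pages:
        merged.update(page)
    return sorted(merged.items(), key=lambda item: pin_sort_key(item[0]))


# Hypothetical table split over two pages by column range.
page1 = table_bga_pins(["A", "B"], ["1", "2"], [["VDD", "GND"], ["D0", ""]])
page2 = table_bga_pins(["A", "B"], ["10", "11"], [["CLK", "RST"], ["D1", "D2"]])
pin_list = merge_pages([page1, page2])
```

Sorting on the numeric value of the column, not on the raw string, avoids placing "A10" before "A2".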

In the OCR recognition method, the characters that appear in component symbol diagrams in PDF format component manuals, including numerals, English letters, Greek characters, symbols describing electrical characteristics and the characters marked on those symbols, are selected as the prepared data; ResNet50 is used as the backbone and combined with DBnet to build a text detection network; the detected visual sequence features are extracted with a Transformer and the characters are predicted using an attention mechanism. Specifically, in this embodiment the method comprises the following steps:

step S401: selecting the characters that appear in component symbol diagrams in PDF format component manuals, including numerals, English letters, Greek characters, symbols describing electrical characteristics and the characters marked on those symbols, as the prepared data. In this embodiment, the data are prepared in 14 fonts and comprise pairwise combinations of the 26 uppercase English letters; the plus sign, the minus sign and the hash sign (#); the 26 uppercase English letters and the digits 0-9 as single characters; the 24 lowercase Greek letters; and character strings marked with an overline or with an underline between 2 characters, for a total of 2029 characters;

Step S402: adopting Resnet50 as the backbone and building a text detection network in combination with DBnet; in this embodiment, the prepared data are divided into a training set and a test set at a ratio of 8:2 and fed into the network for training; in the pixel-segmentation-based text detection network DBnet, a binarization operation is performed on the segmentation area to distinguish text regions from background regions: an empirical threshold for the binarization decision is set first, a pixel whose value is larger than the threshold is regarded as text, and a pixel whose value is smaller than the threshold is regarded as background; at the same time, the width and height of the text are predicted from the number of pixels the text occupies in the segmentation area, and the detection and decision results are fused into a text box;

step S403: extracting the detected visual sequence features with a Transformer and predicting characters with an attention mechanism; specifically, the trained Resnet50 network is called on the text boxes generated in step S402 to classify the text images, and the classified pixel sequence is then taken as the input of a self-attention mechanism; the similarity between each position of the input sequence and every other position is modeled and calculated to construct an attention matrix, which records the attention score obtained by multiplying two vectors; a larger score indicates that the two vectors are highly similar, closely correlated and likely to be output in succession, and the subsequent test character string is predicted according to the attention matrix.
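The threshold decision and text-box fusion of step S402 can be sketched as follows. This is a simplified illustration, not the DBnet implementation: the threshold value 0.3, the function names and the sample probability map are assumptions, and a single text region per map is assumed.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) in pixels


def binarize(prob_map: List[List[float]], threshold: float = 0.3) -> List[List[int]]:
    """Mark a pixel as text (1) if its score exceeds the threshold, else background (0)."""
    return [[1 if p > threshold else 0 for p in row] for row in prob_map]


def text_box(binary: List[List[int]]) -> Optional[Box]:
    """Derive one box from the extent of the text pixels (single region assumed)."""
    coords = [(x, y) for y, row in enumerate(binary) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


# Hypothetical 3x4 segmentation map with one small text region.
prob = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.0, 0.7, 0.6, 0.2],
]
box = text_box(binarize(prob))
print(box)  # (1, 1, 2, 2)
```

A real detector would separate several text regions, for example by connected-component labeling, before computing one box per region.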

In this embodiment, when the PDF format component manual preprocessing module performs preprocessing, a JSON file "symbols_BasicInfo.json" is created and the PDF format component manual is input; the legality of the manual is judged, and the process enters the next step if the manual opens normally, or issues a prompt and exits if it cannot be opened; whether the manual has a watermark is judged, and if so, watermark removal is performed; all plain text pages in the manual are deleted, the remaining pages are searched by keyword to judge whether an editable table type BGA component symbol diagram exists, and the start-stop page numbers of all editable table type BGA symbol diagrams in the manual are obtained; all table pages in the manual are then deleted, the remaining pages are converted into JPG format pictures, and the electronic component symbol diagram classification and recognition method is used for searching and matching to obtain all component symbol diagrams in the manual other than the editable table type BGA component symbol diagrams, determine their types, and locate their page numbers and coordinate ranges; the process returns to the page where each component symbol diagram is located and judges whether the diagram is editable; finally, the symbol map parameters of all component symbol diagrams in the manual, namely the component name, type, editability, start-stop page numbers and coordinate range, are written into the "symbols_BasicInfo.json" file.
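A possible layout of one record in "symbols_BasicInfo.json" is sketched below. The embodiment names the stored items (name, type, editability, start-stop page numbers, coordinate range) but not the JSON keys, so the key names, the helper functions and the sample values are all assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import Any, Dict, List, Tuple


def make_record(
    name: str,
    symbol_type: str,
    editable: bool,
    start_page: int,
    end_page: int,
    bbox: Tuple[float, float, float, float],
) -> Dict[str, Any]:
    """Build one record; the key names are hypothetical."""
    return {
        "name": name,
        "type": symbol_type,
        "editable": editable,
        "pages": [start_page, end_page],
        "bbox": list(bbox),
    }


def write_basic_info(records: List[Dict[str, Any]], path: str) -> None:
    """Write all records to the basic-information file as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)


# Hypothetical component spanning pages 12-14.
records = [make_record("U1", "table_bga", True, 12, 14, (50.0, 80.0, 560.0, 700.0))]
out_path = os.path.join(tempfile.mkdtemp(), "symbols_BasicInfo.json")
write_basic_info(records, out_path)

with open(out_path, encoding="utf-8") as fh:
    loaded = json.load(fh)
print(loaded[0]["pages"])  # [12, 14]
```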

As shown in fig. 3, the preprocessing module of the PDF format component manual specifically includes the following steps:

step S501: inputting a PDF format component manual;

step S502: judging the legality of the PDF format component manual, that is, whether it is a PDF file, whether it can be opened normally, whether it is an encrypted file and whether it is a pure-image PDF file; outputting prompt information if it is not a PDF file or cannot be opened normally, and entering step S503 if it opens normally;

step S503: judging whether the PDF format component manual has a watermark, and if so, going to step S504 for watermark removal;

step S504: removing the watermark in a targeted manner according to its type, and then going to step S505;

step S505: creating a JSON file "symbols_BasicInfo.json", calling the PDF content editing tool PyPDF2 to delete the plain text pages in the PDF format component manual, and searching the remaining pages by the keywords Pin Diagram, Pinout, Ballout, Pin (Ball/Signal) Assignment, Top (Bottom) View and Pin (Ball) Configuration to judge whether an editable table type BGA component symbol diagram exists; if so, obtaining the start-stop page numbers of all editable table type BGA symbol diagrams in the manual and writing the symbol map parameters of the component, namely the name, type and start-stop page numbers, into the file "symbols_BasicInfo.json";

Step S506: calling the PDF content editing tool PyPDF2 to delete the pure table pages in the PDF format component manual processed in step S505, converting all remaining pages into JPG format pictures, using the electronic component symbol diagram classification and recognition method to recognize the types of all component symbol diagrams in the JPG pictures, and locating the page numbers and coordinate positions of the component symbol diagrams in the PDF format component manual;

step S507: returning to the position of each component symbol diagram in the PDF format component manual, calling a PDF content editing tool PyPDF2, and judging whether the component symbol diagram is editable or not;

step S508: writing the symbol map parameters of all component symbol diagrams identified in step S506, namely the name, type, editability, page number and coordinate range, into the file "symbols_BasicInfo.json".
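The keyword search of step S505 can be sketched as below, assuming the text of each page has already been extracted into a list of strings. The regular expression is one reading of the keyword list in step S505, and treating consecutive matching pages as one start-stop range is an assumption; the function names and sample page texts are hypothetical.

```python
import re
from typing import List, Tuple

# One reading of the keywords listed in step S505 (case-insensitive).
KEYWORD_PATTERN = re.compile(
    r"pin\s+diagram|pinout|ballout|(?:pin|ball|signal)\s+assignment"
    r"|(?:top|bottom)\s+view|(?:pin|ball)\s+configuration",
    re.IGNORECASE,
)


def matching_pages(page_texts: List[str]) -> List[int]:
    """Return the 1-based numbers of the pages that contain any keyword."""
    return [i + 1 for i, text in enumerate(page_texts) if KEYWORD_PATTERN.search(text)]


def page_ranges(pages: List[int]) -> List[Tuple[int, int]]:
    """Merge consecutive page numbers into (start, stop) ranges."""
    ranges: List[Tuple[int, int]] = []
    for page in sorted(pages):
        if ranges and page == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], page)
        else:
            ranges.append((page, page))
    return ranges


# Hypothetical page texts of a five-page manual.
texts = [
    "Features",
    "Ball Assignment (Top View)",
    "Ballout continued",
    "Package",
    "Pin Configuration",
]
hits = matching_pages(texts)
spans = page_ranges(hits)
print(hits, spans)  # [2, 3, 5] [(2, 3), (5, 5)]
```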

In this embodiment, when the editable component symbol map parameter extraction module performs parameter extraction, it reads the basic information of an editable component symbol diagram from the "symbols_BasicInfo.json" file;

if the component is a table type BGA, the first page of the table is located in the PDF format component manual, the PDF content extraction tool Pdfplumber is used to read all characters in the table, the symbol graph semantic alignment method is used for semantic alignment and sorting, the subsequent pages of the table type BGA are then processed page by page, all symbol map parameters are sorted by pin number, and finally the symbol map parameters, namely the component name, pin numbers and pin names, are written into the "symbols_parameters.json" file;

if the component is a coil type BGA component, a quadrilateral component or a double-row component, Pdfplumber is used to read all symbol map parameters within the coordinate range obtained in step S506, the symbol graph semantic alignment method is used for semantic alignment and sorting according to the characteristics of the respective component type, semantic logic repair is then performed, and finally the repaired symbol map parameters, namely the component name, pin numbers, pin names and pin directions, are written into the "symbols_parameters.json" file.
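Reading the words inside a coordinate range with the pdfplumber library might look like the sketch below. `read_words` uses pdfplumber's `open`, `Page.crop` and `extract_words`; the helper `group_rows`, the tolerance of 3 points and the sample words are assumptions for illustration and are not the semantic alignment method of the embodiment.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom) in PDF points


def read_words(pdf_path: str, page_number: int, bbox: BBox) -> List[Dict]:
    """Return the words pdfplumber finds inside bbox on a 1-based page number."""
    import pdfplumber  # third-party; imported here so group_rows has no dependency

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]
        return page.crop(bbox).extract_words()


def group_rows(words: List[Dict], tolerance: float = 3.0) -> List[List[str]]:
    """Group words with similar 'top' coordinates into rows, read left to right."""
    rows: List[List[Dict]] = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w["text"] for w in sorted(row, key=lambda w: w["x0"])] for row in rows]


# Hypothetical words in the shape returned by extract_words (other keys omitted).
words = [
    {"text": "VDD", "x0": 40, "top": 10.5},
    {"text": "1", "x0": 10, "top": 10},
    {"text": "2", "x0": 10, "top": 30},
    {"text": "GND", "x0": 40, "top": 30.2},
]
rows = group_rows(words)
print(rows)  # [['1', 'VDD'], ['2', 'GND']]
```

With a real manual, the output of `read_words(path, page, bbox)` would be passed to `group_rows` in place of the sample list.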

As shown in fig. 4, when the parameter extraction module of the editable component symbol chart performs parameter extraction, the method specifically includes the following steps:

step S601: reading the basic information of an editable component symbol diagram from the "symbols_BasicInfo.json" file and judging whether the component symbol diagram is a table type BGA; if so, going to step S602, otherwise going to step S605;

step S602: locating the first page of the table type BGA in the PDF format component manual (a table type BGA is usually a large BGA that needs several pages to express), and then going to step S603;

step S603: extracting the numbering sequences on the four sides of the table (A, B, C, ...; 1, 2, 3, ...) and the ball pin names using the PDF content extraction tool Pdfplumber, performing semantic alignment using the symbol graph semantic alignment method, and then going to step S604;

Step S604: judging whether the next page of the PDF format component manual is a continuation of the table, a table continuing over no more than 4 pages; if so, locating the next page of the PDF format component manual and going to step S603; otherwise, going to step S610;

step S605: judging whether the current component symbol diagram is a coil type BGA, if so, turning to step S606, otherwise, turning to step S607;

step S606: extracting the text content in the component symbol diagram using the PDF content extraction tool Pdfplumber, performing semantic alignment according to the parameter characteristics of the coil type BGA component using the symbol graph semantic alignment method, and then going to step S610;

step S607: judging whether the current symbol diagram is a quadrilateral component; if so, going to step S608, otherwise going to step S609;

step S608: extracting the text content in the component symbol diagram using the PDF content extraction tool Pdfplumber, performing semantic alignment according to the parameter characteristics of the quadrilateral component using the symbol graph semantic alignment method, and then going to step S610;

step S609: extracting the text content in the component symbol diagram using the PDF content extraction tool Pdfplumber, performing semantic alignment according to the parameter characteristics of the double-row component using the symbol graph semantic alignment method, and then going to step S610;

Step S610: performing semantic alignment on the component parameter information (pin number, pin name and pin direction) obtained in steps S604, S606, S608 and S609, performing semantic logic repair (for example, when two identical pin numbers appear or a certain pin number is missing), and finally sorting the parameter information of all pins by pin number;

step S611: writing the component name and the parameter information sorted in step S610 into the file "symbols_parameters.json".
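The two conditions that step S610 gives as examples for semantic logic repair, duplicated and missing pin numbers, together with the final sort, can be sketched as follows. The embodiment does not state how a detected problem is repaired, so this sketch only reports it; the function names, the pin-number pattern and the sample data are assumptions.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def pin_sort_key(pin: str) -> Tuple[int, str, int]:
    """Order 'A1' < 'A2' < 'A10' < 'B1' < 'AA1'; plain numbers sort numerically."""
    match = re.fullmatch(r"([A-Z]*)(\d+)", pin)
    if not match:
        raise ValueError(f"unrecognized pin number: {pin!r}")
    letters, digits = match.groups()
    return (len(letters), letters, int(digits))


def check_pins(pins: List[Tuple[str, str]], expected: List[str]) -> Dict[str, List[str]]:
    """Report pin numbers that occur more than once or not at all."""
    counts = Counter(number for number, _ in pins)
    return {
        "duplicated": sorted((n for n, c in counts.items() if c > 1), key=pin_sort_key),
        "missing": sorted((n for n in expected if n not in counts), key=pin_sort_key),
    }


# Hypothetical extraction result: A2 appears twice and B2 is absent.
raw = [("B1", "CLK"), ("A2", "GND"), ("A1", "VDD"), ("A2", "RESET")]
report = check_pins(raw, expected=["A1", "A2", "B1", "B2"])
ordered = sorted(raw, key=lambda item: pin_sort_key(item[0]))
print(report)  # {'duplicated': ['A2'], 'missing': ['B2']}
```

Sorting by the length of the letter prefix first keeps two-letter rows such as AA after single-letter rows, which is the usual BGA row order.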

In this embodiment, when the table type BGA image type component symbol map parameter identification and extraction module performs identification and extraction, the basic information of a table type BGA image type symbol diagram is read from the "symbols_BasicInfo.json" file; the picture within the coordinate range is first grayed and the color picture is converted into a binary picture, and the characters in the picture are then enhanced by an interpolation method; the table diagram is divided into fine-grained rectangular blocks, and OCR character recognition is performed on each fine-grained rectangular block; whether the next page is a continuation of the component symbol diagram is judged, and if so, the above steps are repeated until the whole diagram has been recognized; the recognition results are semantically aligned and sorted by the symbol graph semantic alignment method, semantic logic repair is then performed, and finally the symbol map parameters, namely the component name, pin numbers and pin names, are written into the "symbols_parameters.json" file.
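The graying and binarization described above can be sketched in plain Python on a small pixel array. The ITU-R BT.601 luma weights, the threshold of 128 and the convention that dark pixels are text are assumptions; the embodiment does not specify them, and a real implementation would normally use an image library.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def to_gray(image: List[List[RGB]]) -> List[List[int]]:
    """Convert RGB pixels to gray levels with ITU-R BT.601 luma weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row] for row in image]


def to_binary(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Mark dark pixels (assumed to be text) as 1 and light pixels as 0."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


# Hypothetical 2x2 color picture: white, black, red and light gray pixels.
picture = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (200, 200, 200)],
]
gray = to_gray(picture)
binary = to_binary(gray)
print(binary)  # [[0, 1], [1, 0]]
```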

As shown in fig. 5, when the recognition and extraction module of the table type BGA image class symbol map parameters performs recognition and extraction, the method specifically includes the following steps:

step S701: reading the basic information of a table type BGA image type symbol diagram from the "symbols_BasicInfo.json" file, graying one page of the diagram (which is usually in color and may span several pages), converting it into a binary image, and then going to step S702;

step S702: performing image enhancement on the binary image; in this embodiment, an erosion and dilation algorithm is used to filter out noise so that the character boundaries of the text image become clear; then going to step S703;

step S703: dividing the table type BGA picture into a plurality of rectangular blocks, in this embodiment 1×1 rectangular blocks, and then going to step S704;

step S704: performing OCR (optical character recognition) on each rectangular block obtained in the step S703, and then turning to the step S705;

step S705: performing semantic alignment on the character information obtained by recognition by using a symbol graph semantic alignment method, and then turning to step S706;

step S706: judging whether the next page of the PDF format component manual is a continuation of the diagram and whether the number of continuation pages is smaller than 4; if so, going to step S707; otherwise, going to step S708;

Step S707: positioning the next page of the PDF format component manual, and turning to step S701;

step S708: performing semantic alignment on the component parameter information (pin number and pin name) obtained in step S706, performing semantic logic repair, and finally sorting by pin number;

step S709: writing the component name and the parameter information sorted in step S708 into the file "symbols_parameters.json".
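The block division of step S703 can be sketched as cutting the table picture into a grid so that each block can be passed to OCR on its own. Equal-sized cells and known row and column counts are assumptions made for this sketch; a real table would more likely be cut along detected ruling lines. The function name and sample image are hypothetical.

```python
from typing import List

Image = List[List[int]]


def split_blocks(image: Image, n_rows: int, n_cols: int) -> List[List[Image]]:
    """Cut an image into n_rows x n_cols rectangular blocks of (nearly) equal size."""
    height, width = len(image), len(image[0])
    row_edges = [round(i * height / n_rows) for i in range(n_rows + 1)]
    col_edges = [round(j * width / n_cols) for j in range(n_cols + 1)]
    return [
        [
            [row[col_edges[j]:col_edges[j + 1]] for row in image[row_edges[i]:row_edges[i + 1]]]
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]


# Hypothetical 4x6 image whose pixel values encode their own position.
image = [[r * 10 + c for c in range(6)] for r in range(4)]
blocks = split_blocks(image, n_rows=2, n_cols=3)
print(blocks[0][0])  # [[0, 1], [10, 11]]
print(blocks[1][2])  # [[24, 25], [34, 35]]
```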

In this embodiment, when the non-table type BGA image type component symbol map parameter identification and extraction module performs identification and extraction, the basic information of a non-table type BGA image type symbol diagram is read from the "symbols_BasicInfo.json" file; OCR recognition is performed to recognize all characters in the image type symbol diagram; the recognition results are semantically aligned and sorted by the symbol graph semantic alignment method, logic repair is then performed, and finally the symbol map parameters, namely the component name, pin numbers, pin names and pin directions, are written into the "symbols_parameters.json" file.

As shown in fig. 6, when the recognition and extraction module of the non-form BGA image class symbol map parameters performs recognition and extraction, the method specifically includes the following steps:

Step S801: reading the basic information of a non-table type BGA image type symbol diagram from the "symbols_BasicInfo.json" file, and performing OCR recognition to recognize the text content in the component symbol diagram;

step S802: judging whether the component is a coil type BGA, if so, turning to step S803, otherwise, turning to step S804;

step S803: using a symbol graph semantic alignment method, performing semantic alignment and sequencing on the content obtained in the step S801 according to the parameter characteristics of the coil type BGA component, performing semantic logic repair, and then turning to the step S807;

step S804: judging whether the device is a quadrilateral component, if so, turning to step S805, otherwise, turning to step S806;

step S805: using the symbol graph semantic alignment method, performing semantic alignment and sorting on the content obtained in step S801 according to the parameter characteristics of the quadrilateral component, performing semantic logic repair, and then going to step S807;

step S806: using the symbol graph semantic alignment method, performing semantic alignment and sorting on the content obtained in step S801 according to the parameter characteristics of the double-row component, performing semantic logic repair, and then going to step S807;

step S807: writing the component name and the parameter information sorted in steps S803, S805 and S806 into the file "symbols_parameters.json".
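After OCR, each recognized token carries a position, and pin numbers have to be paired with pin names. The sketch below pairs every numeric token with its nearest non-numeric token, which is a simplified stand-in for the symbol graph semantic alignment method and not the method itself. The token layout `(text, x, y)`, the function name and the sample four-pin double-row component are assumptions.

```python
import math
from typing import List, Tuple

Token = Tuple[str, float, float]  # (text, x_center, y_center) from OCR


def pair_pins(tokens: List[Token]) -> List[Tuple[int, str]]:
    """Pair each numeric token with the nearest non-numeric token, sorted by pin number."""
    numbers = [t for t in tokens if t[0].isdigit()]
    names = [t for t in tokens if not t[0].isdigit()]
    if not names:
        return []
    pairs = []
    for text, x, y in numbers:
        nearest = min(names, key=lambda n: math.hypot(n[1] - x, n[2] - y))
        pairs.append((int(text), nearest[0]))
    return sorted(pairs)


# Hypothetical double-row component: pins 1-2 on the left, pins 3-4 on the right.
tokens = [
    ("1", 10, 10), ("VDD", 30, 10),
    ("2", 10, 30), ("GND", 30, 30),
    ("4", 110, 10), ("OUT", 90, 10),
    ("3", 110, 30), ("IN", 90, 30),
]
pairs = pair_pins(tokens)
print(pairs)  # [(1, 'VDD'), (2, 'GND'), (3, 'IN'), (4, 'OUT')]
```

Nearest-neighbor pairing can assign one name to two pin numbers when the layout is crowded, which is one situation the semantic logic repair step is meant to catch.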

Claims

1. A method for automatically extracting symbol map parameters from a PDF format component manual, characterized by comprising a PDF format component manual preprocessing module, an editable component symbol map parameter extraction module, a table type BGA image type component symbol map parameter identification and extraction module, a non-table type BGA image type component symbol map parameter identification and extraction module and a symbol map parameter format conversion module;

the PDF format component manual preprocessing module adopts keyword positioning and electronic component symbol diagram classification recognition methods to extract all component symbol diagrams in the PDF format component manual, recognize the types of the component symbol diagrams, judge whether the component symbol diagrams can be edited or not and position page numbers and coordinate ranges of the component symbol diagrams;

the editable component symbol map parameter extraction module, for an editable component symbol diagram, uses the PDF content extraction tool Pdfplumber to extract symbol map parameters from the page numbers and coordinate ranges located by the PDF format component manual preprocessing module, then uses the symbol graph semantic alignment method to perform semantic alignment and sorting on the symbol map parameters, and finally performs semantic logic repair;

2. The method for automatically extracting the symbol map parameters from the PDF format component manual of claim 1, wherein: the electronic component symbol diagram classifying and identifying method comprises the steps of determining the type number of a component symbol diagram, and selecting double-row type, quadrilateral, coil type BGA and table type BGA component pictures from a PDF format component manual as a data set; and adopting Resnet50 as a main structure of the classifier, building a classification network, and training the network by using the data set to obtain a classification model containing weight parameters.

3. The method for automatically extracting the symbol map parameters from the PDF format component manual of claim 1, wherein: the symbol graph semantic alignment method comprises the following steps:

if the coil type BGA component symbol diagram is, judging whether the periphery of the coil type BGA has marks of numbers and character strings, if so, aligning row and column pin numbers, then reading the pin names of the coils and sequencing according to the pin numbers; otherwise, sequentially reading character strings and digital combinations in the coils as pin numbers according to a row sequence, sequentially reading corresponding pin names by line feed, sequentially matching the pin numbers and the names according to the adjacent relation between the pin numbers and the names of the same coil, and sequencing;

4. The method for automatically extracting the symbol map parameters from the PDF format component manual of claim 1, wherein: when OCR recognition is carried out, characters appearing in component symbol diagrams in PDF format component manuals, including numbers, English letters, Greek characters, descriptive symbols of electrical characteristics and the characters marked on those symbols, are selected as preparation data; Resnet50 is adopted as the backbone, and a text detection network is built in combination with DBnet; the detected visual sequence features are extracted using a Transformer, and characters are predicted using an attention mechanism.

5. The method for automatically extracting the symbol map parameters from the PDF format component manual of claim 1, wherein: when the PDF format component manual preprocessing module performs preprocessing, the PDF format component manual is input and its legality is judged; the process enters the next step if the manual opens normally, and issues a prompt and exits if it cannot be opened; whether the PDF format component manual has a watermark is judged, and if so, watermark removal is performed; all plain text pages in the PDF format component manual are deleted, the remaining pages are searched by keyword to judge whether an editable table type BGA component symbol diagram exists, and the start-stop page numbers of all editable table type BGA symbol diagrams in the manual are obtained; a JSON file "symbols_BasicInfo.json" is created, all table pages in the PDF format component manual are then deleted, the remaining pages are converted into JPG format pictures, and the electronic component symbol diagram classification and recognition method is used for searching and matching to obtain all component symbol diagrams in the manual other than the editable table type BGA component symbol diagrams, determine the types of the component symbol diagrams, and locate their page numbers and coordinate ranges; the process returns to the page where each component symbol diagram is located and judges whether the component symbol diagram is editable; the symbol map parameters of all component symbol diagrams in the manual, namely the component name, type, editability, start-stop page numbers and coordinate range, are written into the "symbols_BasicInfo.json" file.

6. The method for automatically extracting the symbol map parameters from the PDF format component manual of claim 5, wherein: when the editable component symbol map parameter extraction module performs parameter extraction, the basic information of an editable component symbol diagram is read from the "symbols_BasicInfo.json" file;

7. The method for automatically extracting the symbol map parameters from the PDF format component manual of claim 6, wherein: when the table type BGA image type component symbol map parameter identification and extraction module performs identification and extraction, the basic information of a table type BGA image type symbol diagram is read from the "symbols_BasicInfo.json" file; the picture within the coordinate range is first grayed and the color picture is converted into a binary picture, and the characters in the picture are then enhanced by an interpolation method; the table diagram is divided into fine-grained rectangular blocks, and OCR character recognition is performed on each fine-grained rectangular block; whether the next page is a continuation of the component symbol diagram is judged, and if so, the above steps are repeated until the whole diagram has been recognized; the recognition results are semantically aligned and sorted by the symbol graph semantic alignment method, semantic logic repair is then performed, and finally the symbol map parameters, namely the component name, pin numbers and pin names, are written into the "symbols_parameters.json" file.

8. The method for automatically extracting the symbol map parameters from the PDF format component manual of claim 1, wherein: when the non-table type BGA image type component symbol map parameter identification and extraction module performs identification and extraction, the basic information of a non-table type BGA image type symbol diagram is read from the "symbols_BasicInfo.json" file; OCR recognition is performed to recognize all characters in the image type symbol diagram; the recognition results are semantically aligned and sorted by the symbol graph semantic alignment method, logic repair is then performed, and finally the symbol map parameters, namely the component name, pin numbers, pin names and pin directions, are written into the "symbols_parameters.json" file.