CN115422884A - Method, system, equipment and storage medium for processing beacon data

Info

Publication number: CN115422884A
Application number: CN202210976758.3A
Authority: CN
Inventors: 余戈磊; 莫华瞻
Original assignee: Guangzhou Zhongcheng Big Data Technology Co ltd
Current assignee: Guangzhou Zhongcheng Big Data Technology Co ltd
Priority date: 2022-08-15
Filing date: 2022-08-15
Publication date: 2022-12-02

Abstract

The invention discloses a method, a system, a device and a storage medium for processing beacon data. The method comprises the following steps: acquiring a text to be processed; performing reordering preprocessing on the text to be processed to determine a reordered text and text ordering values; inputting the reordered text into a pre-trained field entity extraction model to determine field entities; sorting the field entities according to the text ordering values to determine entity key-value pairs; and performing data connection processing on the entity key-value pairs according to a layout classification model to determine formatted signaling data. Embodiments of the invention can convert unformatted data into formatted data, improve data processing efficiency, and can be widely applied in the technical field of data processing.

Description

Method, system, equipment and storage medium for processing beacon data

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, a system, a device, and a storage medium for processing beacon data.

Background

Bid invitation and bidding information includes key information such as the bid-inviting unit, the bid-winning unit, and the name and price of the winning bid, and these data have great reference value for medical equipment suppliers, medical institutions and manufacturers. However, this information is not always published as formatted data. Because the layouts of bidding texts are diverse, it is very difficult to extract designated data from bidding data using related methods; in particular, it is difficult to extract data entities and to match each entity with the key of its corresponding field. Therefore, how to convert unformatted data into formatted data has become a technical problem that urgently needs to be solved.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method, a system, a device and a storage medium for processing beacon data, so as to implement data normalization processing.

In one aspect, the present invention provides a method for processing beacon data, including:

acquiring a text to be processed;

reordering the texts to be processed, and determining reordered texts and text ordering values;

inputting the reordered text into a pre-trained field entity extraction model to determine a field entity;

sorting the field entities according to the text sorting values to determine entity key value pairs;

and performing data connection processing on the entity key value pair according to the layout classification model to determine formatted signaling data.

Optionally, the obtaining the text to be processed includes:

crawling the webpage source codes according to a crawler technology to determine webpage data;

and performing format conversion processing on the webpage data to determine a text to be processed.

Optionally, performing reordering preprocessing on the text to be processed to determine the reordered text includes:

carrying out coordinate acquisition processing on the text to be processed, and determining text element coordinates, wherein the text element coordinates comprise a text element abscissa and a text element ordinate;

grouping the texts to be processed according to the text element vertical coordinates to determine text element grouping;

sorting the text element groups in descending order of the text element ordinate to determine a first sorted text;

and sorting the first sorted text in ascending order of the text element abscissa to determine the reordered text.

Optionally, the reordering the text to be processed to determine a text ordering value includes:

performing label analysis processing on the text to be processed to determine a text list;

and comparing the text list with the reordered text line by line, and determining a text ordering value.

Optionally, the inputting the reordered text into a pre-trained field entity extraction model, and determining a field entity includes:

acquiring a training data set;

performing text labeling processing on the training data set to determine a labeled data set;

carrying out named entity sequence annotation deep learning on the annotation data set, and determining the field entity extraction model;

and inputting the reordered text into the field entity extraction model to determine the field entity.

Optionally, the sorting the field entities according to the text sorting value to determine entity key-value pairs includes:

the field entity comprises an entity header and an entity value;

matching the field entities with the text sorting values to determine the field entity sorting values;

sorting the field entities according to the field entity sorting value to determine a sorting set;

and matching the entity heads and the entity values in the sorting set to determine entity key value pairs.

Optionally, the performing data connection processing on the entity key-value pair according to the layout classification model to determine formatted signaling data includes:

determining a data alignment rule according to the layout classification model;

and performing data connection on the entity key value pair according to the data alignment rule to determine formatted signaling data.

On the other hand, the embodiment of the invention also discloses a beacon signal data processing system, which comprises:

the first module is used for acquiring a text to be processed;

the second module is used for performing reordering preprocessing on the text to be processed and determining a reordered text and a text ordering value;

the third module is used for inputting the reordered text into a pre-trained field entity extraction model and determining a field entity;

a fourth module, configured to perform sorting processing on the field entities according to the text sorting values, and determine an entity key value pair;

and the fifth module is used for performing data connection processing on the entity key value pair according to the layout classification model and determining formatted signaling data.

On the other hand, the embodiment of the invention also discloses an electronic device, which comprises a processor and a memory;

the memory is used for storing programs;

the processor executes the program to implement the method as described above.

On the other hand, the embodiment of the invention also discloses a computer readable storage medium, wherein the storage medium stores a program, and the program is executed by a processor to realize the method.

In another aspect, an embodiment of the present invention further discloses a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and the computer instructions executed by the processor cause the computer device to perform the foregoing method.

Compared with the prior art, the technical solution adopted by the invention has the following technical effects. The embodiment of the invention acquires a text to be processed, performs reordering preprocessing on it, and determines a reordered text and text ordering values. The reordered text is then input into a pre-trained field entity extraction model to determine field entities; the reordered text helps save model training time, and the field entity extraction model captures the text information better. Furthermore, the field entities are sorted according to the text ordering values to determine entity key-value pairs, and data connection processing is performed on the entity key-value pairs according to the layout classification model to determine formatted signaling data. Unformatted data is thereby converted into formatted data, and data processing efficiency is improved.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.

Fig. 1 is a flowchart of a beacon data processing method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

Referring to fig. 1, an embodiment of the present invention provides a beacon data processing method, including:

S101, acquiring a text to be processed;

S102, performing reordering preprocessing on the text to be processed, and determining a reordered text and a text ordering value;

S103, inputting the reordered text into a pre-trained field entity extraction model to determine a field entity;

S104, sorting the field entities according to the text ordering values, and determining entity key-value pairs;

S105, performing data connection processing on the entity key-value pairs according to the layout classification model, and determining formatted signaling data.

Specifically, the embodiment of the invention first acquires a text to be processed, where the text to be processed is a text containing beacon data, that is, bid invitation and bidding information. The embodiment then performs reordering preprocessing on the text to be processed to obtain a reordered text and text ordering values. Next, the reordered text is input into a pre-trained field entity extraction model to obtain the extracted field entities; in this embodiment, a BERT + CRF model (a pre-trained language model plus a conditional random field) is used as the field entity extraction model. The extracted field entities are then sorted according to the text ordering values, and entity key-value pairs are obtained by matching within the sorted field set. Finally, data connection processing is performed on the entity key-value pairs according to the data alignment rule in the customized layout classification model, so as to obtain formatted signaling data.

Further as a preferred embodiment, the acquiring the text to be processed includes:

Specifically, in this embodiment, the web page source code of the original beacon (bid announcement) text is crawled and stored in a local database. When the data is read, the source code data is format-converted in memory, according to the structural characteristics of the web page tags, using tools such as the pdfkit package and the io package. The format conversion proceeds as follows: first, the hypertext markup language (HTML) is converted into a byte stream; then the byte stream is converted into a Portable Document Format (PDF) file; finally, the text to be processed is obtained.
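The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the embodiment's actual code: the function name `html_to_pdf_stream` and the injectable `convert` parameter are assumptions introduced here, and the default path assumes that the `pdfkit` package and its `wkhtmltopdf` dependency are installed. The source does not spell out the exact call sequence, so this is only one plausible reading of it.

```python
import io


def html_to_pdf_stream(html, convert=None):
    """Render an HTML string to PDF and return it as an in-memory byte stream.

    `convert` is a hypothetical hook (str -> bytes) so the function can be
    exercised without wkhtmltopdf; by default pdfkit is used.
    """
    if convert is None:
        import pdfkit  # needs the wkhtmltopdf binary on the system

        def convert(source):
            # Passing False as the output path makes pdfkit return the PDF bytes.
            return pdfkit.from_string(source, False)

    pdf_bytes = convert(html)
    # Keep the PDF in memory instead of writing a temporary file.
    return io.BytesIO(pdf_bytes)
```

Holding the result in an `io.BytesIO` object keeps the whole conversion in memory, which is consistent with the in-memory processing the embodiment describes.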

Further as a preferred embodiment, the reordering preprocessing the text to be processed to determine the reordered text includes:

Specifically, the embodiment obtains the coordinates of the text elements in the text to be processed and sorts the text elements by these coordinates. The text element coordinates comprise an abscissa (x) and an ordinate (y). The text to be processed is grouped according to the text element ordinate to obtain text element groups, and the number of text element groups equals the number of distinct text element ordinates. The embodiment then sorts the text element groups in descending order of the text element ordinate to obtain a first sorted text, which corresponds to reading from top to bottom. Finally, the embodiment sorts the text elements in the first sorted text in ascending order of their abscissa to obtain the reordered text, which corresponds to reading from left to right.
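The grouping and two-level sort can be illustrated with a short sketch. It assumes that each text element is available as a hypothetical `(text, x, y)` tuple in PDF-style coordinates, where a larger y lies higher on the page; the function name `reorder_elements` is likewise an assumption.

```python
from itertools import groupby


def reorder_elements(elements):
    """Group (text, x, y) elements into rows by y, then order each row by x."""
    # Larger y first, which corresponds to reading from top to bottom.
    by_row = sorted(elements, key=lambda e: -e[2])
    rows = []
    for _, group in groupby(by_row, key=lambda e: e[2]):
        # Smaller x first, which corresponds to reading from left to right.
        rows.append(sorted(group, key=lambda e: e[1]))
    return rows


elements = [
    ("aaa", 120, 700),
    ("Quantity", 50, 680),
    ("Brand", 50, 700),
    ("10", 120, 680),
]
rows = reorder_elements(elements)
reading_order = [text for row in rows for text, _, _ in row]
```

This sketch groups only elements whose ordinates are exactly equal. Real PDF output often needs a small tolerance when comparing ordinates, which the source does not discuss.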

Further as a preferred embodiment, the performing reordering preprocessing on the text to be processed and determining a text ordering value includes:

and comparing the text list with the reordered texts line by line, and determining a text ordering value.

Specifically, in the embodiment of the present invention, when the format of the web page data is converted, the web page data of the text to be processed is also retained. The text content in the <table> tags (and the <tr> and <td> tags inside them), the <li> tags and the <span> tags of the web page data (HTML-format text) being read is parsed using a tool such as the lxml.etree package, and the text in each tag is added to the text list. The embodiment then compares the text list with the reordered text line by line. It should be noted that the reordered text is first given a preliminary sort, and the text sorting results are recorded in sequence according to the order of the reordered text. These text sorting results are then corrected while the text list is compared with the reordered text line by line, which yields the final text ordering values. The list text elements in the text list are compared line by line with the text elements in the reordered text. If one list text contains a plurality of reordered texts, the text sorting result of the first of those reordered texts is assigned to the list text, and the list text is added to a first list. If one reordered text contains a plurality of list texts, new ordering values are generated in sequence for those list texts, the list texts are added to the first list in sequence, the number of generated ordering values is recorded, and the ordering value of the next reordered text is corrected accordingly. Proceeding in this way gives each parsed text a unique ordering value, and these values are the text ordering values. The reason for this design is that text parsed from HTML is generally not fragmented but has no sequence features, whereas text parsed from PDF has sequence features but may be fragmented. Therefore, by combining the two methods, a text sequence with relatively complete sentences and sequence features can be obtained.
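The following sketch illustrates the idea of combining the two sources. It makes several simplifying assumptions: the standard-library `html.parser` is used as a stand-in for `lxml.etree`, the names `TagTextCollector` and `assign_order_values` are hypothetical, texts are assumed to be distinct, and a running counter replaces the record-and-correct bookkeeping described above while producing the same relative order.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collects the text inside each <td>, <li> and <span> element."""

    TAGS = {"td", "li", "span"}

    def __init__(self):
        super().__init__()
        self.texts = []   # finished tag texts, in document order
        self._open = []   # one text buffer per currently open tag

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self._open.append([])

    def handle_data(self, data):
        if self._open:
            self._open[-1].append(data)

    def handle_endtag(self, tag):
        if tag in self.TAGS and self._open:
            # Collapse whitespace and skip empty cells.
            text = " ".join("".join(self._open.pop()).split())
            if text:
                self.texts.append(text)


def assign_order_values(list_texts, reordered):
    """Give each list text a unique ordering value taken from the reading order."""
    order = {}      # index into list_texts -> ordering value
    next_value = 0
    for fragment in reordered:
        # Case 1: the fragment contains one or more complete list texts.
        inside = sorted(
            (fragment.find(text), i)
            for i, text in enumerate(list_texts)
            if i not in order and text in fragment
        )
        if inside:
            for _, i in inside:
                order[i] = next_value
                next_value += 1
            continue
        # Case 2: the fragment is a piece of a longer list text; only the
        # first fragment of that text assigns its ordering value.
        for i, text in enumerate(list_texts):
            if fragment in text:
                if i not in order:
                    order[i] = next_value
                    next_value += 1
                break
    return sorted((value, list_texts[i]) for i, value in order.items())
```

Given the list texts parsed from HTML and the fragments in PDF reading order, `assign_order_values` returns the complete list texts paired with ordering values. Complete sentences come from the HTML side and the sequence comes from the PDF side.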

Further as a preferred embodiment, the inputting the reordered text into a pre-trained field entity extraction model, and determining a field entity includes:

acquiring a training data set;

The embodiment of the invention collects beacon (bid announcement) content from multiple platforms and channels, and selects part of the beacon files as a training data set. The training data set is labeled with a text labeling tool to obtain a labeled data set for model training. Finally, named entity sequence labeling deep learning is carried out with a BERT + CRF model to obtain the field entity extraction model. This embodiment selects the combined scheme of "text reordering preprocessing + BERT + CRF" to extract the information of the bid-winning product from the beacon text. Compared with labeling text sequences directly with other artificial neural network models (such as RNN, LSTM, GRU, BiLSTM, etc.), this scheme has the following advantages: 1. strong pre-trained weights are available, and downstream training by further fine-tuning works well and saves a large amount of training time; 2. the non-recurrent structure computes faster; 3. the attention mechanism provides better feature extraction, helps the model understand text information efficiently, and effectively avoids the vanishing gradient problem that easily arises in deep recurrent neural networks, so that the efficiency of text sequence labeling is significantly improved.
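A CRF layer on top of BERT chooses the best-scoring label sequence as a whole rather than picking each token's label independently. The sketch below shows only that decoding step (the Viterbi algorithm) in plain Python. The label names and all scores are made up for illustration; in a real model the emission scores would come from BERT and the transition scores would be learned during training.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label sequence for a linear-chain CRF.

    emissions:   one dict per token, mapping label -> score
    transitions: dict mapping (previous_label, label) -> score (default 0.0)
    """
    if not emissions:
        return []
    score = {lab: emissions[0][lab] for lab in labels}
    back_pointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = {}, {}
        for cur in labels:
            # Best previous label for reaching `cur` at position t.
            best_prev = max(
                labels, key=lambda p: score[p] + transitions.get((p, cur), 0.0)
            )
            new_score[cur] = (
                score[best_prev]
                + transitions.get((best_prev, cur), 0.0)
                + emissions[t][cur]
            )
            pointers[cur] = best_prev
        score = new_score
        back_pointers.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda lab: score[lab])]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-VAL", "I-VAL"]
# Hypothetical scores for the two tokens "dental" and "handpiece".
emissions = [
    {"O": 1.0, "B-VAL": 0.9, "I-VAL": 0.0},
    {"O": 0.0, "B-VAL": 0.0, "I-VAL": 2.0},
]
# Penalise the invalid move from outside an entity straight into its interior.
transitions = {("O", "I-VAL"): -10.0}
decoded = viterbi_decode(emissions, transitions, labels)
```

Taking the best label per token here would give the invalid sequence O, I-VAL. The transition penalty instead steers decoding to B-VAL, I-VAL, which is the kind of label-consistency constraint a CRF layer contributes.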

Further as a preferred embodiment, the sorting the field entities according to the text sorting value to determine entity key-value pairs includes:

the field entity comprises an entity header and an entity value;

Specifically, the embodiment uses the BERT + CRF model to perform sequence labeling on the reordered text, that is, chunk extraction of named entities, so as to obtain entity headers (e.g. "title name", "bidding unit address") and entity values (e.g. "dental handpiece", "No. 104, xx Middle Road, xx District, xx City, xx Province") respectively. The embodiment then matches the field entities with the text ordering values to obtain the ordering value corresponding to each field entity, and sorts the field entities by these values to obtain a sorted set. Finally, adjacent entity headers and entity values in the sorted set are matched to obtain entity key-value pairs.
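The sorting and adjacent matching can be sketched as below. The representation is an assumption: each field entity is a hypothetical `(kind, text)` tuple with `kind` being `"header"` or `"value"`, ordering values are looked up by exact text match, and the function name `pair_entities` is illustrative only.

```python
def pair_entities(entities, order_values):
    """Sort field entities by ordering value and pair each header with the
    value that directly follows it."""
    ranked = sorted(
        (e for e in entities if e[1] in order_values),
        key=lambda e: order_values[e[1]],
    )
    pairs = []
    pending_header = None
    for kind, text in ranked:
        if kind == "header":
            pending_header = text
        elif pending_header is not None:
            pairs.append((pending_header, text))
            pending_header = None
    return pairs


entities = [
    ("value", "dental handpiece"),
    ("header", "bidding unit address"),
    ("header", "title name"),
    ("value", "No. 104, xx Middle Road"),
]
order_values = {
    "title name": 0,
    "dental handpiece": 1,
    "bidding unit address": 2,
    "No. 104, xx Middle Road": 3,
}
pairs = pair_entities(entities, order_values)
```

A value that is not preceded by a header is dropped in this sketch; how the embodiment treats such cases is not stated in the source.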

As a further preferred embodiment, the performing data connection processing on the entity key-value pairs according to the layout classification model to determine formatted signaling data includes:

determining a data alignment rule according to the layout classification model;

Specifically, in the present embodiment, a data alignment rule is set for the obtained entity headers and entity values by using the layout classification model, and the content is organized as a set of {field header: field entity} key-value pairs, which ultimately forms a product-specific data chain, i.e., formatted product data (e.g., {'brand name': 'pacemaker'}, {'brand': 'aaa'}, {'model': 'y210'}, {'quantity': '10'}, {'unit price': 'xxx yuan'}, {'bid amount': 'xxxx yuan'}, {'supplier': 'xxxxx company limited'}).
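The source does not specify what the data alignment rules look like, so the sketch below uses one purely hypothetical rule to show the connection step: key-value pairs are merged into one record per product, and a new record starts whenever a header repeats. The function name `connect_pairs` is also an assumption.

```python
def connect_pairs(pairs):
    """Join (header, value) pairs into per-product records.

    Assumed rule (not from the source): a repeated header marks the start
    of the next product's data chain.
    """
    records, current = [], {}
    for header, value in pairs:
        if header in current:
            records.append(current)
            current = {}
        current[header] = value
    if current:
        records.append(current)
    return records


pairs = [
    ("brand", "aaa"),
    ("model", "y210"),
    ("quantity", "10"),
    ("brand", "bbb"),
    ("model", "z300"),
]
records = connect_pairs(pairs)
```

In practice the layout classification model would presumably select a different rule for each page layout; the single rule here only shows how key-value pairs can be linked into formatted records.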

Corresponding to the method of fig. 1, an embodiment of the present invention further provides a beacon data processing system, including:

a first module, configured to acquire a text to be processed;

a second module, configured to perform reordering preprocessing on the text to be processed, and determine a reordered text and a text ordering value;

a third module, configured to input the reordered text into a pre-trained field entity extraction model, and determine a field entity;

a fourth module, configured to sort the field entities according to the text ordering values, and determine entity key-value pairs;

and a fifth module, configured to perform data connection processing on the entity key-value pairs according to the layout classification model, and determine formatted signaling data.

Corresponding to the method of fig. 1, an embodiment of the present invention further provides an electronic device, including a processor and a memory; the memory is used for storing programs; the processor executes the program to implement the method as described above.

Corresponding to the method of fig. 1, the embodiment of the present invention further provides a computer-readable storage medium, which stores a program, and the program is executed by a processor to implement the method as described above.

The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.

In summary, the embodiments of the present invention have the following advantages: reordering the text to be processed reduces the complexity of extracting field entities with the field entity extraction model; meanwhile, extracting field entities with a BERT + CRF model saves training time, and its attention mechanism provides better feature extraction, helps the model understand text information efficiently, and improves the efficiency of text sequence labeling, so that field entities can be extracted better.

In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of larger operations are performed independently.

Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be understood that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is to be determined from the appended claims along with their full scope of equivalents.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, or the part of it that contributes over the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Further, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

In the description of the specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for processing beacon data, comprising:

acquiring a text to be processed;

reordering the text to be processed, and determining reordered text and a text ordering value;

2. The method of claim 1, wherein the obtaining the text to be processed comprises:

3. The method of claim 1, wherein the performing a reordering pre-process on the text to be processed to determine reordered text comprises:

grouping the texts to be processed according to the vertical coordinates of the text elements to determine text element groups;

and sequencing the first sequencing text from small to large according to the abscissa of the text element, and determining the reordered text.

4. The method according to any one of claims 1 or 2, wherein the performing a reordering pre-process on the text to be processed to determine a text ordering value comprises:

5. The method of claim 1, wherein said entering said re-ordered text into a pre-trained field entity extraction model to determine field entities comprises:

acquiring a training data set;

carrying out named entity sequence annotation deep learning on the annotation data set, and determining the field entity extraction model; and inputting the reordered text into the field entity extraction model to determine the field entity.

6. The method of claim 1, wherein said sorting said field entities according to said text sorting values to determine entity key-value pairs comprises:

the field entity comprises an entity header and an entity value;

matching the field entities with the text sorting value to determine the field entity sorting value;

and matching the entity heads and the entity values in the sorting set to determine the entity key value pairs.

7. The method according to claim 1, wherein said performing data connection processing on said entity key-value pairs according to a layout classification model to determine formatted signaling data comprises:

determining a data alignment rule according to the layout classification model;

8. A beacon data processing system, comprising:

the first module is used for acquiring a text to be processed;

9. An electronic device comprising a processor and a memory;

the memory is used for storing programs;

the processor executing the program realizes the method according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the storage medium stores a program, which is executed by a processor to implement the method according to any one of claims 1-7.