CN113076746B - Data processing method and system, storage medium and computing device - Google Patents
- Publication number: CN113076746B (application CN202010010139.XA)
- Authority: CN (China)
- Prior art keywords: granularity, original, geographical area, address text, target
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
Abstract
The application discloses a data processing method and system, a storage medium and a computing device. The method comprises the following steps: acquiring an original address text, wherein an error exists in the original address text; performing word segmentation processing on the original address text by using a trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area; processing the original first granularity geographical area by using a trained text generation model to obtain a target first granularity geographical area, wherein the target first granularity geographical area is correct; and generating a target address text based on the target first granularity geographical area and the original second granularity geographical area. The application solves the technical problem in the related art that, because data processing is implemented by searching an address library for clean address text, processing accuracy is low when the address library does not contain the corresponding clean address text.
Description
Technical Field
The present application relates to the field of data processing, and in particular, to a data processing method and system, a storage medium, and a computing device.
Background
In the logistics industry, a user can upload address information to the cloud by photographing it. However, because natural scenes are complex and contain noise such as uneven illumination and creases, OCR (Optical Character Recognition) inevitably produces errors, and a high address error rate seriously affects the user experience.
To address this problem, for address text recognized by OCR, a clean sub-address (i.e., a sub-address without errors) contained in the original address text may first be searched for in an address library, and the erroneous portion of the address preceding that sub-address may then be completed and corrected based on the clean sub-address. However, errors following the sub-address cannot be corrected, and if the original address text contains no clean sub-address, no error correction can be performed at all.
No effective solution has yet been proposed for the problem in the related art that, because data processing is implemented by searching an address library for clean address text, processing accuracy is low when the address library does not contain the corresponding clean address text.
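The limitation of the lookup-based approach described above can be illustrated with a minimal Python sketch. The address library contents, the function name `correct_by_lookup`, and the clean form "Zhejiang Hangzhou" are hypothetical and chosen only for illustration; the related art is not specified at this level of detail.

```python
from typing import Optional

# Hypothetical address library: each entry pairs a clean administrative
# prefix with a clean sub-address that may appear inside a noisy text.
ADDRESS_LIBRARY = [
    ("Zhejiang Hangzhou", "Wulin square"),
]


def correct_by_lookup(noisy: str) -> Optional[str]:
    """Lookup-style correction: find a clean sub-address, fix what precedes it."""
    for prefix, sub_address in ADDRESS_LIBRARY:
        pos = noisy.find(sub_address)
        if pos != -1:
            # Text before the clean sub-address is replaced by the library
            # prefix; text after it is copied unchanged, errors included.
            return f"{prefix} {noisy[pos:]}"
    # No clean sub-address was found, so nothing can be corrected.
    return None


print(correct_by_lookup("Zhehong Kangzhou Wulin square"))
print(correct_by_lookup("Zhehong Kangzhou Wulim square"))
```

In this sketch the second call returns `None` because the sub-address itself is corrupted, which is the failure case the present application aims to avoid.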
Disclosure of Invention
The embodiments of the application provide a data processing method and system, a storage medium and a computing device, to at least solve the technical problem in the related art that, because data processing is implemented by searching an address library for clean address text, processing accuracy is low when the address library does not contain the corresponding clean address text.
According to an aspect of an embodiment of the present application, there is provided a data processing method including: acquiring an original address text, wherein an error exists in the original address text; performing word segmentation processing on the original address text by using the trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area; processing the original first granularity geographical area by using the trained text generation model to obtain a target first granularity geographical area, wherein the target first granularity geographical area is correct; a target address text is generated based on the target first granularity geographical region and the original second granularity geographical region.
According to another aspect of the embodiment of the present application, there is also provided a data processing method, including: acquiring an original address text sent by a client, wherein an error exists in the original address text; performing word segmentation processing on the original address text by using the trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area; processing the original first granularity geographical area by using the trained text generation model to obtain a target first granularity geographical area, wherein the target first granularity geographical area is correct; generating a target address text based on the target first granularity geographical area and the original second granularity geographical area; and sending the target address text to the client.
According to another aspect of the embodiment of the present application, there is also provided a data processing method, including: triggering a client to generate a processing instruction; the client acquires an original address text based on the processing instruction, wherein an error exists in the original address text; the client sends the original address text to a server and receives a target address text returned by the server, wherein the target address text is generated based on a target first granularity geographical area and an original second granularity geographical area, the target first granularity geographical area is obtained by processing the original first granularity geographical area by using a trained text generation model, the original first granularity geographical area and the original second granularity geographical area are obtained by word segmentation of the original address text by using a trained sequence labeling model, and the target first granularity geographical area is correct; the client outputs the target address text.
According to another aspect of the embodiment of the present application, there is also provided a data processing method, including: triggering a client to generate a processing instruction; the client acquires an image containing an original address text based on the processing instruction, wherein an error exists in the original address text; the client sends the image to a server and receives a target address text returned by the server, wherein the image is recognized by the server, the target address text is generated based on a target first granularity geographical area and an original second granularity geographical area, the target first granularity geographical area is obtained by processing the original first granularity geographical area by using a trained text generation model, the original first granularity geographical area and the original second granularity geographical area are obtained by word segmentation of the original address text by using a trained sequence labeling model, and the target first granularity geographical area is correct; the client outputs the target address text.
According to another aspect of the embodiment of the present application, there is also provided a storage medium, where the storage medium includes a stored program, and when the program runs, the device on which the storage medium is located is controlled to execute the above-mentioned data processing method.
According to another aspect of embodiments of the present application, there is also provided a computing device including: a memory and a processor, wherein the processor is configured to run a program stored in the memory, and the program, when run, executes the above-mentioned data processing method.
According to another aspect of an embodiment of the present application, there is also provided a data processing system including: a processor; and a memory, coupled to the processor, for providing instructions to the processor for processing the steps of: acquiring an original address text, wherein an error exists in the original address text; performing word segmentation processing on the original address text by using the trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area; processing the original first granularity geographical area by using the trained text generation model to obtain a target first granularity geographical area, wherein the target first granularity geographical area is correct; a target address text is generated based on the target first granularity geographical region and the original second granularity geographical region.
According to another aspect of the embodiment of the present application, there is also provided a data processing method, including: acquiring original data; processing the original data by using the trained first model to obtain original first granularity data and original second granularity data; processing the original first granularity data by using the trained second model to obtain target first granularity data; target data is generated based on the target first granularity data and the original second granularity data.
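The generic two-model method above can be sketched in Python as a composition of callables. This is a minimal sketch under stated assumptions: the names `process`, `toy_first_model` and `toy_second_model`, the whitespace-based split, and the lookup-table correction are all hypothetical stand-ins for trained models and are not taken from the application.

```python
from typing import Callable, Tuple


def process(
    raw: str,
    first_model: Callable[[str], Tuple[str, str]],
    second_model: Callable[[str], str],
) -> str:
    """Split raw data into two granularities, correct the first, recombine."""
    original_first, original_second = first_model(raw)
    target_first = second_model(original_first)
    return f"{target_first} {original_second}"


def toy_first_model(text: str) -> Tuple[str, str]:
    # Stand-in for a trained first model: the first two tokens are treated
    # as first-granularity data, the remainder as second-granularity data.
    tokens = text.split(" ")
    return " ".join(tokens[:2]), " ".join(tokens[2:])


def toy_second_model(first: str) -> str:
    # Stand-in for a trained second model: a fixed, illustrative lookup.
    return {"Zhehong Kangzhou": "Zhejiang Hangzhou"}.get(first, first)


print(process("Zhehong Kangzhou Wulin square", toy_first_model, toy_second_model))
```

Only the first-granularity part passes through the second model; the second-granularity part is carried through unchanged, as in the method described above.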
In the embodiment of the application, after the original address text containing errors is obtained, word segmentation processing is first performed on the original address text by using a trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area; the original first granularity geographical area is then processed by using a trained text generation model to obtain an error-corrected target first granularity geographical area; and the target address text is obtained by combining the target first granularity geographical area with the original second granularity geographical area, thereby achieving address error correction. Compared with the related art, the administrative division is extracted through sequence labeling, and a clean administrative division is then generated from the noisy administrative division through text generation. The original address text is not required to contain a clean sub-address, and the entire administrative division text can be corrected. This improves the accuracy of address error correction and solves the technical problem in the related art that, because data processing is implemented by searching an address library for clean address text, processing accuracy is low when the address library does not contain the corresponding clean address text.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a block diagram of a hardware structure of a computer terminal for implementing a data processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of a data processing method according to embodiment 1 of the present application;
FIG. 3 is a schematic diagram of a scenario of an alternative data processing method according to an embodiment of the present application;
FIG. 4 is a flow chart of an alternative address processing method according to an embodiment of the application;
FIG. 5 is a flowchart of a data processing method according to embodiment 2 of the present application;
FIG. 6 is a flowchart of a data processing method according to embodiment 3 of the present application;
FIG. 7 is a schematic diagram of an alternative user interface according to an embodiment of the application;
FIG. 8 is a flowchart of a data processing method according to embodiment 4 of the present application;
FIG. 9 is a schematic diagram of a data processing apparatus according to embodiment 5 of the present application;
FIG. 10 is a schematic view of a data processing apparatus according to embodiment 6 of the present application;
FIG. 11 is a schematic view of a data processing apparatus according to embodiment 7 of the present application;
FIG. 12 is a schematic view of a data processing apparatus according to embodiment 8 of the present application;
FIG. 13 is a flowchart of a data processing method according to embodiment 10 of the present application; and
FIG. 14 is a block diagram of a computer terminal according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the terms appearing in the description of the embodiments of the application are explained as follows:
Sequence labeling: may refer to labeling the attributes of each element in a sequence.
Text generation: may refer to generating text from data.
Administrative division: may be divided into three levels: provincial-level, county-level and township-level administrative districts.
Bidirectional long short-term memory network: Bi-directional Long Short-Term Memory (BiLSTM), a combination of a forward LSTM and a backward LSTM, where an LSTM may be a recurrent neural network.
Conditional random field: Conditional Random Fields (CRF), which may be a model of the conditional probability distribution of a set of output random variables given a set of input random variables, characterized by the assumption that the output random variables constitute a Markov random field.
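For reference, one common way to write a linear-chain conditional random field over a label sequence is given below. This formulation is an assumption added for illustration; the application does not state a formula. Here $x$ is the input sequence of length $n$, $y = (y_1, \dots, y_n)$ is a label sequence, $E_{t, y_t}$ is a hypothetical per-position score for label $y_t$ (for example, produced by a BiLSTM), and $T_{y_{t-1}, y_t}$ is a hypothetical transition score between adjacent labels.

```latex
s(x, y) = \sum_{t=1}^{n} E_{t,\, y_t} + \sum_{t=2}^{n} T_{y_{t-1},\, y_t},
\qquad
P(y \mid x) = \frac{\exp\bigl(s(x, y)\bigr)}{\sum_{y'} \exp\bigl(s(x, y')\bigr)}
```

The sum in the denominator runs over all candidate label sequences $y'$, and the dependence of each label only on its neighbor reflects the Markov assumption mentioned in the definition above.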
Example 1
According to an embodiment of the present application, a data processing method is provided. It should be noted that the steps shown in the flowcharts of the figures may be performed in a computer system, such as one executing a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order other than that shown or described herein.
The method according to the first embodiment of the present application may be implemented in a mobile terminal, a computer terminal or a similar computing device. FIG. 1 shows a block diagram of a hardware architecture of a computer terminal (or mobile device) for implementing a data processing method. As shown in FIG. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, processing devices such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in FIG. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). As referred to in embodiments of the application, the data processing circuit acts as a type of processor control (e.g., selection of the path of the variable resistor termination connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the data processing method in the embodiment of the present application, and the processor 102 executes the software programs and modules stored in the memory 104, thereby executing various functional applications and data processing, that is, implementing the data processing method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
In the above-described operating environment, the present application provides a data processing method as shown in fig. 2. Fig. 2 is a flowchart of a data processing method according to embodiment 1 of the present application. As shown in fig. 2, the method may include the steps of:
Step S202, an original address text is obtained, wherein errors exist in the original address text;
The original address text in the above step may be an address text obtained by recognizing a photographed image through OCR. Because of noise such as uneven illumination and creases, the recognized address text may contain errors and is therefore noisy address text. The address text may be in different languages, such as Chinese or English; in the embodiment of the present application, a Chinese address text is used as an example.
The scheme provided by the embodiment of the application mainly corrects the administrative division part, which is the most important part of the address text; accordingly, the errors in the original address text are taken to lie in its administrative division part.
In an alternative embodiment, the user may take a picture of the address information, recognize the corresponding original noisy address text by OCR, and upload the original noisy address text to the server. In another alternative embodiment, the user may take a picture of the address information and upload the image to a server, which recognizes the corresponding original noisy address text by OCR.
For example, taking the scenario shown in fig. 3 as an example, the server may be used as a data processing device to process the original address text by using the method provided by the embodiment of the present application, where the process flow is shown in fig. 4. The user can send the original noisy address text "Zhehong Kangzhou Wulin square" to the data processing device, and the data processing device takes the received "Zhehong Kangzhou Wulin square" as the original address text and processes it.
Step S204, word segmentation processing is carried out on the original address text by utilizing the trained sequence labeling model, and an original first granularity geographical area and an original second granularity geographical area are obtained;
The trained sequence labeling model in the above step may be a model such as RNN, BiLSTM or BiLSTM-CRF. In the embodiment of the application, the BiLSTM-CRF model is taken as an example: combining a bidirectional long short-term memory network (BiLSTM) with a conditional random field (CRF) performs sequence labeling well. The original first granularity geographical area in the above step may be the original administrative division text in the original address text, and the original second granularity geographical area may be the detailed address text in the original address text.
For example, still taking the scenario shown in fig. 3 as an example, after receiving the original address text "Zhehong Kangzhou Wulin square", the data processing apparatus first performs word segmentation processing on the original address text by using the sequence labeling model, and extracts the original first granularity geographical area (i.e., the administrative division part) "Zhehong Kangzhou" and the original second granularity geographical area (i.e., the detailed address part) "Wulin square".
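The division of an address by labels can be sketched in Python as follows. This is a minimal sketch, assuming a hypothetical two-label scheme (`ADMIN` and `DETAIL`) applied to whitespace-separated tokens; the application does not specify the label set, and the labels here are supplied by hand rather than predicted by a trained model.

```python
from typing import List, Tuple


def split_by_tags(tokens: List[str], tags: List[str]) -> Tuple[str, str]:
    """Group tokens into an administrative part and a detailed address part."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    admin = [tok for tok, tag in zip(tokens, tags) if tag == "ADMIN"]
    detail = [tok for tok, tag in zip(tokens, tags) if tag == "DETAIL"]
    return " ".join(admin), " ".join(detail)


tokens = ["Zhehong", "Kangzhou", "Wulin", "square"]
tags = ["ADMIN", "ADMIN", "DETAIL", "DETAIL"]  # hand-written labels for illustration
print(split_by_tags(tokens, tags))  # ('Zhehong Kangzhou', 'Wulin square')
```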
Step S206, processing the original first granularity geographical area by using the trained text generation model to obtain a target first granularity geographical area, wherein the target first granularity geographical area is correct;
The text generation model in the above step may be a model such as Seq2Seq or N-gram. In the embodiment of the present application, a Seq2Seq model with an attention mechanism is described as an example. The target first granularity geographical area may be the administrative division text generated by correcting the noisy administrative division text.
For example, still taking the scenario shown in fig. 3 as an example, after the user confirms that the word segmentation is correct, the data processing apparatus may process the original first granularity geographical area "Zhehong Kangzhou" by using the text generation model to obtain the target first granularity geographical area "Zhejiang Hangzhou", where the obtained target first granularity geographical area "Zhejiang Hangzhou" is completely correct.
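The attention mechanism mentioned above can be illustrated with a single decoding step in Python. This is a minimal sketch assuming dot-product attention over encoder states; the application does not specify which attention variant is used, and the function names `softmax` and `attend` and the small vectors are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores to weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state: List[float], encoder_states: List[List[float]]) -> List[float]:
    """Return the attention-weighted average of the encoder states."""
    # Score each encoder state by its dot product with the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    return [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]


# The decoder state aligns with the first encoder state, so the context
# vector is dominated by that state.
context = attend([10.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(context)
```

In a Seq2Seq corrector, a context vector of this kind would let each generated output character focus on the relevant characters of the noisy input.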
Step S208, generating a target address text based on the target first granularity geographical area and the original second granularity geographical area.
In an alternative embodiment, after the original noisy address text is obtained, sequence labeling may be performed on it using the BiLSTM-CRF model, dividing the address into an administrative division part and a detailed address part. After the division, a Seq2Seq model with an attention mechanism may be applied to the administrative division text to generate an error-corrected administrative division part. Further, by combining the error-corrected administrative division part with the detailed address part, a corrected address text can be obtained, and the original noisy address text can be replaced with the corrected address text.
For example, still taking the scenario shown in fig. 3 as an example, the data processing apparatus may combine the target first granularity geographical area "Zhejiang Hangzhou" and the original second granularity geographical area "Wulin square" to obtain the final target address text (i.e., the corrected address text) "Zhejiang Hangzhou Wulin square" and output the target address text to the user for viewing.
It should be noted that, the data processing device may output the original first granularity geographical area, the original second granularity geographical area, and the target first granularity geographical area to the user for checking and confirming, and the user confirms whether the word segmentation is correct and whether the error correction result is correct.
The embodiment of the application can be applied to the field of newly manufactured material delivery. A material delivery address can be divided into two segments: the trained sequence labeling model performs word segmentation on the material delivery address to obtain the two segments, and the trained text generation model then processes them, thereby achieving error correction of the two-segment address.
According to the scheme provided by the embodiment of the application, after the original address text with errors is obtained, word segmentation is first performed on the original address text using the trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area. The original first granularity geographical area is then processed using the trained text generation model to obtain an error-corrected target first granularity geographical area, and the target address text is obtained by combining the target first granularity geographical area and the original second granularity geographical area, thereby achieving address error correction. Compared with the related art, the administrative division is extracted through sequence labeling, and the noisy administrative division is then converted into a clean administrative division through text generation. The original address text is not required to contain a clean sub-address, and the entire administrative division text can be corrected, which improves the accuracy of address error correction. This solves the technical problem in the related art that, because the data processing method relies on searching an address library for a clean address text, processing accuracy is low when the address library does not contain the corresponding clean address text.
In the above embodiment of the present application, word segmentation processing is performed on an original address text by using a trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area, including: processing the original address text to obtain a first feature vector of the original address text; processing the first feature vector by using a bidirectional long short-term memory network to obtain a probability matrix of the original address text, wherein the probability matrix comprises: a probability set for each first feature in the first feature vector, the probability set comprising: a plurality of labels, and probability values corresponding to each label; processing the probability matrix by using a conditional random field to obtain a target labeling sequence of the original address text, wherein the target labeling sequence comprises: a target tag for each first feature; and dividing the original address text based on the target labeling sequence to obtain an original first granularity geographical area and an original second granularity geographical area.
The first feature vector in the above step may be a word vector of the original address text, with each element in the vector representing a word in the original address text (i.e., the first feature described above). In the embodiment of the present application, the address text needs to be divided into an administrative division part and a detailed address part, so the labels in the above steps may mainly include: administrative division part labels (denoted DIV), detailed address part labels (denoted DET), and other labels (denoted O). In addition, each word in the address text may be located at the head (denoted B), middle (denoted I), or tail (denoted E) of its part.
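The labeling scheme can be illustrated with a short sketch that builds a tag sequence for an address whose two parts are already known. The function names `bie_tags` and `label_address` are hypothetical, the tail tag is written as E to match the E-DIV and E-DET tags used in the fig. 4 example, and tagging a single-element span as a head is an assumption made only for this illustration.

```python
from typing import List


def bie_tags(length: int, part: str) -> List[str]:
    """Tag a span of `length` elements with head/middle/tail prefixes for `part`."""
    if length == 0:
        return []
    if length == 1:
        # Assumption: a span of one element is tagged as a head.
        return [f"B-{part}"]
    return [f"B-{part}"] + [f"I-{part}"] * (length - 2) + [f"E-{part}"]


def label_address(division: str, detail: str) -> List[str]:
    """Concatenate the tags of the administrative part and the detailed part."""
    return bie_tags(len(division), "DIV") + bie_tags(len(detail), "DET")


# Placeholder strings; each character stands for one element of the address text.
print(label_address("abcd", "wxyz"))
```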
The bidirectional LSTM encodes the sequence from both the forward and reverse directions, which can better encode context semantics. For LSTM, the problem that the traditional RNN cannot handle long-distance dependence can be successfully solved by using three gate structures of a forgetting gate f, an input gate i and an output gate o, wherein the forgetting gate f decides how much history information to keep, the input gate i decides how much new information to add, and the output gate o controls final output. The LSTM update formula is as follows:
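$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o),$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C),$$
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t,$$
$$h_t = o_t * \tanh(C_t),$$
Where $h_t$ is the output of the current step, $C_t$ is the state of the current step, and $W_f$, $b_f$, $W_i$, $b_i$, $W_o$, $b_o$, $W_C$, $b_C$ are the network parameters.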
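A single LSTM update can be sketched in plain Python for the scalar case. This is an illustrative sketch rather than the implementation of the application: the function `lstm_step` and the layout of `params` are assumptions, each gate's weight on the concatenation of the previous output and the current input is represented by a pair of scalar weights, and the candidate-state and state-update lines follow the standard LSTM formulation, which is assumed here because the parameters W_C and b_C are named in the text.

```python
import math
from typing import Dict, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(
    x_t: float,
    h_prev: float,
    c_prev: float,
    params: Dict[str, Tuple[float, float, float]],
) -> Tuple[float, float]:
    """One LSTM update with scalar input and scalar hidden state.

    params maps a gate name ("f", "i", "o", "C") to (w_h, w_x, b), the scalar
    counterpart of W . [h_{t-1}, x_t] + b.
    """
    def affine(name: str) -> float:
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(affine("f"))          # forget gate: how much history to keep
    i_t = sigmoid(affine("i"))          # input gate: how much new information to add
    o_t = sigmoid(affine("o"))          # output gate: controls the final output
    c_tilde = math.tanh(affine("C"))    # candidate state (standard form, assumed)
    c_t = f_t * c_prev + i_t * c_tilde  # state update (standard form, assumed)
    h_t = o_t * math.tanh(c_t)          # output of the current step
    return h_t, c_t


zero = {name: (0.0, 0.0, 0.0) for name in ("f", "i", "o", "C")}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=1.0, params=zero)
print(h, c)
```

With all parameters set to zero, every gate evaluates to 0.5 and the candidate state to 0, so the new state is half of the previous state.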
It should be noted that conventional LSTM-based sequence labeling encodes the sequence with an LSTM and then predicts the tag at each position through a softmax layer. The drawback of this approach is that only the most probable tag at the current step is considered: the outputs of different steps do not influence one another, and the probability of the tag sequence as a whole is not considered. A conditional random field can maximize the probability of the sequence from a global perspective. The BiLSTM-CRF model adopted in the embodiment of the application therefore combines the bidirectional LSTM with a conditional random field (CRF): after the original sequence is encoded using the bidirectional LSTM, a global loss function is calculated through a CRF layer, so that the probability of the output tag sequence is maximized.
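The difference between per-step prediction and global sequence scoring can be shown with a small decoding sketch. The application does not state which decoding algorithm the CRF layer uses; Viterbi decoding is used here as a common choice for linear-chain CRFs. The function `viterbi`, the two-tag set, and all scores below are hypothetical values chosen only to make the effect visible.

```python
from typing import Dict, List


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the tag sequence with the highest total emission + transition score."""
    tags = list(emissions[0])
    score = dict(emissions[0])            # best score of any path ending in each tag
    back: List[Dict[str, str]] = []       # back-pointers for each later step
    for em in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            best_prev = max(tags, key=lambda p: score[p] + transitions[p][cur])
            new_score[cur] = score[best_prev] + transitions[best_prev][cur] + em[cur]
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    last = max(tags, key=lambda t: score[t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores: picking the best tag at each step alone would give E then B.
emissions = [{"B": 1.0, "E": 1.2}, {"B": 1.1, "E": 1.0}]
# Hypothetical transition scores that strongly favour a head followed by a tail.
transitions = {"B": {"B": -5.0, "E": 0.0}, "E": {"B": -5.0, "E": -5.0}}
print(viterbi(emissions, transitions))
```

Here the per-step maxima are E and then B, but the globally best sequence is B followed by E, because the transition scores penalize every other pairing.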
For example, taking the address error correction flow shown in fig. 4 as an example, after the original sequence "zhe kangzhou martial arts" is obtained, the vector corresponding to the original sequence may be obtained and input into the BiLSTM model. Encoding with the BiLSTM model yields a coding result for each word, that is, a value for every tag for each word. As shown in fig. 4, for the 7 tags B-DIV, I-DIV, E-DIV, B-DET, I-DET, E-DET, and O, the coding result of "Zhe" is [1.5 1.2 0.5 1.0 0.4 0.3 0.8], the coding result of "red" is [1.1 1.0 1.0 0.8 0.4 0.3 0.6], the coding result of "kang" is [0.7 0.6 0.7 1.1 1.1 1.3 0.3], the coding result of "state" is [0.4 0.6 0.2 1.2 1.5 1.5 0.9], the coding result of "martial arts" is [0.8 0.7 0.6 1.7 1.3 1.1 1.0], the coding result of "forest" is [0.3 0.5 0.7 1.3 1.5 1.0 0.9], the coding result of "broad" is [0.6 1.0 0.9 0.8 1.2 1.4 1.1], and the coding result of "field" is [0.3 0.6 0.9 0.2 1.5 1.4 0.9].
Then, all the coding results are processed through a CRF layer, which calculates the probability values of the different coding combinations and outputs the coding combination corresponding to the maximum probability value. As shown in fig. 4, combination 1 is B-DIV, E-DIV, B-DET, I-DET, E-DET, with a probability value of 0.8; combination 2 is B-DIV, I-DIV, E-DIV, B-DET, I-DET, E-DET, with a probability value of 1.1; and so on. The coding combination corresponding to the maximum probability value is combination 2.
Further, based on the coding combination output by the CRF layer, the original sequence can be divided and the administrative division part extracted. For combination 2 (B-DIV, I-DIV, E-DIV, B-DET, I-DET, E-DET), the first four words can be determined to be the administrative division part and the last four words the detailed address part, so the administrative division text is "Zhehongzhizhou" and the detailed address text is "martial arts square".
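The division based on the target labeling sequence can be sketched as follows. The function name `split_by_tags` is hypothetical, the input characters are placeholders rather than the fig. 4 text, and dropping elements tagged O is an assumption, since the application does not say how such elements are handled.

```python
from typing import List, Sequence, Tuple


def split_by_tags(chars: Sequence[str], tags: List[str]) -> Tuple[str, str]:
    """Collect elements tagged *-DIV and *-DET into the two address parts."""
    division = "".join(c for c, t in zip(chars, tags) if t.endswith("-DIV"))
    detail = "".join(c for c, t in zip(chars, tags) if t.endswith("-DET"))
    return division, detail  # elements tagged "O" are dropped (assumption)


tags = ["B-DIV", "I-DIV", "I-DIV", "E-DIV", "B-DET", "I-DET", "I-DET", "E-DET"]
print(split_by_tags("abcdwxyz", tags))
```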
In the above embodiment of the present application, the method further includes: acquiring a plurality of first address texts, wherein each first address text is correct; generating a plurality of first training data based on the plurality of first address texts, wherein each first training data comprises: the second address text corresponding to each first address text and the labeling sequence of the second address text, and the second address text has errors; and training the sequence labeling model by using a plurality of first training data to obtain a trained sequence labeling model.
For the BiLSTM-CRF model, a large number of address texts need to be collected for training so that address text can be divided more accurately. These address texts include clean address texts and address texts with errors, and clean address texts are easier to collect than address texts with errors. In an alternative embodiment, to improve the training efficiency of the BiLSTM-CRF model, the training data may be synthesized through data augmentation: only a large amount of clean address text is collected, and the final training data is obtained by adding different kinds of noise to the address text.
In the above embodiment of the present application, generating a plurality of first training data based on a plurality of first address texts includes: filtering the first address text based on the first granularity geographical region vocabulary to obtain a first granularity geographical region and a second granularity geographical region; noise processing is carried out on the first granularity geographical area and the second granularity geographical area respectively, so that a processed first granularity geographical area and a processed second granularity geographical area are obtained; and generating a second address text based on the processed first granularity geographical region and the processed second granularity geographical region.
The first granularity geographical area vocabulary in the above step may be an administrative division vocabulary including detailed information on administrative divisions at all levels. For example, Zhejiang province includes 11 prefecture-level administrative divisions in total, and Hangzhou city under its jurisdiction includes 13 county-level administrative divisions in total.
In an alternative embodiment, after a large amount of clean address text is obtained, the administrative division part is filtered out according to the administrative division vocabulary, and the remainder is the detailed address part. Noise is then added to the administrative division part and the detailed address part, for example by deleting words, inserting extra words, or substituting wrong words, and finally the two parts are spliced to generate the final training data.
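The generation of the first training data can be sketched as below. This is a minimal illustration under stated assumptions: the helper names `add_noise`, `split_clean`, and `make_example` are hypothetical, the administrative part is located by a longest-prefix match against the vocabulary (the application only says it is filtered according to the vocabulary), exactly one random edit is applied per part, and the labels are simplified to DIV and DET without head, middle, and tail prefixes.

```python
import random
from typing import List, Sequence, Tuple


def add_noise(text: str, rng: random.Random, alphabet: str) -> str:
    """Apply one random edit: delete, insert, or substitute a single character."""
    if not text:
        return text
    pos = rng.randrange(len(text))
    op = rng.choice(["delete", "insert", "substitute"])
    if op == "delete":
        return text[:pos] + text[pos + 1:]
    if op == "insert":
        return text[:pos] + rng.choice(alphabet) + text[pos:]
    # Substitute with a different character so the result always changes.
    candidates = [c for c in alphabet if c != text[pos]]
    return text[:pos] + rng.choice(candidates) + text[pos + 1:]


def split_clean(address: str, divisions: Sequence[str]) -> Tuple[str, str]:
    """Longest-prefix match against the administrative division vocabulary."""
    for div in sorted(divisions, key=len, reverse=True):
        if address.startswith(div):
            return div, address[len(div):]
    return "", address


def make_example(
    address: str, divisions: Sequence[str], rng: random.Random, alphabet: str
) -> Tuple[str, List[str]]:
    """Build one (noisy address, label sequence) training example."""
    div, det = split_clean(address, divisions)
    noisy_div = add_noise(div, rng, alphabet)
    noisy_det = add_noise(det, rng, alphabet)
    labels = ["DIV"] * len(noisy_div) + ["DET"] * len(noisy_det)
    return noisy_div + noisy_det, labels


rng = random.Random(0)
text, labels = make_example("ABCDxyz", ["AB", "ABCD"], rng, "ABCDxyz")
print(text, labels)
```

Because the noise is applied to each part separately, the label of every character in the noisy text is known without any manual annotation.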
In the above embodiment of the present application, processing an original first granularity geographical area by using a trained text generation model to obtain a target first granularity geographical area includes: processing the original first granularity geographical area to obtain a second feature vector of the original first granularity geographical area; processing the second feature vector by using an encoder to obtain a target vector of the original first granularity geographic area; and processing the target vector and the historical output result corresponding to the target vector by using a decoder to obtain the target first granularity geographical region.
The second feature vector in the above step may be a word vector of the original administrative division text, each element in the vector representing a word in the original administrative division text. The target vector may be the output of the current step of the encoder. The historical output result may be the output result of the last step of the decoder.
The Seq2Seq model with an attention mechanism integrates the attention mechanism into the Seq2Seq model, which makes it better suited to sequence-to-sequence generation. Seq2Seq is a network architecture comprising an encoder and a decoder: the encoder encodes a variable-length input sequence into a fixed-length vector representation, and the decoder then generates an output sequence from this vector. The encoder and decoder typically employ RNN networks and variants thereof.
The conventional Seq2Seq compresses the input sequence into a fixed-length vector representation, so when the input sequence is long, the vector representation does not characterize the input sequence well, resulting in a significant drop in model effectiveness. To address this problem, the attention mechanism is introduced into the Seq2Seq network architecture. The attention mechanism introduces a context vector $c_i$, where $c_i$ depends on the information of the whole input sequence $(h_1, \ldots, h_T)$; the calculation formula is as follows:
Where $s_i$ denotes the current decoder output information, $h_j$ denotes the output information of the j-th step of the input sequence, and $v_a$, $W_a$, and $U_a$ are network parameters.
The attention mechanism enables the output of each step of the decoder to depend not only on the vector representation into which the input sequence is encoded, but also on the encoder output at each step of the input sequence, which handles long sequences well. Because the input is noisy text on which word segmentation cannot be performed, the present solution uses individual characters as model inputs, and the output must be words in the administrative division vocabulary.
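The formula itself is not reproduced in this text. A form that is consistent with the symbols defined above is additive attention, in which each encoder output is scored as v_a * tanh(W_a * s + U_a * h_j), the scores are normalized with a softmax, and the context vector is the weighted sum of the encoder outputs. The sketch below assumes that form and uses scalars instead of vectors and matrices; the function name `additive_attention` and all numeric values are illustrative assumptions, not taken from the application.

```python
import math
from typing import List, Tuple


def additive_attention(
    s: float, hs: List[float], v_a: float, w_a: float, u_a: float
) -> Tuple[float, List[float]]:
    """Scalar additive attention: returns (context value, attention weights)."""
    # Score every encoder output against the decoder information s.
    scores = [v_a * math.tanh(w_a * s + u_a * h) for h in hs]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context is the attention-weighted sum of all encoder outputs.
    context = sum(a * h for a, h in zip(weights, hs))
    return context, weights


context, weights = additive_attention(s=0.5, hs=[1.0, 2.0, 3.0], v_a=1.0, w_a=1.0, u_a=1.0)
print(context, weights)
```

Because the weights are recomputed at every decoder step, each output can draw on different positions of the input sequence instead of a single fixed-length vector.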
For example, still taking the address error correction flow shown in fig. 4 as an example, after the administrative division sequence "Zhehong Zhuzhou" is obtained, the vector corresponding to the administrative division sequence may be obtained and input into the BiLSTM model, which produces one output $h_j$ at each time step: the output corresponding to "Zhe" is 0.6, the output corresponding to "red" is 0.2, the output corresponding to "kang" is 0.1, and the output corresponding to "state" is 0.1, so that all output results are obtained.
The input to the decoder at the initial time step is the start symbol "<s>", and an output sequence is complete when the decoder produces the end symbol "</s>" at some time step. In the first step, the decoder obtains the output result "Zhejiang" based on all output results of the encoder and the context vector calculated from them. In the second step, it obtains the output result "Hangzhou" based on the first-step output result "Zhejiang" and the context vector calculated from all output results of the encoder. In the third step, it obtains the output result "</s>" based on the second-step output result "Hangzhou" and the context vector calculated from all output results of the encoder; the output sequence is then complete, and the error-corrected administrative division text is determined to be "Hangzhou Zhejiang".
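The step-by-step decoding described above can be sketched as a simple loop. The function `greedy_decode` and the lookup table are hypothetical: in the described model each step would also take the context vector computed from the encoder outputs, whereas here a fixed table stands in for the trained decoder so that only the control flow is shown.

```python
from typing import Callable, List


def greedy_decode(step: Callable[[str], str], max_steps: int = 10) -> List[str]:
    """Feed each output back as the next input until the end symbol appears."""
    outputs: List[str] = []
    prev = "<s>"                  # the initial input is the start symbol
    for _ in range(max_steps):    # guard against a decoder that never stops
        token = step(prev)
        if token == "</s>":       # the end symbol completes the output sequence
            break
        outputs.append(token)
        prev = token
    return outputs


# Toy stand-in for the trained decoder (assumption for illustration only).
toy_table = {"<s>": "Zhejiang", "Zhejiang": "Hangzhou", "Hangzhou": "</s>"}
print(greedy_decode(toy_table.__getitem__))
```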
In the above embodiment of the present application, the method further includes: obtaining a first granularity geographical area vocabulary, wherein the first granularity geographical area vocabulary comprises: a plurality of first granularity geographical area words, each first granularity geographical area word being correct; generating a plurality of second training data based on the first granularity geographical region vocabulary, wherein each second training data comprises: positive and negative samples of the first granularity geographic area, the positive sample being correct and there being an error in the negative sample; and training the text generation model by using the plurality of second training data to obtain a trained text generation model.
Optionally, the type of a first granularity geographical area word may include one of: name, suffix, and ethnic minority name.
The first granularity geographical area word in the above step may be an administrative division word in an administrative division word list, the positive sample may be correct administrative division text, and the negative sample may be wrong administrative division text.
Because aliases are common in address text, in the embodiment of the application the standard administrative division words are split into three parts, namely administrative division names, administrative division suffixes, and ethnic minority names, which together serve as the final output vocabulary. In this way, word-level errors cannot occur in the output, and administrative division aliases in the original text are not forcibly corrected into the standard administrative division, so the original form of the address is preserved to the greatest extent.
Likewise, in an alternative embodiment, the training data for the Seq2Seq model with an attention mechanism may also be synthesized through data augmentation: starting from the administrative division vocabulary, different administrative division combinations are generated by permutation and combination, and the final training data is obtained by adding different kinds of noise to the combinations.
In the above embodiment of the present application, generating a plurality of second training data based on the first granularity geographical area vocabulary includes: permuting and combining the plurality of first granularity geographical area words to obtain a positive sample; performing noise processing on the positive sample to obtain a negative sample; and generating second training data based on the positive and negative samples.
In an alternative embodiment, after the administrative division vocabulary is obtained, all possible administrative division combinations are first generated by permutation and combination, and noise is then added to each combination. The noise types include missing words, extra words, wrong words, missing administrative division suffixes, missing ethnic minority names, and the like. Finally, the administrative division after noise addition and the original clean administrative division are paired as the input and output training data for error correction.
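The construction of the second training data can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the vocabulary is represented as a two-level mapping from a province to its cities, the only combinations generated are a province alone, a city alone, and a province followed by one of its cities, and a single noise type (one missing character) is applied. The other noise types listed above would be added in the same way.

```python
import random
from typing import Dict, List, Tuple


def positive_samples(hierarchy: Dict[str, List[str]]) -> List[str]:
    """Enumerate clean administrative division combinations (positive samples)."""
    samples: List[str] = []
    for province, cities in hierarchy.items():
        samples.append(province)
        for city in cities:
            samples.append(city)
            samples.append(province + city)
    return samples


def drop_one_character(text: str, rng: random.Random) -> str:
    """One noise type only: delete a single randomly chosen character."""
    pos = rng.randrange(len(text))
    return text[:pos] + text[pos + 1:]


def training_pairs(
    hierarchy: Dict[str, List[str]], rng: random.Random
) -> List[Tuple[str, str]]:
    """Pair each noisy negative sample (input) with its clean positive sample (output)."""
    return [(drop_one_character(clean, rng), clean) for clean in positive_samples(hierarchy)]


pairs = training_pairs({"Zhejiang": ["Hangzhou", "Ningbo"]}, random.Random(0))
for noisy, clean in pairs:
    print(noisy, "->", clean)
```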
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present application.
Example 2
There is also provided in accordance with an embodiment of the present application a data processing method, it being noted that the steps shown in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order other than that shown or described herein.
Fig. 5 is a flowchart of a data processing method according to embodiment 2 of the present application. As shown in fig. 5, the method may include the steps of:
Step S502, an original address text sent by a client is obtained, wherein an error exists in the original address text;
The client in the above steps may be a mobile terminal of the user such as a smart phone (including an Android mobile phone and an iOS mobile phone), a tablet computer, a palm computer, an iPad, a notebook computer, or a personal computer, but is not limited thereto, and may be another device capable of capturing the address information.
Step S504, word segmentation processing is carried out on the original address text by utilizing the trained sequence labeling model, and an original first granularity geographical area and an original second granularity geographical area are obtained;
Step S506, the original first granularity geographical area is processed by using the trained text generation model, and a target first granularity geographical area is obtained, wherein the target first granularity geographical area is correct;
Step S508, generating a target address text based on the target first granularity geographical area and the original second granularity geographical area;
Step S510, the target address text is sent to the client.
In an alternative embodiment, after correcting the original noise address text, the server may return the corrected address text to the client, so that the user may view the corrected address text, and tracking of subsequent logistics nodes is facilitated.
In the above embodiment of the present application, obtaining an original address text sent by a client includes: receiving an image which is sent by a client and contains an original address text; and carrying out image recognition on the image to obtain an original address text.
In an alternative embodiment, the user may take a picture of the address information through the client, recognize the corresponding original noise address text through OCR, and upload the original noise address text to the server. In another alternative embodiment, the user may take a picture of the address information through the client and upload the image to the server, which recognizes the corresponding original noise address text through OCR.
In the above embodiment of the present application, word segmentation processing is performed on an original address text by using a trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area, including: processing the original address text to obtain a first feature vector of the original address text; processing the first feature vector by using a bidirectional long short-term memory network to obtain a probability matrix of the original address text, wherein the probability matrix comprises: a probability set for each first feature in the first feature vector, the probability set comprising: a plurality of labels, and probability values corresponding to each label; processing the probability matrix by using a conditional random field to obtain a target labeling sequence of the original address text, wherein the target labeling sequence comprises: a target tag for each first feature; and dividing the original address text based on the target labeling sequence to obtain an original first granularity geographical area and an original second granularity geographical area.
In the above embodiment of the present application, the method further includes: acquiring a plurality of first address texts, wherein each first address text is correct; generating a plurality of first training data based on the plurality of first address texts, wherein each first training data comprises: the second address text corresponding to each first address text and the labeling sequence of the second address text, and the second address text has errors; and training the sequence labeling model by using a plurality of first training data to obtain a trained sequence labeling model.
In the above embodiment of the present application, generating a plurality of first training data based on a plurality of first address texts includes: filtering the first address text based on the first granularity geographical region vocabulary to obtain a first granularity geographical region and a second granularity geographical region; noise processing is carried out on the first granularity geographical area and the second granularity geographical area respectively, so that a processed first granularity geographical area and a processed second granularity geographical area are obtained; and generating a second address text based on the processed first granularity geographical region and the processed second granularity geographical region.
In the above embodiment of the present application, processing an original first granularity geographical area by using a trained text generation model to obtain a target first granularity geographical area includes: processing the original first granularity geographical area to obtain a second feature vector of the original first granularity geographical area; processing the second feature vector by using an encoder to obtain a target vector of the original first granularity geographic area; and processing the target vector and the historical output result corresponding to the target vector by using a decoder to obtain the target first granularity geographical region.
In the above embodiment of the present application, the method further includes: obtaining a first granularity geographical area vocabulary, wherein the first granularity geographical area vocabulary comprises: a plurality of first granularity geographical area words, each first granularity geographical area word being correct; generating a plurality of second training data based on the first granularity geographical region vocabulary, wherein each second training data comprises: positive and negative samples of the first granularity geographic area, the positive sample being correct and there being an error in the negative sample; and training the text generation model by using the plurality of second training data to obtain a trained text generation model.
In the above embodiment of the present application, generating a plurality of second training data based on the first granularity geographical area vocabulary includes: permuting and combining the plurality of first granularity geographical area words to obtain a positive sample; performing noise processing on the positive sample to obtain a negative sample; and generating second training data based on the positive and negative samples.
It should be noted that the preferred implementations, application scenarios, and implementation process of this embodiment are the same as those provided in embodiment 1, but are not limited to the scheme provided in embodiment 1.
Example 3
There is also provided in accordance with an embodiment of the present application a data processing method, it being noted that the steps shown in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order other than that shown or described herein.
Fig. 6 is a flowchart of a data processing method according to embodiment 3 of the present application. As shown in fig. 6, the method may include the steps of:
Step S602, triggering a client to generate a processing instruction;
The processing instruction in the above step may be an instruction generated by the user operating the client when the user needs to perform address correction, for example, an instruction triggered by the user to take a photograph, or an instruction triggered by the user when the user needs to upload the address text to the server.
Step S604, the client acquires an original address text based on the processing instruction, wherein an error exists in the original address text;
For example, as shown in fig. 7, the client provides an operation interface for the user, and the user may click on the "photograph" button to trigger the client to photograph the address text, or the user may click on the "album" button to trigger the client to open the album, and the user selects the photograph containing the address text in the album. After the above steps are completed, the user uploads the photograph containing the address text or the identified address text to the server by clicking the "address correction" button.
Step S606, the client sends an original address text to the server and receives a target address text returned by the server, wherein the target address text is generated based on a target first granularity geographical area and an original second granularity geographical area, the target first granularity geographical area is obtained by processing the original first granularity geographical area by using a trained text generation model, and the original first granularity geographical area and the original second granularity geographical area are obtained by word segmentation of the original address text by using a trained sequence labeling model, and the target first granularity geographical area is correct;
The server in the above step may be a server capable of performing an address text error correction function, for example, a cloud server, but is not limited thereto.
Step S608, the client outputs the target address text.
In an alternative embodiment, after correcting the original noise address text, the server may return the corrected address text to the client, so that the user may view the corrected address text, and tracking of subsequent logistics nodes is facilitated.
For example, as shown in fig. 7, the error corrected address text may be displayed in the operation interface.
In the above embodiment of the present application, the client obtains the original address text based on the processing instruction, including: the client acquires an image containing an original address text based on a processing instruction; and the client performs image recognition on the image to obtain an original address text.
In the above embodiment of the present application, the original first granularity geographical area and the original second granularity geographical area are obtained by processing the original address text by using a trained sequence labeling model in the following manner: processing the original address text to obtain a first feature vector of the original address text; processing the first feature vector by using a bidirectional long short-term memory network to obtain a probability matrix of the original address text, wherein the probability matrix comprises: a probability set for each first feature in the first feature vector, the probability set comprising: a plurality of labels, and probability values corresponding to each label; processing the probability matrix by using a conditional random field to obtain a target labeling sequence of the original address text, wherein the target labeling sequence comprises: a target tag for each first feature; and dividing the original address text based on the target labeling sequence to obtain an original first granularity geographical area and an original second granularity geographical area.
In the above embodiment of the present application, the trained sequence annotation model is obtained by the following method: acquiring a plurality of first address texts, wherein each first address text is correct; generating a plurality of first training data based on the plurality of first address texts, wherein each first training data comprises: the second address text corresponding to each first address text and the labeling sequence of the second address text, and the second address text has errors; and training the sequence labeling model by using a plurality of first training data to obtain a trained sequence labeling model.
In the above embodiment of the present application, the plurality of first training data is generated based on the plurality of first address texts by: filtering the first address text based on the first granularity geographical region vocabulary to obtain a first granularity geographical region and a second granularity geographical region; noise processing is carried out on the first granularity geographical area and the second granularity geographical area respectively, so that a processed first granularity geographical area and a processed second granularity geographical area are obtained; and generating a second address text based on the processed first granularity geographical region and the processed second granularity geographical region.
In the above embodiment of the present application, the target first granularity geographical area is obtained by processing the original first granularity geographical area by using the trained text generation model in the following manner: processing the original first granularity geographical area to obtain a second feature vector of the original first granularity geographical area; processing the second feature vector by using an encoder to obtain a target vector of the original first granularity geographic area; and processing the target vector and the historical output result corresponding to the target vector by using a decoder to obtain the target first granularity geographical region.
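The role of the "historical output result" in the decoder can be shown with a small greedy decoding loop. This is only a sketch: `encode` and `toy_decoder_step` are toy stand-ins (they simply copy the input) for the trained encoder and decoder, and the `<eos>` marker and `max_len` limit are assumptions of the example.

```python
from typing import Callable, List

EOS = "<eos>"  # hypothetical end-of-sequence marker

def encode(text: str) -> List[int]:
    """Toy stand-in for feature extraction plus encoder:
    uses the code points of the text as the 'target vector'."""
    return [ord(ch) for ch in text]

def toy_decoder_step(target_vector: List[int], history: List[str]) -> str:
    """Toy stand-in for a trained decoder: copies one input symbol per step
    and emits EOS when the input is exhausted."""
    pos = len(history)
    return chr(target_vector[pos]) if pos < len(target_vector) else EOS

def greedy_decode(target_vector: List[int],
                  decoder_step: Callable[[List[int], List[str]], str],
                  max_len: int = 32) -> str:
    """Generate the output one symbol at a time. Each step sees the target
    vector and everything generated so far (the historical output)."""
    history: List[str] = []
    for _ in range(max_len):
        token = decoder_step(target_vector, history)
        if token == EOS:
            break
        history.append(token)
    return "".join(history)
```

A trained decoder would replace `toy_decoder_step` and could emit a corrected symbol at any step; the loop itself would stay the same.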
In the above embodiment of the present application, the trained text generation model is obtained by: obtaining a first granularity geographical area vocabulary, wherein the first granularity geographical area vocabulary comprises: a plurality of first granularity geographical area words, each first granularity geographical area word being correct; generating a plurality of second training data based on the first granularity geographical region vocabulary, wherein each second training data comprises: positive and negative samples of the first granularity geographic area, the positive sample being correct and there being an error in the negative sample; and training the text generation model by using the plurality of second training data to obtain a trained text generation model.
In the above embodiment of the present application, the plurality of second training data is generated based on the first granularity geographical area vocabulary by: permuting and combining the plurality of first granularity geographical area words to obtain the positive sample; performing noise processing on the positive sample to obtain the negative sample; and generating the second training data based on the positive sample and the negative sample.
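One possible reading of this step is sketched below. It assumes that the words are hierarchical levels (for example province, city, district), that positive samples are order-preserving combinations of those levels, and that noise is a single dropped or duplicated character. These choices and all names are illustrative assumptions rather than details taken from the present application.

```python
import itertools
import random
from typing import List, Sequence, Tuple

def positive_samples(levels: Sequence[str]) -> List[str]:
    """Concatenate every non-empty, order-preserving combination of levels."""
    out = []
    for r in range(1, len(levels) + 1):
        for combo in itertools.combinations(levels, r):
            out.append("".join(combo))
    return out

def corrupt(text: str, rng: random.Random) -> str:
    """Drop or duplicate one character of a non-empty text, so the result
    always differs in length from the input."""
    i = rng.randrange(len(text))
    if len(text) > 1 and rng.random() < 0.5:
        return text[:i] + text[i + 1:]        # drop character i
    return text[:i] + text[i] + text[i:]      # duplicate character i

def make_pairs(levels: Sequence[str], rng: random.Random) -> List[Tuple[str, str]]:
    """Return (negative sample, positive sample) pairs, i.e. (input, target)."""
    return [(corrupt(p, rng), p) for p in positive_samples(levels)]
```

Each pair can then serve as one item of second training data, with the negative sample as the model input and the positive sample as the expected output.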
It should be noted that the preferred implementations of the foregoing embodiment are the same as the scheme, application scenario, and implementation process provided in embodiment 1, but are not limited to the scheme provided in embodiment 1.
Example 4
According to an embodiment of the present application, there is also provided a data processing method. It should be noted that the steps shown in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one shown or described herein.
Fig. 8 is a flowchart of a data processing method according to embodiment 4 of the present application. As shown in fig. 8, the method may include the steps of:
Step S802, triggering a client to generate a processing instruction;
Step S804, the client obtains an image containing an original address text based on the processing instruction, wherein an error exists in the original address text;
For example, as shown in fig. 7, the client provides an operation interface for the user, and the user may click on the "photograph" button to trigger the client to photograph the address text, or the user may click on the "album" button to trigger the client to open the album, and the user selects the photograph containing the address text in the album. After the above steps are completed, the user uploads the photo containing the address text to the server by clicking the "address correction" button.
Step S806, the client sends an image to the server and receives a target address text returned by the server, wherein the image is identified by the server, the target address text is generated based on a target first granularity geographical area and an original second granularity geographical area, the target first granularity geographical area is obtained by processing the original first granularity geographical area by using a trained text generation model, the original first granularity geographical area and the original second granularity geographical area are obtained by word segmentation of the original address text by using a trained sequence annotation model, and the target first granularity geographical area is correct;
Step S808, the client outputs the target address text.
In the above embodiment of the present application, the original first granularity geographical area and the original second granularity geographical area are obtained by processing the original address text by using the trained sequence labeling model in the following manner: processing the original address text to obtain a first feature vector of the original address text; processing the first feature vector by using a bidirectional long short-term memory (BiLSTM) network to obtain a probability matrix of the original address text, wherein the probability matrix comprises: a probability set for each first feature in the first feature vector, the probability set comprising: a plurality of labels, and a probability value corresponding to each label; processing the probability matrix by using a conditional random field (CRF) to obtain a target labeling sequence of the original address text, wherein the target labeling sequence comprises: a target label for each first feature; and segmenting the original address text based on the target labeling sequence to obtain the original first granularity geographical area and the original second granularity geographical area.
In the above embodiment of the present application, the trained sequence annotation model is obtained by the following method: acquiring a plurality of first address texts, wherein each first address text is correct; generating a plurality of first training data based on the plurality of first address texts, wherein each first training data comprises: the second address text corresponding to each first address text and the labeling sequence of the second address text, and the second address text has errors; and training the sequence labeling model by using a plurality of first training data to obtain a trained sequence labeling model.
In the above embodiment of the present application, the plurality of first training data is generated based on the plurality of first address texts by: filtering the first address text based on the first granularity geographical area vocabulary to obtain a first granularity geographical area and a second granularity geographical area; performing noise processing on the first granularity geographical area and the second granularity geographical area respectively to obtain a processed first granularity geographical area and a processed second granularity geographical area; and generating the second address text based on the processed first granularity geographical area and the processed second granularity geographical area.
In the above embodiment of the present application, the target first granularity geographical area is obtained by processing the original first granularity geographical area by using the trained text generation model in the following manner: processing the original first granularity geographical area to obtain a second feature vector of the original first granularity geographical area; processing the second feature vector by using an encoder to obtain a target vector of the original first granularity geographic area; and processing the target vector and the historical output result corresponding to the target vector by using a decoder to obtain the target first granularity geographical region.
In the above embodiment of the present application, the trained text generation model is obtained by: obtaining a first granularity geographical area vocabulary, wherein the first granularity geographical area vocabulary comprises: a plurality of first granularity geographical area words, each first granularity geographical area word being correct; generating a plurality of second training data based on the first granularity geographical region vocabulary, wherein each second training data comprises: positive and negative samples of the first granularity geographic area, the positive sample being correct and there being an error in the negative sample; and training the text generation model by using the plurality of second training data to obtain a trained text generation model.
In the above embodiment of the present application, the plurality of second training data is generated based on the first granularity geographical area vocabulary by: permuting and combining the plurality of first granularity geographical area words to obtain the positive sample; performing noise processing on the positive sample to obtain the negative sample; and generating the second training data based on the positive sample and the negative sample.
It should be noted that the preferred implementations of the foregoing embodiment are the same as the scheme, application scenario, and implementation process provided in embodiment 1, but are not limited to the scheme provided in embodiment 1.
Example 5
According to an embodiment of the present application, there is also provided a data processing apparatus for implementing the above data processing method, as shown in fig. 9, the apparatus 900 includes: a first acquisition module 902, a first processing module 904, a second processing module 906, and a first generation module 908.
The first obtaining module 902 is configured to obtain an original address text, where an error exists in the original address text; the first processing module 904 is configured to perform word segmentation processing on the original address text by using the trained sequence labeling model, so as to obtain an original first granularity geographical area and an original second granularity geographical area; the second processing module 906 is configured to process the original first granularity geographical area by using the trained text generation model to obtain a target first granularity geographical area, where the target first granularity geographical area is correct; the first generation module 908 is configured to generate target address text based on the target first granularity geographical area and the original second granularity geographical area.
Here, it should be noted that the first obtaining module 902, the first processing module 904, the second processing module 906, and the first generating module 908 correspond to steps S202 to S208 in embodiment 1, and the four modules are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to those disclosed in embodiment 1. It should be noted that the above-described module may be operated as a part of the apparatus in the computer terminal 10 provided in embodiment 1.
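The way the four modules cooperate can be sketched as a small pipeline object. The two models are passed in as plain callables, and the target address text is formed by simple concatenation; both are assumptions of this sketch, and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class AddressCorrector:
    # Stands in for the first processing module (trained sequence labeling model):
    # splits the address into (first granularity area, second granularity area).
    segment: Callable[[str], Tuple[str, str]]
    # Stands in for the second processing module (trained text generation model):
    # maps a possibly wrong first granularity area to a corrected one.
    correct: Callable[[str], str]

    def run(self, original_address: str) -> str:
        """Acquire -> segment -> correct first granularity area -> generate."""
        original_g1, original_g2 = self.segment(original_address)
        target_g1 = self.correct(original_g1)
        # Assumed combination rule: corrected area followed by the untouched rest.
        return target_g1 + original_g2
```

Only the first granularity part passes through the correction model; the second granularity part is carried over unchanged, as in the module description above.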
In the above embodiment of the present application, the first processing module includes: a first processing unit, configured to process the original address text to obtain a first feature vector of the original address text; a second processing unit, configured to process the first feature vector by using a bidirectional long short-term memory (BiLSTM) network to obtain a probability matrix of the original address text, wherein the probability matrix comprises: a probability set for each first feature in the first feature vector, the probability set comprising: a plurality of labels, and a probability value corresponding to each label; a third processing unit, configured to process the probability matrix by using a conditional random field (CRF) to obtain a target labeling sequence of the original address text, wherein the target labeling sequence comprises: a target label for each first feature; and a dividing unit, configured to segment the original address text based on the target labeling sequence to obtain the original first granularity geographical area and the original second granularity geographical area.
In the above embodiment of the present application, the apparatus further includes: the second acquisition module is used for acquiring a plurality of first address texts, wherein each first address text is correct; the second generating module is used for generating a plurality of first training data based on a plurality of first address texts, wherein each first training data comprises: the second address text corresponding to each first address text and the labeling sequence of the second address text, and the second address text has errors; and the first training module is used for training the sequence annotation model by utilizing the plurality of first training data to obtain a trained sequence annotation model.
In the above embodiment of the present application, the second generating module includes: a filtering unit, configured to filter the first address text based on the first granularity geographical area vocabulary to obtain a first granularity geographical area and a second granularity geographical area; a fourth processing unit, configured to perform noise processing on the first granularity geographical area and the second granularity geographical area respectively to obtain a processed first granularity geographical area and a processed second granularity geographical area; and a first generation unit, configured to generate the second address text based on the processed first granularity geographical area and the processed second granularity geographical area.
In the above embodiment of the present application, the second processing module includes: the fifth processing unit is used for processing the original first granularity geographical area to obtain a second feature vector of the original first granularity geographical area; a sixth processing unit, configured to process the second feature vector by using the encoder, to obtain a target vector of the original first granularity geographical area; and the seventh processing unit is used for processing the target vector and the historical output result corresponding to the target vector by using the decoder to obtain the target first granularity geographical region.
In the above embodiment of the present application, the apparatus further includes: the third obtaining module is configured to obtain a first granularity geographical area vocabulary, where the first granularity geographical area vocabulary includes: a plurality of first granularity geographical area words, each first granularity geographical area word being correct; a third generation module, configured to generate a plurality of second training data based on the first granularity geographical region vocabulary, where each second training data includes: positive and negative samples of the first granularity geographic area, the positive sample being correct and there being an error in the negative sample; and the second training module is used for training the text generation model by utilizing a plurality of second training data to obtain a trained text generation model.
In the above embodiment of the present application, the third generating module includes: a combination unit, configured to permute and combine the plurality of first granularity geographical area words to obtain the positive sample; an eighth processing unit, configured to perform noise processing on the positive sample to obtain the negative sample; and a second generation unit, configured to generate the second training data based on the positive sample and the negative sample.
It should be noted that the preferred implementations of the foregoing embodiment are the same as the scheme, application scenario, and implementation process provided in embodiment 1, but are not limited to the scheme provided in embodiment 1.
Example 6
According to an embodiment of the present application, there is also provided a data processing apparatus for implementing the above data processing method, as shown in fig. 10, the apparatus 1000 includes: a first acquisition module 1002, a first processing module 1004, a second processing module 1006, a first generation module 1008, and a transmission module 1010.
The first obtaining module 1002 is configured to obtain an original address text sent by the client, where an error exists in the original address text; the first processing module 1004 is configured to perform word segmentation processing on the original address text by using a trained sequence labeling model, so as to obtain an original first granularity geographical area and an original second granularity geographical area; the second processing module 1006 is configured to process the original first granularity geographical area by using the trained text generation model to obtain a target first granularity geographical area, where the target first granularity geographical area is correct; the first generation module 1008 is configured to generate a target address text based on the target first granularity geographical area and the original second granularity geographical area; the sending module 1010 is configured to send the target address text to the client.
Here, the first obtaining module 1002, the first processing module 1004, the second processing module 1006, the first generating module 1008, and the sending module 1010 correspond to steps S502 to S510 in embodiment 2, and the five modules are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to those disclosed in embodiment 1. It should be noted that the above-described module may be operated as a part of the apparatus in the computer terminal 10 provided in embodiment 1.
In the above embodiment of the present application, the first acquisition module includes: the receiving unit is used for receiving the image which is sent by the client and contains the original address text; and the identification unit is used for carrying out image identification on the image to obtain an original address text.
In the above embodiment of the present application, the first processing module includes: a first processing unit, configured to process the original address text to obtain a first feature vector of the original address text; a second processing unit, configured to process the first feature vector by using a bidirectional long short-term memory (BiLSTM) network to obtain a probability matrix of the original address text, wherein the probability matrix comprises: a probability set for each first feature in the first feature vector, the probability set comprising: a plurality of labels, and a probability value corresponding to each label; a third processing unit, configured to process the probability matrix by using a conditional random field (CRF) to obtain a target labeling sequence of the original address text, wherein the target labeling sequence comprises: a target label for each first feature; and a dividing unit, configured to segment the original address text based on the target labeling sequence to obtain the original first granularity geographical area and the original second granularity geographical area.
In the above embodiment of the present application, the apparatus further includes: the second acquisition module is used for acquiring a plurality of first address texts, wherein each first address text is correct; the second generating module is used for generating a plurality of first training data based on a plurality of first address texts, wherein each first training data comprises: the second address text corresponding to each first address text and the labeling sequence of the second address text, and the second address text has errors; and the first training module is used for training the sequence annotation model by utilizing the plurality of first training data to obtain a trained sequence annotation model.
In the above embodiment of the present application, the second generating module includes: a filtering unit, configured to filter the first address text based on the first granularity geographical area vocabulary to obtain a first granularity geographical area and a second granularity geographical area; a fourth processing unit, configured to perform noise processing on the first granularity geographical area and the second granularity geographical area respectively to obtain a processed first granularity geographical area and a processed second granularity geographical area; and a first generation unit, configured to generate the second address text based on the processed first granularity geographical area and the processed second granularity geographical area.
In the above embodiment of the present application, the second processing module includes: the fifth processing unit is used for processing the original first granularity geographical area to obtain a second feature vector of the original first granularity geographical area; a sixth processing unit, configured to process the second feature vector by using the encoder, to obtain a target vector of the original first granularity geographical area; and the seventh processing unit is used for processing the target vector and the historical output result corresponding to the target vector by using the decoder to obtain the target first granularity geographical region.
In the above embodiment of the present application, the apparatus further includes: the third obtaining module is configured to obtain a first granularity geographical area vocabulary, where the first granularity geographical area vocabulary includes: a plurality of first granularity geographical area words, each first granularity geographical area word being correct; a third generation module, configured to generate a plurality of second training data based on the first granularity geographical region vocabulary, where each second training data includes: positive and negative samples of the first granularity geographic area, the positive sample being correct and there being an error in the negative sample; and the second training module is used for training the text generation model by utilizing a plurality of second training data to obtain a trained text generation model.
In the above embodiment of the present application, the third generating module includes: a combination unit, configured to permute and combine the plurality of first granularity geographical area words to obtain the positive sample; an eighth processing unit, configured to perform noise processing on the positive sample to obtain the negative sample; and a second generation unit, configured to generate the second training data based on the positive sample and the negative sample.
It should be noted that the preferred implementations of the foregoing embodiment are the same as the scheme, application scenario, and implementation process provided in embodiment 1, but are not limited to the scheme provided in embodiment 1.
Example 7
There is further provided, according to an embodiment of the present application, a data processing apparatus for implementing the above data processing method, as shown in fig. 11, the apparatus 1100 including: a triggering module 1102, an acquisition module 1104, a communication module 1106, and an output module 1108.
The triggering module 1102 is configured to trigger the client to generate a processing instruction; the acquiring module 1104 is configured to acquire an original address text based on the processing instruction, where an error exists in the original address text; the communication module 1106 is configured to send the original address text to the server and receive a target address text returned by the server, where the target address text is generated based on a target first granularity geographical area and an original second granularity geographical area, the target first granularity geographical area is obtained by processing the original first granularity geographical area by using a trained text generation model, the original first granularity geographical area and the original second granularity geographical area are obtained by word segmentation of the original address text by using a trained sequence labeling model, and the target first granularity geographical area is correct; the output module 1108 is configured to output the target address text.
Here, the triggering module 1102, the acquiring module 1104, the communication module 1106 and the output module 1108 correspond to steps S602 to S608 in embodiment 3, and the four modules are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to those disclosed in embodiment 1. It should be noted that the above-described module may be operated as a part of the apparatus in the computer terminal 10 provided in embodiment 1.
In the above embodiment of the present application, the obtaining module includes: the acquisition unit is used for the client to acquire an image containing the original address text based on the processing instruction; the identification unit is used for carrying out image identification on the image by the client to obtain an original address text.
It should be noted that the preferred implementations of the foregoing embodiment are the same as the scheme, application scenario, and implementation process provided in embodiment 1, but are not limited to the scheme provided in embodiment 1.
Example 8
There is further provided, according to an embodiment of the present application, a data processing apparatus for implementing the above data processing method, as shown in fig. 12, the apparatus 1200 including: a triggering module 1202, an acquisition module 1204, a communication module 1206, and an output module 1208.
The triggering module 1202 is configured to trigger the client to generate a processing instruction; the obtaining module 1204 is configured to obtain an image including an original address text based on the processing instruction, where an error exists in the original address text; the communication module 1206 is configured to send an image to the server and receive a target address text returned by the server, where the image is identified by the server, the target address text is generated based on a target first granularity geographical area and an original second granularity geographical area, the target first granularity geographical area is obtained by processing the original first granularity geographical area by using a trained text generation model, the original first granularity geographical area and the original second granularity geographical area are obtained by word segmentation of the original address text by using a trained sequence labeling model, and the target first granularity geographical area is correct; the output module 1208 is configured to output the target address text.
Here, the triggering module 1202, the acquiring module 1204, the communication module 1206 and the output module 1208 correspond to steps S802 to S808 in embodiment 4, and the four modules are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to those disclosed in embodiment 1. It should be noted that the above-described module may be operated as a part of the apparatus in the computer terminal 10 provided in embodiment 1.
It should be noted that the preferred implementations of the foregoing embodiment are the same as the scheme, application scenario, and implementation process provided in embodiment 1, but are not limited to the scheme provided in embodiment 1.
Example 9
According to an embodiment of the present application, there is also provided a data processing system including:
A processor; and
A memory, coupled to the processor, for providing instructions to the processor for processing the steps of: acquiring an original address text, wherein an error exists in the original address text; performing word segmentation processing on the original address text by using the trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area; processing the original first granularity geographical area by using the trained text generation model to obtain a target first granularity geographical area, wherein the target first granularity geographical area is correct; a target address text is generated based on the target first granularity geographical region and the original second granularity geographical region.
It should be noted that the preferred implementations of the foregoing embodiment are the same as the scheme, application scenario, and implementation process provided in embodiment 1, but are not limited to the scheme provided in embodiment 1.
Example 10
According to an embodiment of the present application, there is also provided a data processing method. It should be noted that the steps shown in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one shown or described herein.
Fig. 13 is a flowchart of a data processing method according to embodiment 10 of the present application. As shown in fig. 13, the method may include the steps of:
Step S1302, obtaining original data;
Optionally, the original data may be one of the following: an address text or an image. For an original address text, the scheme provided in this embodiment is the same as that in embodiment 1 and is not described again here.
Step S1304, processing the original data by using the trained first model to obtain original first granularity data and original second granularity data;
In an alternative embodiment, the images may be classified using the trained first model to obtain major and minor classes of the images. For example, for an image, its major class may be determined to be apparel, and its minor class may be further determined to be women's apparel. Or for an image, the major class of the image can be determined to be women's clothing, and the minor class can be further determined to be down jackets.
Step S1306, processing the original first granularity data by using the trained second model to obtain target first granularity data;
In the application scenario of image classification, classification of the minor classes needs to be realized on the basis of the major class classification, and in order to ensure accuracy of image classification results, in an alternative embodiment, after the major class classification is performed on the image, the major classes can be corrected by using a trained second model.
It should be noted that the second model may be further nested with a first sub-model and a second sub-model, where an implementation of the first sub-model is the same as the first model, and an implementation of the second sub-model is the same as the second model. For example, after error correction of the major class, the minor class of the image may be re-identified based on the error corrected major class to determine a final classification result.
Step S1308, generating target data based on the target first granularity data and the original second granularity data.
In an alternative embodiment, for an image, after error correction of the major classes, the final classification result of the image may be determined by combining the error corrected major and minor classes.
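The image classification variant of steps S1302 to S1308 can be sketched as follows. The three callables are hypothetical stand-ins: `first_model` for the trained first model, `second_model` for the trained second model, and `minor_given_major` for the optional re-identification of the minor class under the corrected major class. Re-running the minor classification only when the major class changed is a design choice of this sketch, not something stated in the present application.

```python
from typing import Any, Callable, Tuple

def classify_with_correction(
    image: Any,
    first_model: Callable[[Any], Tuple[str, str]],
    second_model: Callable[[str], str],
    minor_given_major: Callable[[Any, str], str],
) -> Tuple[str, str]:
    """Return the final (major class, minor class) for an image."""
    # S1304: coarse and fine classification by the first model.
    original_major, original_minor = first_model(image)
    # S1306: error correction of the major class by the second model.
    target_major = second_model(original_major)
    # Optional re-identification of the minor class if the major class changed.
    minor = original_minor
    if target_major != original_major:
        minor = minor_given_major(image, target_major)
    # S1308: combine the corrected major class with the minor class.
    return target_major, minor
```

When the second model leaves the major class unchanged, the function reduces to combining the original major and minor classes.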
Based on the scheme provided by this embodiment of the present application, after the original data is obtained, the trained first model is first used to process the original data to obtain the original first granularity data and the original second granularity data. The trained second model is then used to process the original first granularity data to obtain the error-corrected target first granularity data, and the target data is obtained by combining the target first granularity data and the original second granularity data, thereby achieving the purpose of data error correction. Compared with the related art, the administrative division can be extracted through sequence labeling, and the noisy administrative division can then be converted into a clean administrative division through text generation. The original address text is not required to contain a clean sub-address, and the entire administrative division text can be corrected, which improves the accuracy of address correction. This solves the technical problem in the related art that the data processing method relies on looking up a clean address text in an address library, so that processing accuracy is low when the address library does not contain the corresponding clean address text.
Example 11
Embodiments of the present application may provide a computer terminal, which may be any one of a group of computer terminals. Alternatively, in the present embodiment, the above-described computer terminal may be replaced with a terminal device such as a mobile terminal.
Alternatively, in this embodiment, the above-mentioned computer terminal may be located in at least one network device among a plurality of network devices of the computer network.
In this embodiment, the above-mentioned computer terminal may execute the program code of the following steps in the data processing method: acquiring an original address text, wherein an error exists in the original address text; performing word segmentation processing on the original address text by using the trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area; processing the original first granularity geographical area by using the trained text generation model to obtain a target first granularity geographical area, wherein the target first granularity geographical area is correct; a target address text is generated based on the target first granularity geographical region and the original second granularity geographical region.
Optionally, Fig. 14 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in Fig. 14, the computer terminal A may include one or more processors 1402 (only one is shown) and a memory 1404.
The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the data processing methods and apparatuses in the embodiments of the present application, and the processor executes the software programs and modules stored in the memory, thereby executing various functional applications and data processing, that is, implementing the data processing methods described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located with respect to the processor, which may be connected to the computer terminal A through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: acquiring an original address text, wherein an error exists in the original address text; performing word segmentation processing on the original address text by using the trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area; processing the original first granularity geographical area by using the trained text generation model to obtain a target first granularity geographical area, wherein the target first granularity geographical area is correct; a target address text is generated based on the target first granularity geographical region and the original second granularity geographical region.
Optionally, the above processor may further execute program code for: processing the original address text to obtain a first feature vector of the original address text; processing the first feature vector by using a bidirectional long short-term memory (BiLSTM) network to obtain a probability matrix of the original address text, wherein the probability matrix comprises: a probability set for each first feature in the first feature vector, the probability set comprising: a plurality of labels, and a probability value corresponding to each label; processing the probability matrix by using a conditional random field to obtain a target labeling sequence of the original address text, wherein the target labeling sequence comprises: a target label for each first feature; and dividing the original address text based on the target labeling sequence to obtain an original first granularity geographical area and an original second granularity geographical area.
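As a non-limiting illustration of the last two steps above, the following sketch decodes a target labeling sequence from a probability matrix with Viterbi decoding, which is the usual inference procedure for a linear-chain conditional random field, and then divides the text by label. The label names `ADMIN` and `DETAIL`, the hand-written probabilities, and the transition values are assumptions made for this sketch; in the embodiments the probability matrix is produced by the trained network and the transition scores are learned.

```python
import math
from typing import Dict, List, Tuple

LABELS = ["ADMIN", "DETAIL"]  # hypothetical label set


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the highest-scoring label sequence (Viterbi decoding in log space)."""
    if not emissions:
        return []
    # score[label] = best log-score of any path that ends in `label`
    score = {lab: math.log(emissions[0][lab]) for lab in LABELS}
    backpointers: List[Dict[str, str]] = []
    for probs in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in LABELS:
            best_prev = max(
                LABELS,
                key=lambda prev: score[prev] + math.log(transitions[(prev, cur)]),
            )
            new_score[cur] = (score[best_prev]
                              + math.log(transitions[(best_prev, cur)])
                              + math.log(probs[cur]))
            pointers[cur] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(LABELS, key=lambda lab: score[lab])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def split_by_labels(tokens: List[str], labels: List[str]) -> Tuple[str, str]:
    """Divide the text into first and second granularity parts by label."""
    admin = "".join(t for t, lab in zip(tokens, labels) if lab == "ADMIN")
    detail = "".join(t for t, lab in zip(tokens, labels) if lab == "DETAIL")
    return admin, detail


if __name__ == "__main__":
    tokens = ["Zhejiang Province ", "Hangzhou City ", "Wenyi Road ", "No. 1"]
    emissions = [
        {"ADMIN": 0.9, "DETAIL": 0.1},
        {"ADMIN": 0.8, "DETAIL": 0.2},
        {"ADMIN": 0.3, "DETAIL": 0.7},
        {"ADMIN": 0.6, "DETAIL": 0.4},  # per-token argmax would wrongly pick ADMIN here
    ]
    transitions = {
        ("ADMIN", "ADMIN"): 0.5, ("ADMIN", "DETAIL"): 0.5,
        ("DETAIL", "ADMIN"): 0.01, ("DETAIL", "DETAIL"): 0.99,
    }
    labels = viterbi(emissions, transitions)
    print(labels, split_by_labels(tokens, labels))
```

In this toy input the last token, taken alone, is more likely `ADMIN`, but the low `DETAIL` to `ADMIN` transition value makes the decoder prefer a consistent sequence. This is the kind of sequence-level constraint that a conditional random field layer adds on top of per-position probabilities.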
Optionally, the above processor may further execute program code for: acquiring a plurality of first address texts, wherein each first address text is correct; generating a plurality of first training data based on the plurality of first address texts, wherein each first training data comprises: the second address text corresponding to each first address text and the labeling sequence of the second address text, and the second address text has errors; and training the sequence labeling model by using a plurality of first training data to obtain a trained sequence labeling model.
Optionally, the above processor may further execute program code for: filtering the first address text based on the first granularity geographical region vocabulary to obtain a first granularity geographical region and a second granularity geographical region; noise processing is carried out on the first granularity geographical area and the second granularity geographical area respectively, so that a processed first granularity geographical area and a processed second granularity geographical area are obtained; and generating a second address text based on the processed first granularity geographical region and the processed second granularity geographical region.
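As a non-limiting illustration, the generation of first training data described above can be sketched as follows: a correct address is filtered against a vocabulary to separate the two granularities, noise is applied to each part, and the noisy text is emitted together with its labeling sequence. The vocabulary contents, the token-level representation, the two noise operations (character deletion and adjacent transposition), and the `noise_rate` parameter are all assumptions made for this sketch; the application does not specify them here.

```python
import random
from typing import List, Tuple

ADMIN_VOCAB = ["Zhejiang", "Hangzhou", "Xihu"]  # hypothetical first granularity vocabulary


def split_by_vocab(address_tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Leading tokens found in the vocabulary form the first granularity region."""
    i = 0
    while i < len(address_tokens) and address_tokens[i] in ADMIN_VOCAB:
        i += 1
    return address_tokens[:i], address_tokens[i:]


def add_noise(token: str, rng: random.Random) -> str:
    """Delete one character or swap two adjacent characters (illustrative noise only)."""
    if len(token) < 2:
        return token
    pos = rng.randrange(len(token) - 1)
    if rng.random() < 0.5:
        return token[:pos] + token[pos + 1:]  # deletion
    return token[:pos] + token[pos + 1] + token[pos] + token[pos + 2:]  # transposition


def make_training_example(address_tokens: List[str], rng: random.Random,
                          noise_rate: float = 0.5) -> Tuple[List[str], List[str]]:
    """Build one (noisy address, labeling sequence) pair from a correct address."""
    admin, detail = split_by_vocab(address_tokens)
    noisy_tokens: List[str] = []
    labels: List[str] = []
    for label, part in (("ADMIN", admin), ("DETAIL", detail)):
        for tok in part:
            noisy_tokens.append(add_noise(tok, rng) if rng.random() < noise_rate else tok)
            labels.append(label)
    return noisy_tokens, labels


if __name__ == "__main__":
    rng = random.Random(0)
    print(make_training_example(["Zhejiang", "Hangzhou", "Wenyi", "Road"], rng))
```

Because the labels are assigned before the noise is applied, the labeling sequence stays correct even though the text it describes contains errors, which is what allows the sequence labeling model to be trained on erroneous input.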
Optionally, the above processor may further execute program code for: processing the original first granularity geographical area to obtain a second feature vector of the original first granularity geographical area; processing the second feature vector by using an encoder to obtain a target vector of the original first granularity geographic area; and processing the target vector and the historical output result corresponding to the target vector by using a decoder to obtain the target first granularity geographical region.
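As a non-limiting illustration of decoding from a target vector together with the historical output result, the following sketch shows a greedy autoregressive decoding loop. The encoder and decoder here are toy stand-ins (a character-count vector and a rule that picks the closest clean name from a small hypothetical list); they are assumptions made only so that the loop can run, and they are not the trained encoder and decoder of the embodiments.

```python
from collections import Counter
from typing import Callable, List

BOS, EOS = "<s>", "</s>"
CLEAN_NAMES = ["Hangzhou", "Guangzhou", "Suzhou"]  # hypothetical clean vocabulary


def toy_encoder(noisy: str) -> Counter:
    """Toy stand-in for the encoder: a bag-of-characters 'target vector'."""
    return Counter(noisy)


def toy_decoder_step(context: Counter, history: List[str]) -> str:
    """Toy stand-in for one decoder step: emit the next character given context and history."""
    prefix = "".join(history[1:])  # drop the BOS marker
    candidates = [n for n in CLEAN_NAMES if n.startswith(prefix)]
    if not candidates:
        return EOS

    def overlap(name: str) -> int:
        shared = sum((Counter(name) & context).values())
        return shared - abs(len(name) - sum(context.values()))

    best = max(candidates, key=overlap)
    return best[len(prefix)] if len(prefix) < len(best) else EOS


def greedy_decode(context: Counter,
                  decoder_step: Callable[[Counter, List[str]], str],
                  max_len: int = 20) -> List[str]:
    """Repeatedly feed the context and all previous outputs back into the decoder."""
    history = [BOS]
    while len(history) <= max_len:
        nxt = decoder_step(context, history)
        if nxt == EOS:
            break
        history.append(nxt)
    return history[1:]


if __name__ == "__main__":
    print("".join(greedy_decode(toy_encoder("Hangzhuo"), toy_decoder_step)))
```

The part that corresponds to the described step is `greedy_decode`: every output depends on both the encoder result and the outputs already produced. A trained decoder would replace `toy_decoder_step` and could use beam search instead of greedy selection.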
Optionally, the above processor may further execute program code for: obtaining a first granularity geographical area vocabulary, wherein the first granularity geographical area vocabulary comprises: a plurality of first granularity geographical area words, each first granularity geographical area word being correct; generating a plurality of second training data based on the first granularity geographical area vocabulary, wherein each second training data comprises: positive and negative samples of the first granularity geographical area, the positive sample being correct and there being an error in the negative sample; and training the text generation model by using the plurality of second training data to obtain a trained text generation model.
Optionally, the above processor may further execute program code for: arranging and combining a plurality of words in a geographic area with a first granularity to obtain a positive sample; noise processing is carried out on the positive sample to obtain a negative sample; second training data is generated based on the positive and negative samples.
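As a non-limiting illustration, the construction of second training data described above can be sketched as follows. Here "arranging and combining" is interpreted as emitting each administrative level with or without its suffix, and the noise is a single character deletion; both choices, along with the example words, are assumptions made for this sketch.

```python
import itertools
import random
from typing import List, Tuple


def positive_samples(words: List[Tuple[str, str]]) -> List[str]:
    """Combine each (name, suffix) level with or without its suffix, over all levels."""
    per_level = [(name, f"{name} {suffix}") for name, suffix in words]
    return [" ".join(combo) for combo in itertools.product(*per_level)]


def negative_sample(positive: str, rng: random.Random) -> str:
    """Corrupt a positive sample by deleting one character (illustrative noise only)."""
    pos = rng.randrange(len(positive))
    return positive[:pos] + positive[pos + 1:]


def second_training_data(words: List[Tuple[str, str]],
                         rng: random.Random) -> List[Tuple[str, str]]:
    """Return (negative sample, positive sample) pairs for training a text generation model."""
    return [(negative_sample(p, rng), p) for p in positive_samples(words)]


if __name__ == "__main__":
    levels = [("Zhejiang", "Province"), ("Hangzhou", "City")]
    for noisy, clean in second_training_data(levels, random.Random(0)):
        print(repr(noisy), "->", repr(clean))
```

Each pair places the negative sample on the input side and the positive sample on the output side, so a model trained on such pairs learns to map an erroneous administrative division to a correct one.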
By adopting the embodiment of the application, a processing scheme for address text is provided. The administrative division is extracted through sequence labeling, and a clean administrative division is then generated from the noisy administrative division through text generation. The original address text is not required to contain a clean sub-address, and the whole administrative division text can be corrected. This improves the accuracy of address error correction and solves the technical problem in the related art that, because the data processing method is implemented by searching an address library for the clean address text, processing accuracy is low when the address library does not contain the corresponding clean address text.
The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: acquiring an original address text sent by a client, wherein an error exists in the original address text; performing word segmentation processing on the original address text by using the trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area; processing the original first granularity geographical area by using the trained text generation model to obtain a target first granularity geographical area, wherein the target first granularity geographical area is correct; generating a target address text based on the target first granularity geographical area and the original second granularity geographical area; and sending the target address text to the client.
Optionally, the above processor may further execute program code for: receiving an image which is sent by a client and contains an original address text; and carrying out image recognition on the image to obtain an original address text.
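As a non-limiting illustration of the server-side variant described above, the following sketch accepts either an address text or an image from a client, performs image recognition when only an image is supplied, and returns the corrected text. The `Request` type, the `ocr` and `correct` callables, and the stub implementations in the usage example are hypothetical; no particular recognition library or transport is implied.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    """Hypothetical request sent by a client: address text or an image containing it."""
    text: Optional[str] = None
    image: Optional[bytes] = None


def handle_request(req: Request,
                   ocr: Callable[[bytes], str],
                   correct: Callable[[str], str]) -> str:
    """Obtain the original address text, correct it, and return the target address text."""
    if req.text is not None:
        original = req.text
    elif req.image is not None:
        original = ocr(req.image)  # image recognition yields the original address text
    else:
        raise ValueError("request carries neither text nor image")
    return correct(original)


if __name__ == "__main__":
    # Stubs standing in for image recognition and for the two trained models.
    fake_ocr = lambda data: data.decode("utf-8")
    fake_correct = lambda s: s.replace("Hangzhuo", "Hangzhou")
    print(handle_request(Request(image=b"Hangzhuo City Wenyi Road"), fake_ocr, fake_correct))
```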
The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: triggering a client to generate a processing instruction; acquiring an original address text based on the processing instruction, wherein an error exists in the original address text; sending the original address text to a server, and receiving a target address text returned by the server, wherein the target address text is generated based on a target first granularity geographical area and an original second granularity geographical area, the target first granularity geographical area is obtained by processing the original first granularity geographical area by using a trained text generation model, the original first granularity geographical area and the original second granularity geographical area are obtained by word segmentation processing of the original address text by using a trained sequence labeling model, and the target first granularity geographical area is correct; and outputting the target address text.
Optionally, the above processor may further execute program code for: acquiring an image containing an original address text based on a processing instruction; and carrying out image recognition on the image to obtain an original address text.
The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: triggering a client to generate a processing instruction; acquiring an image containing an original address text based on the processing instruction, wherein an error exists in the original address text; sending the image to a server, and receiving a target address text returned by the server, wherein the server performs image recognition on the image to obtain the original address text, the target address text is generated based on a target first granularity geographical area and an original second granularity geographical area, the target first granularity geographical area is obtained by processing the original first granularity geographical area by using a trained text generation model, the original first granularity geographical area and the original second granularity geographical area are obtained by word segmentation processing of the original address text by using a trained sequence labeling model, and the target first granularity geographical area is correct; and outputting the target address text.
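As a non-limiting illustration, the client-side variants described above (sending text, recognizing an image locally and sending the text, or sending the image so that the server performs the recognition) can be sketched in one function. The payload layout, the `send` callable, and the `local_ocr` parameter are assumptions made for this sketch.

```python
from typing import Callable, Optional


def run_client(send: Callable[[dict], str],
               text: Optional[str] = None,
               image: Optional[bytes] = None,
               local_ocr: Optional[Callable[[bytes], str]] = None) -> str:
    """Build a payload from the processing instruction's input and return the server's reply."""
    if text is not None:
        payload = {"text": text}
    elif image is not None and local_ocr is not None:
        payload = {"text": local_ocr(image)}  # image recognition performed on the client
    elif image is not None:
        payload = {"image": image}            # image recognition left to the server
    else:
        raise ValueError("no address text or image available")
    return send(payload)


if __name__ == "__main__":
    def fake_server(payload: dict) -> str:
        """Stub server: decodes images itself and applies a toy correction."""
        original = payload["text"] if "text" in payload else payload["image"].decode("utf-8")
        return original.replace("Hangzhuo", "Hangzhou")

    print(run_client(fake_server, image=b"Hangzhuo City Wenyi Road"))
```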
The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: acquiring original data; processing the original data by using the trained first model to obtain original first granularity data and original second granularity data; processing the original first granularity data by using the trained second model to obtain target first granularity data; target data is generated based on the target first granularity data and the original second granularity data.
It will be appreciated by those skilled in the art that the configuration shown in Fig. 14 is merely illustrative, and the computer terminal may be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile internet device (MID), a PAD, or a similar terminal device. Fig. 14 does not limit the structure of the electronic device. For example, the computer terminal A may also include more or fewer components (such as a network interface or a display device) than shown in Fig. 14, or have a different configuration from that shown in Fig. 14.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing hardware associated with a terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
Example 12
The embodiment of the application also provides a storage medium. Alternatively, in this embodiment, the storage medium may be used to store the program code executed by the data processing method provided in the first embodiment.
Alternatively, in this embodiment, the storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network, or in any one of the mobile terminals in the mobile terminal group.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: acquiring an original address text, wherein an error exists in the original address text; performing word segmentation processing on the original address text by using the trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area; processing the original first granularity geographical area by using the trained text generation model to obtain a target first granularity geographical area, wherein the target first granularity geographical area is correct; a target address text is generated based on the target first granularity geographical region and the original second granularity geographical region.
Optionally, the above storage medium is further configured to store program code for performing the steps of: processing the original address text to obtain a first feature vector of the original address text; processing the first feature vector by using a bidirectional long short-term memory (BiLSTM) network to obtain a probability matrix of the original address text, wherein the probability matrix comprises: a probability set for each first feature in the first feature vector, the probability set comprising: a plurality of labels, and a probability value corresponding to each label; processing the probability matrix by using a conditional random field to obtain a target labeling sequence of the original address text, wherein the target labeling sequence comprises: a target label for each first feature; and dividing the original address text based on the target labeling sequence to obtain an original first granularity geographical area and an original second granularity geographical area.
Optionally, the above storage medium is further configured to store program code for performing the steps of: acquiring a plurality of first address texts, wherein each first address text is correct; generating a plurality of first training data based on the plurality of first address texts, wherein each first training data comprises: the second address text corresponding to each first address text and the labeling sequence of the second address text, and the second address text has errors; and training the sequence labeling model by using a plurality of first training data to obtain a trained sequence labeling model.
Optionally, the above storage medium is further configured to store program code for performing the steps of: filtering the first address text based on the first granularity geographical region vocabulary to obtain a first granularity geographical region and a second granularity geographical region; noise processing is carried out on the first granularity geographical area and the second granularity geographical area respectively, so that a processed first granularity geographical area and a processed second granularity geographical area are obtained; and generating a second address text based on the processed first granularity geographical region and the processed second granularity geographical region.
Optionally, the above storage medium is further configured to store program code for performing the steps of: processing the original first granularity geographical area to obtain a second feature vector of the original first granularity geographical area; processing the second feature vector by using an encoder to obtain a target vector of the original first granularity geographic area; and processing the target vector and the historical output result corresponding to the target vector by using a decoder to obtain the target first granularity geographical region.
Optionally, the above storage medium is further configured to store program code for performing the steps of: obtaining a first granularity geographical area vocabulary, wherein the first granularity geographical area vocabulary comprises: a plurality of first granularity geographical area words, each first granularity geographical area word being correct; generating a plurality of second training data based on the first granularity geographical region vocabulary, wherein each second training data comprises: positive and negative samples of the first granularity geographic area, the positive sample being correct and there being an error in the negative sample; and training the text generation model by using the plurality of second training data to obtain a trained text generation model.
Optionally, the above storage medium is further configured to store program code for performing the steps of: arranging and combining a plurality of words in a geographic area with a first granularity to obtain a positive sample; noise processing is carried out on the positive sample to obtain a negative sample; second training data is generated based on the positive and negative samples.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: acquiring an original address text sent by a client, wherein an error exists in the original address text; performing word segmentation processing on the original address text by using the trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area; processing the original first granularity geographical area by using the trained text generation model to obtain a target first granularity geographical area, wherein the target first granularity geographical area is correct; generating a target address text based on the target first granularity geographical area and the original second granularity geographical area; and sending the target address text to the client.
Optionally, the above storage medium is further configured to store program code for performing the steps of: receiving an image which is sent by a client and contains an original address text; and carrying out image recognition on the image to obtain an original address text.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: triggering a client to generate a processing instruction; acquiring an original address text based on the processing instruction, wherein an error exists in the original address text; sending the original address text to a server, and receiving a target address text returned by the server, wherein the target address text is generated based on a target first granularity geographical area and an original second granularity geographical area, the target first granularity geographical area is obtained by processing the original first granularity geographical area by using a trained text generation model, the original first granularity geographical area and the original second granularity geographical area are obtained by word segmentation processing of the original address text by using a trained sequence labeling model, and the target first granularity geographical area is correct; and outputting the target address text.
Optionally, the above storage medium is further configured to store program code for performing the steps of: acquiring an image containing an original address text based on a processing instruction; and carrying out image recognition on the image to obtain an original address text.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: triggering a client to generate a processing instruction; acquiring an image containing an original address text based on the processing instruction, wherein an error exists in the original address text; sending the image to a server, and receiving a target address text returned by the server, wherein the server performs image recognition on the image to obtain the original address text, the target address text is generated based on a target first granularity geographical area and an original second granularity geographical area, the target first granularity geographical area is obtained by processing the original first granularity geographical area by using a trained text generation model, the original first granularity geographical area and the original second granularity geographical area are obtained by word segmentation processing of the original address text by using a trained sequence labeling model, and the target first granularity geographical area is correct; and outputting the target address text.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: acquiring original data; processing the original data by using the trained first model to obtain original first granularity data and original second granularity data; processing the original first granularity data by using the trained second model to obtain target first granularity data; target data is generated based on the target first granularity data and the original second granularity data.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis; for a portion that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other manners. The above-described apparatus embodiments are merely exemplary. For example, the division of the units is merely a logical function division, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed between components may be through some interfaces, units or modules, and may be in electrical or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes to the related art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other various media capable of storing program code.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are also intended to fall within the scope of protection of the present application.
Claims (18)
1. A data processing method, comprising:
acquiring an original address text, wherein an error exists in the original address text;
performing word segmentation processing on the original address text by using a trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area, wherein the original first granularity geographical area is an original administrative division text in the original address text, and the original second granularity geographical area is a detailed address text in the original address text;
processing the original first granularity geographical area by using a trained text generation model to obtain a target first granularity geographical area, wherein the target first granularity geographical area is correct;
and generating a target address text based on the target first granularity geographical area and the original second granularity geographical area.
2. The method of claim 1, wherein performing word segmentation processing on the original address text by using a trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area comprises:
processing the original address text to obtain a first feature vector of the original address text;
processing the first feature vector by using a bidirectional long short-term memory network to obtain a probability matrix of the original address text, wherein the probability matrix comprises: a set of probabilities for each first feature in the first feature vector, the set of probabilities comprising: a plurality of labels, and a probability value corresponding to each label;
processing the probability matrix by using a conditional random field to obtain a target labeling sequence of the original address text, wherein the target labeling sequence comprises: a target tag for each of the first features;
dividing the original address text based on the target labeling sequence to obtain the original first granularity geographical area and the original second granularity geographical area.
3. The method of claim 2, wherein the method further comprises:
acquiring a plurality of first address texts, wherein each first address text is correct;
generating a plurality of first training data based on the plurality of first address texts, wherein each first training data comprises: the second address text corresponding to each first address text and the labeling sequence of the second address text, and there is an error in the second address text;
and training the sequence labeling model by utilizing the plurality of first training data to obtain the trained sequence labeling model.
4. The method of claim 3, wherein generating a plurality of first training data based on the plurality of first address text comprises:
filtering the first address text based on a first granularity geographical area word list to obtain a first granularity geographical area and a second granularity geographical area;
noise processing is carried out on the first granularity geographical area and the second granularity geographical area respectively, so that a processed first granularity geographical area and a processed second granularity geographical area are obtained;
and generating the second address text based on the processed first granularity geographical region and the processed second granularity geographical region.
5. The method of claim 1, wherein processing the original first-granularity geographical region with a trained text generation model to obtain a target first-granularity geographical region comprises:
processing the original first granularity geographical area to obtain a second feature vector of the original first granularity geographical area;
processing the second feature vector by using an encoder to obtain a target vector of the original first granularity geographic area;
and processing the target vector and the historical output result corresponding to the target vector by using a decoder to obtain the target first granularity geographical region.
6. The method of claim 5, wherein the method further comprises:
obtaining a first granularity geographical area vocabulary, wherein the first granularity geographical area vocabulary comprises: a plurality of first granularity geographical area words, each first granularity geographical area word being correct;
generating a plurality of second training data based on the first granularity geographical region vocabulary, wherein each second training data comprises: positive and negative samples of a first granularity geographic area, the positive sample being correct and there being an error in the negative sample;
and training the text generation model by utilizing the plurality of second training data to obtain the trained text generation model.
7. The method of claim 6, wherein generating a plurality of second training data based on the first granularity geographic area vocabulary comprises:
arranging and combining the words in the geographic area with the first granularity to obtain the positive sample;
carrying out noise processing on the positive sample to obtain the negative sample;
and generating the second training data based on the positive and negative samples.
8. The method of claim 6, wherein the type of first granularity geographic area word comprises one of: name, suffix, and minority.
9. A data processing method, comprising:
acquiring an original address text sent by a client, wherein an error exists in the original address text;
performing word segmentation processing on the original address text by using a trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area, wherein the original first granularity geographical area is an original administrative division text in the original address text, and the original second granularity geographical area is a detailed address text in the original address text;
processing the original first granularity geographical area by using a trained text generation model to obtain a target first granularity geographical area, wherein the target first granularity geographical area is correct;
generating a target address text based on the target first granularity geographical area and the original second granularity geographical area;
and sending the target address text to the client.
10. The method of claim 9, wherein obtaining the original address text sent by the client comprises:
receiving an image which is sent by the client and contains the original address text;
and performing image recognition on the image to obtain the original address text.
11. A data processing method, comprising:
triggering a client to generate a processing instruction;
the client acquires an original address text based on the processing instruction, wherein an error exists in the original address text;
the client sends the original address text to a server and receives a target address text returned by the server, wherein the target address text is generated based on a target first granularity geographical area and an original second granularity geographical area, the original first granularity geographical area is an original administrative division text in the original address text, the original second granularity geographical area is a detailed address text in the original address text, the target first granularity geographical area is obtained by processing the original first granularity geographical area by using a trained text generation model, the original first granularity geographical area and the original second granularity geographical area are obtained by word segmentation processing of the original address text by using a trained sequence labeling model, and the target first granularity geographical area is correct;
and the client outputs the target address text.
12. The method of claim 11, wherein the client obtaining the original address text based on the processing instructions comprises:
the client acquires an image containing the original address text based on the processing instruction;
and the client performs image recognition on the image to obtain the original address text.
13. A data processing method, comprising:
triggering a client to generate a processing instruction;
the client acquires an image containing an original address text based on the processing instruction, wherein an error exists in the original address text;
the client sends the image to a server and receives a target address text returned by the server, wherein the image is recognized by the server to obtain the original address text, the target address text is generated based on a target first granularity geographical area and an original second granularity geographical area, the original first granularity geographical area is an original administrative division text in the original address text, the original second granularity geographical area is a detailed address text in the original address text, the target first granularity geographical area is obtained by processing the original first granularity geographical area by using a trained text generation model, the original first granularity geographical area and the original second granularity geographical area are obtained by word segmentation processing of the original address text by using a trained sequence labeling model, and the target first granularity geographical area is correct;
and the client outputs the target address text.
14. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the data processing method of any one of claims 1 to 13.
15. A computing device, comprising: a memory and a processor, the processor being configured to execute a program stored in the memory, wherein the program, when executed, performs the data processing method of any one of claims 1 to 13.
16. A data processing system, comprising:
a processor; and
a memory, coupled to the processor, for providing the processor with instructions to perform the following processing steps: acquiring an original address text, wherein an error exists in the original address text; performing word segmentation processing on the original address text by using a trained sequence labeling model to obtain an original first granularity geographical area and an original second granularity geographical area, wherein the original first granularity geographical area is an original administrative division text in the original address text, and the original second granularity geographical area is a detailed address text in the original address text; processing the original first granularity geographical area by using a trained text generation model to obtain a target first granularity geographical area, wherein the target first granularity geographical area is correct; and generating a target address text based on the target first granularity geographical area and the original second granularity geographical area.
17. A data processing method, comprising:
acquiring original data;
processing the original data by using a trained first model to obtain original first granularity data and original second granularity data, wherein the original first granularity data is a major class of the original data, and the original second granularity data is a minor class of the original data;
processing the original first granularity data by using a trained second model to obtain target first granularity data;
and generating target data based on the target first granularity data and the original second granularity data.
18. The method of claim 17, wherein the original data comprises one of: an address text and an image.
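The flow recited in claims 6, 7 and 9 can be illustrated with a short Python sketch. It is only a toy under stated assumptions, not the patented implementation: the trained sequence labeling model is replaced by a simple suffix-based splitter, the trained text generation model is replaced by a nearest-match lookup using the standard library `difflib`, and the vocabulary, the romanized space-separated address format, and all function names (`make_training_pairs`, `segment`, `correct`, `fix_address`) are hypothetical.

```python
import difflib
import itertools
import random

# Hypothetical first-granularity vocabulary (claim 6), split by word type (claim 8).
NAMES = ["Hangzhou", "Ningbo"]
SUFFIXES = ["City", "County"]
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def make_training_pairs(names, suffixes, seed=0):
    """Claim 7 sketch: combine words into positive samples, then add noise
    (one substituted character) to each positive sample to get a negative one."""
    rng = random.Random(seed)
    pairs = []
    for name, suffix in itertools.product(names, suffixes):
        positive = f"{name} {suffix}"
        i = rng.randrange(len(name))  # corrupt one character inside the name
        pool = [c for c in LETTERS if c != positive[i].lower()]
        negative = positive[:i] + rng.choice(pool) + positive[i + 1:]
        pairs.append((negative, positive))
    return pairs


def segment(address):
    """Stand-in for the sequence labeling model (assumption): everything up to
    and including the first known suffix is the administrative division."""
    tokens = address.split()
    for k, token in enumerate(tokens):
        if token in SUFFIXES:
            return " ".join(tokens[:k + 1]), " ".join(tokens[k + 1:])
    return "", address


def correct(division, vocabulary):
    """Stand-in for the text generation model (assumption): pick the closest
    correct vocabulary entry, or keep the input if nothing is similar enough."""
    matches = difflib.get_close_matches(division, vocabulary, n=1)
    return matches[0] if matches else division


def fix_address(address, vocabulary):
    """Claim 9 sketch: segment, correct the division, keep the detail as-is."""
    division, detail = segment(address)
    if not division:
        return address
    return f"{correct(division, vocabulary)} {detail}".strip()


pairs = make_training_pairs(NAMES, SUFFIXES)
vocabulary = [positive for _, positive in pairs]
fixed = fix_address("Hangzhuo City 12 Wensan Road", vocabulary)
print(fixed)
```

In this sketch only the administrative division is rewritten; the detailed address text is passed through unchanged, which mirrors the claimed recombination of the target first granularity geographical area with the original second granularity geographical area.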
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010010139.XA CN113076746B (en) | 2020-01-06 | 2020-01-06 | Data processing method and system, storage medium and computing device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113076746A (en) | 2021-07-06 |
CN113076746B (en) | 2024-05-31 |
Family
ID=76609348
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010010139.XA Active CN113076746B (en) | 2020-01-06 | 2020-01-06 | Data processing method and system, storage medium and computing device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113076746B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001134602A (en) * | 1999-11-08 | 2001-05-18 | Nippon Telegr & Teleph Corp <Ntt> | Method and device for analyzing address and recording medium with address analysis program recorded thereon |
CN107622061A (en) * | 2016-07-13 | 2018-01-23 | 阿里巴巴集团控股有限公司 | A kind of method, apparatus and system for determining address uniqueness |
CN108920457A (en) * | 2018-06-15 | 2018-11-30 | 腾讯大地通途(北京)科技有限公司 | Address Recognition method and apparatus and storage medium |
CN109684624A (en) * | 2017-10-18 | 2019-04-26 | 北京京东尚科信息技术有限公司 | A kind of method and apparatus in automatic identification Order Address road area |
CN109960795A (en) * | 2019-02-18 | 2019-07-02 | 平安科技(深圳)有限公司 | A kind of address information standardized method, device, computer equipment and storage medium |
CN110569322A (en) * | 2019-07-26 | 2019-12-13 | 苏宁云计算有限公司 | Address information analysis method, device and system and data acquisition method |
Non-Patent Citations (1)
Title |
---|
Synchronization and identification of nonlinear systems by using a novel self-evolving interval type-2 fuzzy LSTM-neural network;Haiyue Wang et al;《Engineering Applications of Artificial Intelligence》;20190531;79-93 * |
Also Published As
Publication number | Publication date |
---|---|
CN113076746A (en) | 2021-07-06 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110837579B (en) | Video classification method, apparatus, computer and readable storage medium | |
CN112199375B (en) | Cross-modal data processing method and device, storage medium and electronic device | |
CN110069650B (en) | Searching method and processing equipment | |
CN108319888B (en) | Video type identification method and device and computer terminal | |
CN110738262B (en) | Text recognition method and related product | |
CN111783760A (en) | Character recognition method and device, electronic equipment and computer readable storage medium | |
CN114170468B (en) | Text recognition method, storage medium and computer terminal | |
CN113761105A (en) | Text data processing method, device, equipment and medium | |
CN111651674B (en) | Bidirectional searching method and device and electronic equipment | |
CN114880514A (en) | Image retrieval method, image retrieval device and storage medium | |
CN111666771A (en) | Semantic label extraction device, electronic equipment and readable storage medium of document | |
CN111274813B (en) | Language sequence labeling method, device storage medium and computer equipment | |
CN112651449B (en) | Method, device, electronic equipment and storage medium for determining content characteristics of video | |
CN113704509A (en) | Multimedia recommendation method and device, electronic equipment and storage medium | |
CN113076746B (en) | Data processing method and system, storage medium and computing device | |
CN113761917A (en) | Named entity identification method and device | |
CN114443904B (en) | Video query method, device, computer equipment and computer readable storage medium | |
CN116091956A (en) | Video-based micro-expression recognition method, device and storage medium | |
CN115687701A (en) | Text processing method | |
CN111401083B (en) | Name identification method and device, storage medium and processor | |
CN111538914B (en) | Address information processing method and device | |
CN110956034B (en) | Word acquisition method and device and commodity search method | |
CN114913444A (en) | Video processing method and device, and data training method, device and system | |
CN112287722A (en) | In-vivo detection method and device based on deep learning and storage medium | |
CN110929508B (en) | Word vector generation method, device and system |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |