CN114897504A - Method, device, storage medium and electronic equipment for processing repeated letters - Google Patents


Info

Publication number
CN114897504A
CN114897504A (application CN202210546548.0A)
Authority
CN
China
Prior art keywords
processed
repeated
letters
entity extraction
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210546548.0A
Other languages
Chinese (zh)
Inventor
李双贺
王颖
冯添
鄢阁俊
陈一朴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Peking University Software Engineering Co ltd
Original Assignee
Beijing Peking University Software Engineering Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Peking University Software Engineering Co., Ltd.
Priority to CN202210546548.0A
Publication of CN114897504A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Strategic Management (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present application provides a method, an apparatus, a storage medium and an electronic device for processing duplicate letters. The method includes: acquiring duplicate letters to be processed; performing entity extraction on the duplicate letters to obtain an entity extraction result; inputting the entity extraction result into a pre-trained classification model to obtain a classification result for the duplicate letters; and performing corresponding processing on the duplicate letters based on the classification result. By means of this technical scheme, the embodiment of the present application can at least reduce the pressure of manual review and improve review efficiency.

Description

Method and device for processing repeated letters, storage medium and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing duplicate letters, a storage medium, and an electronic device.
Background
A duplicate letter refers to the same person raising the same issue two or more times within a certain period. Duplicate letters can be divided into unprocessed duplicate letters and non-accepted duplicate letters.
At present, duplicate letters are mainly processed by manual inspection.
In the process of implementing the invention, the inventors found the following problem in the prior art: because the existing method for processing duplicate letters relies on manual inspection, its review efficiency is low.
Disclosure of Invention
The embodiments of the present application aim to provide a method, an apparatus, a storage medium and an electronic device for processing duplicate letters, so as to improve review efficiency.
In a first aspect, an embodiment of the present application provides a method for processing duplicate letters, the method including: acquiring duplicate letters to be processed; performing entity extraction on the duplicate letters to obtain an entity extraction result; inputting the entity extraction result into a pre-trained classification model to obtain a classification result for the duplicate letters; and performing corresponding processing on the duplicate letters based on the classification result.
Therefore, by means of this technical scheme, the embodiment of the present application can automatically identify duplicate letters; compared with the existing manual review method, it can at least reduce the pressure of manual review and improve review efficiency.
In one possible embodiment, performing entity extraction on the duplicate letters to be processed to obtain an entity extraction result includes: inputting the duplicate letters to be processed into a trained BiLSTM-CRF model to obtain the entity extraction result.
In one possible embodiment, the training process of the BiLSTM-CRF model includes: acquiring sample training data, where the sample training data is obtained by preprocessing sample duplicate letters, and preprocessing the sample duplicate letters includes adding identifiers for the punctuation marks in the sample duplicate letters; and training an initial BiLSTM-CRF model with the sample training data to obtain the trained BiLSTM-CRF model.
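The punctuation-identifier step described above might be sketched as follows. This is a minimal illustration only; the tag names ("PUNCT", "O") and the character-level tokenization are assumptions, not details taken from the patent.

```python
# Sketch: add an identifier label for punctuation marks in a sample letter,
# and a plain "O" label everywhere else. Tag names are illustrative.
PUNCTUATION = set("，。！？；：、,.!?;:")

def label_punctuation(text):
    """Return (character, label) pairs for one sample letter."""
    return [(ch, "PUNCT" if ch in PUNCTUATION else "O") for ch in text]

pairs = label_punctuation("a,b.")
```

A real preprocessing pipeline would merge these labels with the entity identifiers described later, but the mechanism is the same.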
In one possible embodiment, the duplicate letters to be processed involve target persons, and the entity extraction result includes the name of each target person, the identification number of the target person, the address of the target person, and the attribution of the letter problem corresponding to the target person.
In a second aspect, an embodiment of the present application provides an apparatus for processing duplicate letters, the apparatus comprising: a first acquisition module for acquiring duplicate letters to be processed; an entity extraction module for performing entity extraction on the duplicate letters to obtain an entity extraction result; an input module for inputting the entity extraction result into a pre-trained classification model to obtain a classification result for the duplicate letters; and a processing module for performing corresponding processing on the duplicate letters based on the classification result.
In one possible embodiment, the entity extraction module is configured to input the duplicate letters to be processed into the trained BiLSTM-CRF model to obtain the entity extraction result.
In one possible embodiment, the apparatus further comprises: a second acquisition module for acquiring sample training data, where the sample training data is obtained by preprocessing sample duplicate letters, and preprocessing the sample duplicate letters includes adding identifiers for the punctuation marks in the sample duplicate letters; and a training module for training an initial BiLSTM-CRF model with the sample training data to obtain the trained BiLSTM-CRF model.
In one possible embodiment, the duplicate letters to be processed involve target persons, and the entity extraction result includes the name of each target person, the identification number of the target person, the address of the target person, and the attribution of the letter problem corresponding to the target person.
In a third aspect, an embodiment of the present application provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program performs the method according to the first aspect or any optional implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the method of the first aspect or any of the alternative implementations of the first aspect.
In a fifth aspect, the present application provides a computer program product which, when run on a computer, causes the computer to perform the method of the first aspect or any possible implementation manner of the first aspect.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
FIG. 1 illustrates a flow chart of a method of handling duplicate correspondence as provided by an embodiment of the present application;
FIG. 2 is a block diagram illustrating an apparatus for handling duplicate letters according to an embodiment of the present application;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The main source of duplicate-letter text data is the duplicate-letter records registered by the information center. These records mainly contain the issues reflected by target persons (the persons who send the letters), letter numbers, partial names, certificate numbers, letter purposes, and the like, and some of this data may be missing. In order to further improve the efficiency of letter handling, solve the problem of a high letter repetition rate, vigorously promote the resolution of problems and contradictions, and effectively safeguard the legitimate rights and interests of the public, it is urgently necessary to greatly reduce duplicate-letter events within a short time (for example, three years).
By its nature, automatic identification of duplicate letters is a problem of information extraction and multi-class classification. Information extraction extracts specific event or fact information from natural language text to help automatically classify, extract and reconstruct massive content; such information typically includes entities, relationships, and events, for example the time, place and key people in a news report, or the product name, development time and performance indicators in a technical document. In duplicate-letter processing, the name, certificate number, address and problem location of the target person are mainly extracted from the complaint content, and the complaint content of the complete letter is supplemented so that the relevant personnel can check and review it.
In the field of text classification, implementation methods can be roughly divided into two categories: traditional text classification and deep learning. Traditional algorithms include naive Bayes and the like; although they are widely used, their feature-expression capability still needs improvement and their classification effect is not optimal. With advances in deep learning, many deep learning algorithms such as the TextRNN and FastText models have been applied to text classification tasks. They represent text by vectorizing words (for example with word2vec) and then learn feature representations automatically, so that complicated manual feature engineering is unnecessary and classification performance improves. In recent years, large-scale general pre-trained models such as BERT and GPT have appeared in succession; such pre-trained language models learn from massive data, store what they learn in the form of parameters, and can achieve state-of-the-art (SOTA) performance on downstream tasks after appropriate fine-tuning.
At present, most existing automatic identification and classification tasks related to letter-handling service types are based on traditional classification algorithms; although they can accomplish the classification task, their accuracy needs improvement. Moreover, most of the information texts related to the case source are long texts, and traditional text classification cannot represent the semantics of the original text well.
Based on the above, the embodiment of the present application provides a scheme for processing duplicate letters: duplicate letters to be processed are acquired; entity extraction is performed on them to obtain an entity extraction result; the entity extraction result is input into a pre-trained classification model to obtain a classification result; and corresponding processing is performed on the duplicate letters based on the classification result.
Therefore, by means of this technical scheme, the embodiment of the present application can automatically identify duplicate letters; compared with the existing manual review method, it can at least reduce the pressure of manual review and improve review efficiency.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for handling duplicate letters according to an embodiment of the present application. The method shown in fig. 1 may be performed by an apparatus for handling duplicate letters, which may be the apparatus shown in fig. 2. The specific form of the device can be set according to actual requirements, and the embodiment of the present application is not limited thereto; for example, the device may be a computer, a server, or the like. Specifically, the method shown in fig. 1 includes:
and step S110, acquiring the repeated letters to be processed.
It should be understood that the method for acquiring duplicate letters to be processed may be set according to actual requirements, and the embodiments of the present application are not limited thereto.
For example, the duplicate letters to be processed can be obtained from the duplicate letter data registered in the information center.
Step S120, performing entity extraction on the duplicate letters to be processed to obtain an entity extraction result.
It should be understood that the specific process of performing entity extraction on the duplicate letters to be processed may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
Optionally, the duplicate letters to be processed can be input into a trained BiLSTM-CRF model to obtain the entity extraction result.
For example, a duplicate letter to be processed may be input into the trained BiLSTM-CRF model, which outputs a position vector identifying where each entity extraction result lies. The position vector includes a start position and an end position, so the passage containing the entity-related content in the duplicate letter can be determined from them. After the position vector is obtained, the content between the start position and the end position is extracted from the duplicate letter using these positions, yielding the entity extraction result.
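The span-extraction step described above can be sketched as follows. This is a simplified illustration; the patent does not specify the exact output format, so representing each position vector as a (start, end) tuple of character indices is an assumption.

```python
def extract_entities(text, position_vectors):
    """Slice entity contents out of a letter given (start, end) pairs.

    Each position vector identifies where one entity extraction result
    lies: `start` is the index of its first character, `end` is one past
    its last character.
    """
    return [text[start:end] for start, end in position_vectors]

# Hypothetical letter text and position vectors as a model might emit them.
letter = "Name: Zhang San; Addr: Haidian"
entities = extract_entities(letter, [(6, 15), (23, 30)])
```

In a full system the position vectors would come from the model's decoded tag sequence rather than being supplied by hand.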
It should be understood that the specific structure, training process, etc. of the BiLSTM-CRF model can be set according to actual requirements, and the embodiments of the present application are not limited thereto.
Optionally, sample duplicate letters may be obtained and labeled as 'resolved' or 'unresolved', ensuring that the data in each data set undergoes element extraction and completion to support the classification. That is, entity extraction is performed according to the complaint content of the letters.
Based on this, the embodiment of the present application may add a corresponding index to each category to facilitate the score calculation, as shown in Table 1 below.
TABLE 1
(Table 1 appears only as an image in the original publication.)
$X_{i,y_i}$ denotes a state score, where $i$ refers to the position corresponding to the element and $y_i$ represents the index of the category. For example, according to Table 1, $X_{i=1,y_i=2} = X_{w_1,\text{B-organisation}} = 0.1$.
A transition matrix exists among the parameters of the BiLSTM-CRF model, and the scores in it are the transition scores. Before the model is trained, the transition matrix can be initialized randomly; its scores are then updated continuously during training, which constitutes the constraints that the CRF learns by itself. The loss function of the CRF is composed of the real-path score and the total score of all paths. Let the score of each possible path be $P_i$; the correct path is the real path. The total path score is:
$P_{total} = P_1 + P_2 + \dots + P_N = e^{S_1} + e^{S_2} + \dots + e^{S_N}$

where $P_{total}$ is the total score of all paths, $e$ is the base of the natural logarithm, and $S_i$ is the sum of the state scores and transition scores along the $i$-th path.
The loss function is $\text{LossFunction} = P_{RealPath} / P_{total}$, where $P_{RealPath}$ is the real-path score.
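The total path score and the loss ratio defined above can be computed as in the following sketch; the path scores $S_i$ used here are made-up example values.

```python
import math

def crf_path_probability(path_scores, real_path_index):
    """Compute P_total = e^{S_1} + ... + e^{S_N} and return the ratio
    P_RealPath / P_total described in the text."""
    exp_scores = [math.exp(s) for s in path_scores]
    p_total = sum(exp_scores)
    return exp_scores[real_path_index] / p_total

# Three candidate paths with illustrative scores S_i; the real path is index 0.
ratio = crf_path_probability([2.0, 1.0, 0.5], 0)
```

Note that the ratios over all candidate paths sum to one, which is what makes this quantity usable as a path probability during training.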
And, parameters of the model can also be defined, see table 2 below.
TABLE 2
(Table 2 appears only as an image in the original publication.)
The class-weighted convolutional neural network updates its weights according to an algorithm during training: the network output error is obtained by forward computation, and the network weights are then updated backwards so that the output error is minimized. The output of the convolutional neural network is $y = [y_1, y_2, \dots, y_N]$, $y \in R^{N \times M}$, where $N$ is the number of samples in a batch and $M$ is the total number of classes in the input data set.
Based on the above arrangement, the embodiment of the application can obtain sample duplicate letters from a target database (for example, a letter-office database). The sample duplicate letters can then be preprocessed to carry out entity marking and thus obtain the sample training data. For example, a corresponding identifier may be added to the punctuation marks in a sample duplicate letter so that the BiLSTM-CRF model can learn knowledge about punctuation from the identifier; for another example, corresponding identifiers can be set for the name of the target person, the identification number of the target person, the address of the target person, and the attribution of the letter problem corresponding to the target person.
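Entity marking of this kind is commonly done with BIO-style tags; the following sketch illustrates the idea. The tag names (PER, ADDR) and the sample spans are illustrative assumptions, not labels taken from the patent.

```python
# Sketch of character-level BIO entity marking for one sample letter.
# "B-PER" marks the beginning of a person name, "I-PER" its continuation,
# and "O" marks characters outside any entity.
def bio_label(text, spans):
    """spans: list of (start, end, entity_type) tuples over `text`."""
    labels = ["O"] * len(text)
    for start, end, ent in spans:
        labels[start] = f"B-{ent}"
        for i in range(start + 1, end):
            labels[i] = f"I-{ent}"
    return list(zip(text, labels))

labeled = bio_label("Zhang San lives at No.1", [(0, 9, "PER"), (19, 23, "ADDR")])
```

The (character, label) pairs produced this way are the form of supervision a BiLSTM-CRF tagger typically trains on.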
Therefore, after the sample training data is obtained, the initial BiLSTM-CRF model can be trained with it to obtain the trained BiLSTM-CRF model. The initial BiLSTM-CRF model may also be referred to as an untrained BiLSTM-CRF model.
After the trained BiLSTM-CRF model is obtained by the above method, entity extraction can be carried out on the duplicate letters to be processed based on it. In this extraction process, no entity labeling is needed; entity labeling occurs only in the training stage.
In addition, when determining hyperparameters such as the learning rate, batch size, and gradient-update threshold, a comparison experiment can be carried out among the HMM, BiLSTM, and BiLSTM-CRF models; by comparing the precision, recall, F1 value and other metrics for each category, the BiLSTM-CRF model is determined to be optimal.
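The per-category metrics used in such a comparison can be computed as in this sketch; the counts are illustrative values, not results from the patent's experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one category from true
    positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```

Averaging these per-category values (macro or weighted) then gives a single number per model for the comparison experiment.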
It should be noted here that, since the field of entity extraction began to develop, there have been three main categories of methods: rule- or dictionary-based methods, traditional machine-learning-based methods, and the deep-learning-based methods that have advanced rapidly in recent years. Although the rule-based approach has high accuracy, it requires a large amount of manpower and material resources to mine the associations present in the text and then extract appropriate rules for them; because these rules are specific to their data set, they transfer poorly and cannot be applied to other data sets, and when the data set changes, the extracted rules must change with it. The machine-learning-based approach depends heavily on the corpus, places high demands on text feature selection, and has long training times and high cost. In recent years, methods based on deep learning models have gradually become popular, and neural networks can handle many natural language processing problems well. Compared with the other two approaches, the deep-learning-based approach depends little on the corpus, is widely applicable, and achieves higher accuracy. Therefore, the embodiment of the present application performs entity extraction with the BiLSTM-CRF deep learning model, which has performed well in recent years.
Step S130, inputting the entity extraction result into a pre-trained classification model to obtain the classification result of the duplicate letters to be processed.
It should be understood that the specific model, the model structure, the error function, and the like of the classification model may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, a typical error function is the cross-entropy loss:

$E = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} t_{ij}\ln y_{ij}$

where $t_{ij}$ is 1 if the true label of the $i$-th sample is class $j$ and 0 otherwise, and $y_{ij}$ is the corresponding network output.
However, the improved error function $E$ proposed by the present invention weights each term with a label weight:

$E = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} c_{ij}\, t_{ij}\ln y_{ij}$
where $c_{ij}$ is the label weight of class $j$ when the true label of the $i$-th sample is class $j$.
The specific calculation is $c_j = T/d_j$, where $T$ is a hyperparameter and $d_j$ is the total number of texts with the $j$-th class label in the training set; the label weight is thus inversely proportional to the total number of texts with that label.
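The label-weight computation $c_j = T/d_j$ can be sketched as follows; the hyperparameter value and label counts are illustrative.

```python
from collections import Counter

def label_weights(train_labels, T):
    """Compute c_j = T / d_j for each class label j, where d_j is the
    number of training texts carrying label j."""
    counts = Counter(train_labels)
    return {label: T / d for label, d in counts.items()}

# Imbalanced toy training set: 90 'resolved' vs. 10 'unresolved' texts.
weights = label_weights(["resolved"] * 90 + ["unresolved"] * 10, T=10.0)
```

As intended, the rarer class receives the larger weight, which counteracts class imbalance in the weighted error function.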
It should also be understood that the specific content of the classification result of the duplicate letters to be processed may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, the classification result of a duplicate letter to be processed can be whether the letter falls within the handling range. Letters marked 'suggestion' or 'release notice' are not included in the handling range; letters marked 'three-span three-separation' are not included in the handling range; and letter items whose problem location is the target area are not included in the handling range.
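The handling-range rules quoted above might be expressed as in this sketch. The category strings are English renderings of the translated labels and, like the function signature, are assumptions for illustration only.

```python
# Sketch of the handling-range exclusion rules described in the text.
EXCLUDED_PURPOSES = {"suggestion", "release notice"}
EXCLUDED_MARKS = {"three-span three-separation"}

def in_handling_range(purpose, mark, problem_location, target_area):
    """Return True if a letter falls within the handling range."""
    if purpose in EXCLUDED_PURPOSES:
        return False
    if mark in EXCLUDED_MARKS:
        return False
    if problem_location == target_area:
        return False
    return True

ok = in_handling_range("complaint", "", "District A", "District B")
```

In the patent's scheme this decision is produced by the trained classification model rather than by hand-written rules; the rules here only restate the example outcomes.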
Step S140, performing corresponding processing on the duplicate letters to be processed based on their classification result.
It should be understood that the specific process of performing corresponding processing on the duplicate letters based on the classification result may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, when the classification result of a duplicate letter to be processed indicates that it falls within the handling range, whether it is 'resolved' is judged. If it is 'resolved', the resolution rate can be calculated, the case can be determined to be a duplicate, and the label of the duplicate letter in the specified database is updated; if it is not resolved, the relevant workers are notified so that they can process the letter, and the label of the duplicate letter in the specified database is updated.
For another example, when the classification result indicates that the duplicate letter does not fall within the handling range, the relevant workers are notified so that they can process the letter, and the label of the duplicate letter in the specified database is updated.
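The dispatch logic in the two examples above might be sketched as follows. The label strings and the notify/update callbacks are illustrative assumptions standing in for the database update and worker notification described in the text.

```python
def process_duplicate_letter(classification, resolved, update_label, notify):
    """Dispatch a duplicate letter based on its classification result.

    `update_label` and `notify` are caller-supplied callbacks standing in
    for the database update and worker notification.
    """
    if classification == "in_handling_range" and resolved:
        update_label("duplicate-resolved")
        return "count toward resolution rate"
    # Unresolved, or outside the handling range: hand off to a worker.
    notify()
    update_label("pending-manual-processing")
    return "notified"

actions = []
result = process_duplicate_letter(
    "in_handling_range", True, actions.append, lambda: actions.append("notify"))
```

A production system would replace the callbacks with real database and messaging operations, but the branch structure mirrors the two examples.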
Therefore, by virtue of the strong automatic feature-extraction capability of deep neural networks, the embodiment of the present application can effectively compensate for the data loss inherent in the manual method, reduce the pressure of manually reviewing duplicate-letter information, and promote the modernization of social-governance capability.
It should be understood that the above method for handling duplicate letters is only exemplary, and those skilled in the art can make various changes, modifications or alterations according to the above method.
Referring to fig. 2, fig. 2 is a block diagram illustrating an apparatus 200 for handling duplicate letters according to an embodiment of the present application. It should be understood that the apparatus 200 corresponds to the above method embodiment and can perform the steps related to it; the specific functions of the apparatus 200 are as described above, and detailed description is omitted here where appropriate to avoid redundancy. The apparatus 200 includes at least one software function module that can be stored in a memory in the form of software or firmware, or solidified in the Operating System (OS) of the apparatus 200. Specifically, the apparatus 200 includes:
a first obtaining module 210, configured to obtain duplicate letters to be processed;
an entity extraction module 220, configured to perform entity extraction on the duplicate letters to obtain an entity extraction result;
an input module 230, configured to input the entity extraction result into a pre-trained classification model to obtain a classification result for the duplicate letters;
and a processing module 240, configured to perform corresponding processing on the duplicate letters based on the classification result.
In one possible embodiment, the entity extraction module 220 is configured to input the duplicate letters to be processed into the trained BiLSTM-CRF model to obtain the entity extraction result.
In one possible embodiment, the apparatus 200 further comprises: a second acquisition module (not shown) for acquiring sample training data, where the sample training data is obtained by preprocessing sample duplicate letters, and preprocessing the sample duplicate letters includes adding identifiers for the punctuation marks in the sample duplicate letters; and a training module (not shown) for training an initial BiLSTM-CRF model with the sample training data to obtain the trained BiLSTM-CRF model.
In one possible embodiment, the duplicate letters to be processed involve target persons, and the entity extraction result includes the name of each target person, the identification number of the target person, the address of the target person, and the attribution of the letter problem corresponding to the target person.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method and will not be repeated here.
Referring to fig. 3, fig. 3 is a block diagram of an electronic device 300 according to an embodiment of the present disclosure. The electronic device 300 may include a processor 310, a communication interface 320, a memory 330, and at least one communication bus 340, where the communication bus 340 is used to realize direct connection communication among these components. The communication interface 320 in the embodiment of the present application is used for communicating signaling or data with other devices. The processor 310 may be an integrated circuit chip with signal processing capability. The processor 310 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by the processor 310. A general-purpose processor may be a microprocessor, or the processor 310 may be any conventional processor, and so on.
The memory 330 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 330 stores computer-readable instructions, and when the computer-readable instructions are executed by the processor 310, the electronic device 300 can perform the steps of the above method embodiments.
The electronic device 300 may further include a memory controller, an input-output unit, an audio unit, and a display unit.
The memory 330, the memory controller, the processor 310, the peripheral interface, the input/output unit, the audio unit, and the display unit are electrically connected to each other directly or indirectly to implement data transmission or interaction. For example, these elements may be electrically connected to each other via one or more communication buses 340. The processor 310 is used to execute the executable modules stored in the memory 330. Also, the electronic device 300 is configured to perform the following method: acquiring repeated letters to be processed; performing entity extraction on the repeated letters to be processed to obtain an entity extraction result; inputting the entity extraction result into a pre-trained classification model to obtain a classification result of the repeated letters to be processed; and performing corresponding processing on the repeated letters to be processed based on the classification result of the repeated letters to be processed.
The input and output unit is used for providing input data for a user to realize the interaction of the user and the server (or the local terminal). The input/output unit may be, but is not limited to, a mouse, a keyboard, and the like.
The audio unit provides an audio interface to the user, which may include one or more microphones, one or more speakers, and audio circuitry.
The display unit provides an interactive interface (e.g., a user interface) between the electronic device and a user, or is used to display image data for the user's reference. In this embodiment, the display unit may be a liquid crystal display or a touch display. In the case of a touch display, it can be a capacitive touch screen or a resistive touch screen supporting single-point and multi-point touch operations. Supporting single-point and multi-point touch operations means that the touch display can sense touch operations generated simultaneously from one or more positions on the touch display and send the sensed touch operations to the processor for calculation and processing.
It will be appreciated that the configuration shown in fig. 3 is merely illustrative and that the electronic device 300 may include more or fewer components than shown in fig. 3 or may have a different configuration than shown in fig. 3. The components shown in fig. 3 may be implemented in hardware, software, or a combination thereof.
The present application also provides a storage medium having a computer program stored thereon, which, when executed by a processor, performs the method of the method embodiments.
The present application also provides a computer program product which, when run on a computer, causes the computer to perform the method of the method embodiments.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding process in the foregoing method for the specific working process of the system described above, which is not described in detail here.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the portions of the technical solution of the present application that substantially contribute over the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code. It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of handling duplicate correspondence, comprising:
acquiring repeated letters to be processed;
performing entity extraction on the repeated letters to be processed to obtain an entity extraction result;
inputting the entity extraction result into a pre-trained classification model to obtain a classification result of the repeated letters to be processed;
and performing corresponding processing on the repeated letters to be processed based on the classification result of the repeated letters to be processed.
2. The method of claim 1, wherein performing entity extraction on the repeated letters to be processed to obtain an entity extraction result comprises:
inputting the repeated letters to be processed into a trained BiLSTM-CRF model to obtain the entity extraction result.
3. The method of claim 2, wherein the training process of the BiLSTM-CRF model comprises:
acquiring sample training data; the sample training data is obtained by preprocessing sample repeated letters, and the preprocessing of the sample repeated letters comprises adding identifiers to punctuation marks in the sample repeated letters;
and training an initial BiLSTM-CRF model by using the sample training data to obtain the trained BiLSTM-CRF model.
4. The method according to claim 1 or 2, wherein the repeated letters to be processed involve a target person, and the entity extraction result comprises the name of the target person, the identification number of the target person, the address of the target person, and the attribution of the letter issue corresponding to the target person.
5. An apparatus for processing duplicate letters, comprising:
the first acquisition module is used for acquiring the repeated letters to be processed;
the entity extraction module is used for performing entity extraction on the repeated letters to be processed to obtain an entity extraction result;
the input module is used for inputting the entity extraction result into a pre-trained classification model to obtain a classification result of the repeated letters to be processed;
and the processing module is used for performing corresponding processing on the repeated letters to be processed based on the classification result of the repeated letters to be processed.
6. The apparatus of claim 5, wherein the entity extraction module is configured to input the repeated letters to be processed into a trained BiLSTM-CRF model to obtain the entity extraction result.
7. The apparatus of claim 6, further comprising:
the second acquisition module is used for acquiring sample training data; the sample training data is obtained by preprocessing sample repeated letters, and the preprocessing of the sample repeated letters comprises adding identifiers to punctuation marks in the sample repeated letters;
and the training module is used for training an initial BiLSTM-CRF model by using the sample training data to obtain the trained BiLSTM-CRF model.
8. The apparatus of claim 5 or 6, wherein the repeated letters to be processed involve a target person, and the entity extraction result comprises the name of the target person, the identification number of the target person, the address of the target person, and the attribution of the letter issue corresponding to the target person.
9. A storage medium, having stored thereon a computer program for performing, when executed by a processor, a method of handling duplicate correspondence according to any one of claims 1-4.
10. An electronic device, characterized in that the electronic device comprises: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the method of processing duplicate correspondence of any of claims 1-4.
CN202210546548.0A 2022-05-20 2022-05-20 Method, device, storage medium and electronic equipment for processing repeated letters Pending CN114897504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210546548.0A CN114897504A (en) 2022-05-20 2022-05-20 Method, device, storage medium and electronic equipment for processing repeated letters


Publications (1)

Publication Number Publication Date
CN114897504A true CN114897504A (en) 2022-08-12

Family

ID=82723522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210546548.0A Pending CN114897504A (en) 2022-05-20 2022-05-20 Method, device, storage medium and electronic equipment for processing repeated letters

Country Status (1)

Country Link
CN (1) CN114897504A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9779388B1 (en) * 2014-12-09 2017-10-03 Linkedin Corporation Disambiguating organization names
CN110990525A (en) * 2019-11-15 2020-04-10 华融融通(北京)科技有限公司 Natural language processing-based public opinion information extraction and knowledge base generation method
CN111143567A (en) * 2019-12-30 2020-05-12 成都数之联科技有限公司 Comment emotion analysis method based on improved neural network
CN113962293A (en) * 2021-09-29 2022-01-21 中国科学院计算机网络信息中心 LightGBM classification and representation learning-based name disambiguation method and system


Similar Documents

Publication Publication Date Title
CN108804512B (en) Text classification model generation device and method and computer readable storage medium
US20230297582A1 (en) Systems and methods for automatic clustering and canonical designation of related data in various data structures
AU2020327704B2 (en) Classification of data using aggregated information from multiple classification modules
CN112184525B (en) System and method for realizing intelligent matching recommendation through natural semantic analysis
CN108885623B (en) Semantic analysis system and method based on knowledge graph
US11281860B2 (en) Method, apparatus and device for recognizing text type
US9779388B1 (en) Disambiguating organization names
WO2020139735A1 (en) Account manager virtual assistant using machine learning techniques
US10755045B2 (en) Automatic human-emulative document analysis enhancements
US10216838B1 (en) Generating and applying data extraction templates
CA3126644A1 (en) System and method for matching of database records based on similarities to search queries
US20170068866A1 (en) Method and system for data extraction from images of semi-structured documents
CA3048356A1 (en) Unstructured data parsing for structured information
CN112667825B (en) Intelligent recommendation method, device, equipment and storage medium based on knowledge graph
CN113095076A (en) Sensitive word recognition method and device, electronic equipment and storage medium
Nurkholis et al. Comparison of Kernel Support Vector Machine Multi-Class in PPKM Sentiment Analysis on Twitter
CN110866102A (en) Search processing method
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
US20220148049A1 (en) Method and system for initiating an interface concurrent with generation of a transitory sentiment community
CN111858942A (en) Text extraction method and device, storage medium and electronic equipment
CN112579781A (en) Text classification method and device, electronic equipment and medium
CN114897504A (en) Method, device, storage medium and electronic equipment for processing repeated letters
Saputri et al. Sentiment analysis on shopee e-commerce using the naïve bayes classifier algorithm
CN114154480A (en) Information extraction method, device, equipment and storage medium
CN113656393B (en) Data processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220812
