CN109284483B - Text processing method and device, storage medium and electronic equipment - Google Patents

Text processing method and device, storage medium and electronic equipment

Info

Publication number
CN109284483B
Authority
CN
China
Prior art keywords
text
abnormal
processed
identifier
processing method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811413346.9A
Other languages
Chinese (zh)
Other versions
CN109284483A (en
Inventor
滕召荣
李坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Golden Panda Ltd
Original Assignee
Golden Panda Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Golden Panda Ltd
Priority to CN201811413346.9A
Publication of CN109284483A
Application granted
Publication of CN109284483B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a text processing method, a text processing apparatus, a computer-readable storage medium, and an electronic device. The text processing method provided by the embodiments of the disclosure comprises the following steps: detecting whether the text to be processed contains an abnormal identifier; if the text to be processed is detected to contain an abnormal identifier, performing text cleaning on the abnormal identifier; and performing structuring processing on the text to be processed to obtain structured data. The method largely preserves the valid data in the text to be processed and avoids the problem of data loss.

Description

Text processing method and device, storage medium and electronic equipment
Technical Field
The disclosure relates to the technical field of computers, and in particular relates to a text processing method, a text processing device, a computer readable storage medium and electronic equipment.
Background
Structuring is an important technique in natural language processing (NLP): structuring a text means extracting the desired content from natural-language text to form structured data. This necessarily relies on tools such as regular expressions and lexicons to match the content that becomes structured data. A normal Chinese medical text should consist mostly of Chinese characters, possibly interspersed with a small number of digits, letters, or special characters. If a large number of digits, English letters, or abnormal symbols appear in a piece of text, the text can be considered abnormal. When abnormal medical text is structured, regular-expression matching, which is greedy, consumes a great deal of resources and also generates a very large number of data objects (possibly millions). The operating system then lacks the resources to process them, the load becomes very high, and processing a single abnormal text may take days, which makes the computation infeasible. Detecting and cleaning abnormal medical text is therefore an important technique, and one that is not easy to get right.
Current methods for checking medical text for abnormalities and raising alarms mainly take the following two forms:
The first is abnormality matching: the medical text is checked for runs of consecutive digits, English characters, or special characters; if any are found, the text is considered abnormal, an alarm is raised, and the text is discarded.
The second is timeout detection: a timeout mechanism is set when the medical text is structured. Structuring a normal medical text generally takes only a limited time, so when the time spent on structuring reaches a certain threshold, the text is considered abnormal and is discarded.
Both methods judge abnormal text along a single dimension, and a large amount of useful data that could have been structured is discarded during the abnormality check, which causes a serious loss of normal data.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure aims to provide a text processing method, a text processing apparatus, a computer-readable storage medium, and an electronic device that overcome, at least to some extent, the serious data loss caused by the limitations and shortcomings of the related art.
According to one aspect of the present disclosure, there is provided a text processing method, which is characterized by comprising:
detecting whether the text to be processed contains an abnormal identifier;
if the text to be processed is detected to contain an abnormal identifier, performing text cleaning on the abnormal identifier;
and performing structuring processing on the text to be processed to obtain structured data.
In an exemplary embodiment of the present disclosure, the detecting whether the text to be processed contains an abnormal identifier includes:
detecting the length of the text to be processed, and judging whether the length is greater than a preset threshold;
if the length is greater than the preset threshold, detecting whether the text to be processed contains an abnormal identifier.
In an exemplary embodiment of the present disclosure, the detecting whether the text to be processed contains an abnormal identifier further includes:
if the length is less than or equal to the preset threshold, performing structuring processing on the text to be processed to obtain structured data.
In an exemplary embodiment of the present disclosure, the structuring the text to be processed to obtain structured data includes:
detecting abnormal characteristics of the text to be processed to judge whether the text to be processed is a normal text or an abnormal text;
and if the text to be processed is judged to be the normal text, carrying out structuring processing on the text to be processed to obtain structured data.
In an exemplary embodiment of the present disclosure, the detecting the abnormal feature of the text to be processed to determine whether the text to be processed is a normal text or an abnormal text includes:
detecting whether the text to be processed contains continuous non-Chinese fields or not;
if the text to be processed is detected to contain continuous non-Chinese fields, judging that the text to be processed is an abnormal text;
and if the text to be processed does not contain continuous non-Chinese fields, judging that the text to be processed is a normal text.
In an exemplary embodiment of the disclosure, the structuring the text to be processed to obtain structured data further includes:
if the text to be processed is judged to be the abnormal text, the abnormal text is imported into an abnormal text set;
and when the abnormal text set meets the preset condition, sending abnormal text prompt information.
In an exemplary embodiment of the present disclosure, after sending the abnormal text prompt, the method further includes:
and analyzing the abnormal text in the abnormal text set, and acquiring an abnormal identifier to form an abnormal identifier set.
According to an aspect of the present disclosure, there is provided a text processing apparatus, characterized by comprising:
the detection module is configured to detect whether the text to be processed contains an abnormal identifier or not;
the cleaning module is configured to, if the text to be processed is detected to contain an abnormal identifier, perform text cleaning on the part of the text that contains the abnormal identifier;
and the processing module is configured to perform structuring processing on the text to be processed to obtain structured data.
According to one aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements any of the above-described text processing methods.
According to one aspect of the present disclosure, there is provided an electronic device, characterized by comprising a processor and a memory; wherein the memory is for storing executable instructions of the processor, the processor being configured to perform any of the text processing methods described above via execution of the executable instructions.
With the text processing method provided by the embodiments of the present disclosure, abnormal-identifier detection on the text to be processed allows the abnormal parts of the text to be cleaned rather than the whole text simply being discarded, so that the valid data in the text is largely preserved and data loss is avoided.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
Fig. 1 schematically illustrates a flow chart of steps of a text processing method in an exemplary embodiment of the present disclosure.
Fig. 2 schematically illustrates a flow chart of steps of a text processing method in another exemplary embodiment of the present disclosure.
Fig. 3 schematically illustrates a flow chart of steps of a text processing method in another exemplary embodiment of the present disclosure.
Fig. 4 schematically shows a flow chart of steps of a text processing method in another exemplary embodiment of the present disclosure.
Fig. 5 schematically shows a block diagram of a text processing apparatus in an exemplary embodiment of the present disclosure.
Fig. 6 schematically illustrates a schematic diagram of a program product in an exemplary embodiment of the present disclosure.
Fig. 7 schematically illustrates a block diagram of an electronic device in an exemplary embodiment of the present disclosure.
Fig. 8 schematically illustrates a flow chart of a text processing method in an application scenario according to an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
In an exemplary embodiment of the present disclosure, a text processing method is provided, which is mainly applicable to the structuring processing of medical texts, so as to obtain structured data from the medical texts. The medical text can comprise various electronic texts containing medical data, such as outpatient medical records, inpatient medical records and the like. Referring to fig. 1, the text processing method provided in this embodiment mainly includes the following steps:
S110. Detecting whether the text to be processed contains an abnormal identifier.
First, the content of the text to be processed is examined to determine whether it contains an abnormal identifier. Abnormal identifiers may include garbled characters and meaningless strings of digits, letters, or special characters. To detect them more reliably, a detection rule can be defined, and any text that satisfies the rule is judged to be an abnormal identifier. Alternatively, an abnormal identifier set can be preset and its entries matched against the text to be processed; any matching text is judged to be an abnormal identifier.
S120. If the text to be processed is detected to contain an abnormal identifier, performing text cleaning on the abnormal identifier.
According to the detection result of step S110, if the text to be processed is found to contain an abnormal identifier, this step cleans it. The purpose of text cleaning is to ensure that the text to be processed no longer contains abnormal identifiers. To improve the cleaning effect, this embodiment can clean cyclically, that is, repeat detection and cleaning until no abnormal identifier is detected in the text, at which point cleaning is considered complete. An abnormal identifier may be cleaned by deleting it directly, by replacing it with a specific recognizable marker, or by any other text processing method; this exemplary embodiment places no particular limit on the method.
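The detect-and-clean loop of steps S110 and S120 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the identifier set `ABNORMAL_IDENTIFIERS`, the rule `ABNORMAL_RULE`, and the `max_rounds` safety cap are all hypothetical values chosen for the example.

```python
import re

# Hypothetical abnormal identifier set; the source does not list concrete entries.
ABNORMAL_IDENTIFIERS = ["\ufffd", "&nbsp;", "#####"]

# Hypothetical detection rule: a run of 20 or more ASCII letters or digits.
ABNORMAL_RULE = re.compile(r"[A-Za-z0-9]{20,}")


def find_abnormal(text):
    """Step S110: return one abnormal identifier found in text, or None."""
    for ident in ABNORMAL_IDENTIFIERS:
        if ident in text:
            return ident
    match = ABNORMAL_RULE.search(text)
    return match.group(0) if match else None


def clean_text(text, replacement="", max_rounds=1000):
    """Step S120: repeat detection and cleaning until nothing abnormal remains.

    max_rounds is an assumed guard against a replacement marker that would
    itself be detected as abnormal.
    """
    for _ in range(max_rounds):
        ident = find_abnormal(text)
        if ident is None:
            break
        text = text.replace(ident, replacement)
    return text
```

Cyclic cleaning matters when removing one identifier exposes another: in `"头痛##&nbsp;###三天"`, deleting `&nbsp;` joins the surrounding characters into `#####`, which is only caught on the next round.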
S130. Performing structuring processing on the text to be processed to obtain structured data.
After the text cleaning in step S120, this step structures the text to be processed, which no longer contains abnormal identifiers, to obtain structured data. In general, structuring first segments the text to be processed, then extracts information from the segmented text, and finally yields the structured data. Structuring may make use of a data processing platform or computation engine, such as Apache Spark.
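The segment-then-extract flow of step S130 might look like the sketch below. The source does not specify the segmentation method or the extraction rules, so everything here is an assumption: splitting on punctuation stands in for real segmentation, and the two patterns in `FIELD_PATTERNS` are hypothetical examples of extraction rules.

```python
import re

# Hypothetical extraction rules: field name -> pattern with one capture group.
FIELD_PATTERNS = {
    "temperature_c": re.compile(r"体温(\d+(?:\.\d+)?)"),
    "heart_rate": re.compile(r"心率(\d+)"),
}


def split_segments(text):
    """Split text into clauses on common Chinese and ASCII punctuation."""
    return [s for s in re.split(r"[，。；,;\n]", text) if s.strip()]


def structure(text):
    """Extract the first value found for each field into a flat record."""
    record = {}
    for segment in split_segments(text):
        for field, pattern in FIELD_PATTERNS.items():
            match = pattern.search(segment)
            if match and field not in record:
                record[field] = match.group(1)
    return record
```

For instance, `structure("患者体温38.5℃，心率90次/分。")` yields a record with `temperature_c` set to `"38.5"` and `heart_rate` set to `"90"`. A production system would typically replace the punctuation split with lexicon-based word segmentation.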
By performing abnormal identifier detection on the text to be processed, the text processing method provided by this exemplary embodiment can clean the abnormal parts of the text instead of simply discarding the whole text, so that the valid data in the text is largely preserved and serious data loss is avoided.
On the basis of the above exemplary embodiment, another embodiment of the present disclosure provides a text processing method in which step S110, detecting whether the text to be processed contains an abnormal identifier, may include the following steps, as shown in Fig. 2:
S211. Detecting the length of the text to be processed, and judging whether the length is greater than a preset threshold.
Because abnormal medical text usually contains an extremely long run of digits, English letters, or special characters, the text length is detected and compared with a preset threshold before abnormal identifiers are detected and cleaned. The preset threshold may be tied directly to information such as the source and type of the text to be processed, or may be obtained by statistical analysis of historical text processing data, for example 2000 characters or 3000 characters; this exemplary embodiment places no particular limit on it.
S212. If the length is greater than the preset threshold, detecting whether the text to be processed contains an abnormal identifier.
According to the detection result of step S211, if the length of the text to be processed is greater than the preset threshold, the text is considered very likely to be abnormal, and detection of abnormal identifiers proceeds so that text cleaning and the subsequent structuring processing can follow.
Based on this embodiment, step S110, detecting whether the text to be processed contains an abnormal identifier, may further include the following step:
S213. If the length is less than or equal to the preset threshold, performing structuring processing on the text to be processed to obtain structured data.
If the detection result of step S211 is that the length of the text to be processed is less than or equal to the preset threshold, the text may be considered normal and passed directly to the normal structuring flow to obtain structured data.
By detecting the length of the text to be processed, the text processing method provided by this embodiment identifies suspected abnormal text at very low computational cost, which improves text processing efficiency and avoids wasting resources.
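The length gate of steps S211 to S213 reduces to a single comparison. In the sketch below, the function name `route` and its return labels are hypothetical; the default threshold of 2000 characters is one of the example values given in the text and would need tuning per data source.

```python
# Example threshold taken from the text; an assumed default, not a fixed value.
LENGTH_THRESHOLD = 2000


def route(text, threshold=LENGTH_THRESHOLD):
    """Steps S211-S213: pick the next stage from the text length alone."""
    if len(text) > threshold:
        # S212: long texts are suspected abnormal and go to identifier detection.
        return "detect_and_clean"
    # S213: short texts are treated as normal and structured directly.
    return "structure_directly"
```

Because `len` on a Python string is a constant-time operation, this check costs almost nothing compared with running identifier matching on every text.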
On the basis of the above exemplary embodiment, another embodiment of the present disclosure provides a text processing method in which step S130, performing structuring processing on the text to be processed to obtain structured data, may include the following steps, as shown in Fig. 3:
S331. Performing abnormal feature detection on the text to be processed to judge whether the text to be processed is a normal text or an abnormal text.
After the abnormal identifiers have been detected and cleaned, this step performs abnormal feature detection on the text to be processed in order to judge more accurately whether it is a normal text or an abnormal text. Like abnormal identifiers, the abnormal features used in this step may include garbled characters and meaningless strings of digits, letters, or special characters appearing in the text. In other words, the abnormal feature detection in this step partly overlaps with the abnormal identifier detection in step S110. The abnormal features may also include other features generated from historical data or preset, in particular abnormal parts that were missed during abnormal identifier detection and abnormal parts that were detected but not cleaned. An abnormal part may be missed for various reasons: for example, the abnormal identifier set may be incomplete, or omissions may occur when large volumes of text are processed. An abnormal part may likewise be detected but not cleaned for various reasons: for example, editing permissions set on the text to be processed may prevent the abnormal part from being deleted or replaced. In addition, the length of the text to be processed may be checked before the abnormal feature detection of this step. If the length exceeds a preset threshold, abnormal feature detection is performed; if not, this step can be skipped and the text structured directly, which saves computing resources and improves text processing efficiency.
S332. If the text to be processed is judged to be a normal text, performing structuring processing on the text to be processed to obtain structured data.
After the detection and the judgment in the step S331, if the judgment result is that the text to be processed is a normal text, the step performs the structuring processing on the text to be processed to obtain the structured data.
Based on this embodiment, step S130, performing structuring processing on the text to be processed to obtain structured data, may further include the following steps:
S333. If the text to be processed is judged to be an abnormal text, importing the abnormal text into an abnormal text set.
After the detection and judgment in step S331, if the judgment result is that the text to be processed is an abnormal text, the step will import the abnormal text into the abnormal text set. The abnormal text imported into the abnormal text set is temporarily not structured so as not to affect the structured processing process of the normal text.
S334. When the abnormal text set meets the preset condition, sending abnormal text prompt information.
Following step S333, abnormal texts continue to be imported into the abnormal text set as they are detected, and when the abnormal text set satisfies the preset condition, this step sends abnormal text prompt information. The preset condition may be that the number of abnormal texts in the set exceeds a preset threshold, that a certain time has elapsed since the first abnormal text was imported into the set, or that a batch of text processing has finished and its abnormal texts have been stored in the set; this exemplary embodiment places no particular limit on the condition. The prompt information can be any message that serves as a prompt or warning, for example an alarm email sent to the relevant business personnel. It may include the content of the abnormal texts, and may also include the path where they are saved. In some related technologies, a prompt is sent every time an abnormal text is detected, so prompts are too frequent and redundant, which harms the user experience. By setting a preset condition, this embodiment keeps the frequency and number of abnormal text prompts under control and improves the user experience.
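Steps S333 and S334 amount to batching abnormal texts and prompting once per batch. The sketch below is one possible arrangement, not the patented implementation: the class name `AbnormalTextSet`, its parameters, and the `notify` callback are hypothetical, and the count condition fires when the set reaches `max_count`, which is an assumed reading of the threshold.

```python
import time


class AbnormalTextSet:
    """Collect abnormal texts and send one prompt per batch (S333, S334)."""

    def __init__(self, max_count=100, max_age_seconds=3600.0,
                 notify=print, clock=time.monotonic):
        self.max_count = max_count              # assumed count threshold
        self.max_age_seconds = max_age_seconds  # assumed time threshold
        self.notify = notify                    # e.g. a function that sends an email
        self.clock = clock                      # injectable for testing
        self.texts = []
        self.first_added_at = None

    def add(self, text):
        """S333: import one abnormal text, then check the preset condition."""
        if not self.texts:
            self.first_added_at = self.clock()
        self.texts.append(text)
        if self._condition_met():
            self._send_prompt()

    def _condition_met(self):
        too_many = len(self.texts) >= self.max_count
        too_old = self.clock() - self.first_added_at >= self.max_age_seconds
        return too_many or too_old

    def _send_prompt(self):
        """S334: hand the whole batch to the notifier and start a new batch."""
        batch, self.texts = self.texts, []
        self.first_added_at = None
        self.notify(batch)
```

Passing the whole batch to `notify` lets the receiver include the text contents or a save path in the prompt, and keeps the texts available for later analysis.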
Preferably, after the abnormal text prompt information is sent, the text processing method provided in this embodiment further includes the step of analyzing the abnormal texts in the abnormal text set and acquiring abnormal identifiers to form an abnormal identifier set. This step analyzes each abnormal text, finds the abnormal parts in it, and acquires abnormal identifiers from those parts to form the abnormal identifier set. When a new abnormal identifier is obtained later, it can be added to the set. The abnormal identifier set supports the abnormal identifier detection performed in step S110. As the set is continuously enriched and refined, abnormal identifiers are detected and cleaned more thoroughly, fewer texts are judged abnormal, and text processing efficiency improves greatly.
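The source does not say how the abnormal texts are analyzed, and in practice this may be a manual review. As one hedged possibility, the sketch below harvests long runs of non-Chinese characters as candidate identifiers. The function name `update_identifier_set` and the run length of 11 are assumptions, and the character range U+4E00 to U+9FFF covers only the basic CJK Unified Ideographs block.

```python
import re

# Assumed heuristic: a run of 11 or more characters outside the basic
# CJK Unified Ideographs block is a candidate abnormal identifier.
NON_CHINESE_RUN = re.compile(r"[^\u4e00-\u9fff]{11,}")


def update_identifier_set(identifier_set, abnormal_texts):
    """Add every long non-Chinese run found in the texts to the set."""
    for text in abnormal_texts:
        for run in NON_CHINESE_RUN.findall(text):
            identifier_set.add(run)
    return identifier_set
```

Using a `set` means an identifier seen in many abnormal texts is stored once, and the same function serves both to build the initial set and to supplement it later.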
Referring to Fig. 4, in another exemplary embodiment of the present disclosure, step S331, performing abnormal feature detection on the text to be processed to judge whether the text to be processed is a normal text or an abnormal text, may further include the following steps:
S3311. Detecting whether the text to be processed contains continuous non-Chinese fields.
The method comprises the steps of firstly detecting whether a text to be processed contains continuous non-Chinese fields, wherein the continuous non-Chinese fields can be continuous fields composed of non-Chinese characters such as numbers, english or special characters.
S3312. If the text to be processed is detected to contain continuous non-Chinese fields, judging that the text to be processed is an abnormal text.
If step S3311 detects that the text to be processed contains continuous non-Chinese fields, this step may judge that the text to be processed is an abnormal text.
S3313. If the text to be processed does not contain continuous non-Chinese fields, judging that the text to be processed is a normal text.
If step S3311 does not detect that the text to be processed contains consecutive non-Chinese fields, then the present step may determine that the text to be processed is normal text.
Further, in this embodiment, a length limit may be placed on continuous non-Chinese fields. For example, a run of non-Chinese characters longer than 10 characters may be treated as a continuous non-Chinese field, while a run shorter than 10 characters may be treated as a normal text field.
The embodiment utilizes the continuous non-Chinese fields to detect the abnormal characteristics, can adapt to the language characteristics of Chinese medical texts, and can efficiently finish the detection of the abnormal texts.
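Steps S3311 to S3313, together with the 10-character example above, can be sketched with a single regular expression. The function names and the `max_normal_run` parameter are hypothetical. "Chinese" is approximated here by the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so full-width punctuation counts as non-Chinese. The text leaves a run of exactly 10 characters unspecified; this sketch treats it as normal.

```python
import re


def has_continuous_non_chinese(text, max_normal_run=10):
    """S3311: True if text has a non-Chinese run longer than max_normal_run."""
    pattern = re.compile(r"[^\u4e00-\u9fff]{%d,}" % (max_normal_run + 1))
    return pattern.search(text) is not None


def classify(text, max_normal_run=10):
    """S3312 and S3313: label the text as abnormal or normal."""
    if has_continuous_non_chinese(text, max_normal_run):
        return "abnormal"
    return "normal"
```

Ordinary clinical values stay below the limit: in `"患者体温38.5℃，心率90次/分"` the longest non-Chinese run is the six characters `38.5℃，`, so the text is classified as normal.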
It should be noted that while the above exemplary embodiments describe the steps of the methods in this disclosure in a particular order, this does not require or imply that the steps must be performed in that particular order or that all of the steps must be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
In an exemplary embodiment of the present disclosure, there is also provided a text processing apparatus. Referring to Fig. 5, the text processing apparatus 50 may mainly include a detection module 51, a cleaning module 52, and a processing module 53. The detection module 51 is configured to detect whether the text to be processed contains an abnormal identifier. The cleaning module 52 is configured to, if the text to be processed is detected to contain an abnormal identifier, perform text cleaning on the part of the text that contains the abnormal identifier. The processing module 53 is configured to perform structuring processing on the text to be processed to obtain structured data.
The specific details of the above text processing device have been described in detail in the corresponding text processing method, and thus will not be described here again.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, can implement the above-described text processing method of the present disclosure. In some possible implementations, aspects of the disclosure may also be implemented in the form of a program product including program code; the program product may be stored on a non-volatile storage medium (which may be a CD-ROM, a U-disk or a removable hard disk, etc.) or on a network; when the program product is run on a computing device (which may be a personal computer, a server, a terminal device or a network device, etc.), the program code is for causing the computing device to carry out the method steps in the above-mentioned exemplary embodiments of the present disclosure.
Referring to Fig. 6, a program product 60 for implementing the above-described methods according to embodiments of the present disclosure may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a computing device (e.g., a personal computer, a server, a terminal device, or a network device). However, the program product of the present disclosure is not limited thereto. In the present exemplary embodiment, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may take the form of any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium.
The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the C programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In cases involving remote computing devices, the remote computing devices may be connected to the user computing devices through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), etc.; alternatively, it may be connected to an external computing device, for example, using an Internet service provider to connect through the Internet.
In an exemplary embodiment of the present disclosure, there is also provided an electronic device including at least one processor and at least one memory for storing executable instructions of the processor; wherein the processor is configured to perform the method steps in the above-described exemplary embodiments of the present disclosure via execution of the executable instructions.
An electronic device 700 in the present exemplary embodiment is described below with reference to fig. 7. The electronic device 700 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
Referring to fig. 7, the electronic device 700 is embodied in the form of a general purpose computing device. Components of electronic device 700 may include, but are not limited to: at least one processing unit 710, at least one storage unit 720, a bus 730 connecting the different system components (including the processing unit 710 and the storage unit 720), and a display unit 740.
Wherein the storage unit 720 stores program code executable by the processing unit 710 such that the processing unit 710 performs the method steps in the above-described exemplary embodiments of the present disclosure.
The storage unit 720 may include readable media in the form of volatile memory units, such as a random access memory unit 721 (RAM) and/or a cache memory unit 722, and may further include a read-only memory unit 723 (ROM).
The storage unit 720 may also include a program/utility 724 having a set (at least one) of program modules 725, including but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 730 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit bus, or a local bus using any of a variety of bus architectures.
The electronic device 700 may also communicate with one or more external devices 800 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that allow a user to interact with the electronic device 700, and/or any device (e.g., router, modem, etc.) that allows the electronic device 700 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 750. Also, electronic device 700 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 760. As shown in fig. 7, network adapter 760 may communicate with other modules of electronic device 700 via bus 730. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 700, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, any of which may be referred to herein as a "circuit," "module," or "system."
The disclosed embodiments are described below in connection with an application scenario. Referring to fig. 8, the main flow of text processing in the application scenario is as follows:
1. Abnormal medical texts are typically very long runs of digits, English letters, or special characters, so text length detection is performed first. When the length of a medical text is greater than a certain threshold, the text is treated as a suspected abnormal text. If the length does not exceed the threshold, the text is considered normal and enters the normal structuring flow.
2. Next, the suspected abnormal text is cleaned. The text is checked for abnormal identifiers, and if any are found, cyclic cleaning is performed: checking and cleaning are repeated until no abnormal identifier can be found in the text, at which point cleaning is considered complete.
3. True abnormal feature detection is then performed on the cleaned suspected abnormal text by detecting continuous non-Chinese characters. If the text is detected to be abnormal, it enters the abnormal text set; if it is normal, it undergoes normal structuring.
4. Finally, after all texts have been structured, the abnormal text set is written to disk and its path is recorded, and an email alert containing part of the abnormal text content and the abnormal text path is sent to the relevant responsible persons.
5. The responsible person reviews the corresponding abnormal texts and abstracts their abnormal identifiers into the abnormal identifier set, which in turn supplies cleaning features for suspected abnormal texts.
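Steps 1 to 3 of the flow can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the threshold values, the placeholder mark, the two regular expressions standing in for the abnormal identifier set, and the names `clean_cyclically` and `classify` are all hypothetical. The sketch also assumes that "non-Chinese" means any character outside the basic CJK Unified Ideographs block and that placeholder marks are skipped when measuring runs.

```python
import re

# Every constant below is an illustrative assumption, not a value from the disclosure.
LENGTH_THRESHOLD = 200        # hypothetical length threshold, in characters
MAX_NON_CHINESE_RUN = 30      # hypothetical longest tolerated non-Chinese run
PLACEHOLDER = "[ABN]"         # hypothetical "specific identifiable mark"

# Hypothetical abnormal identifier set; in the scenario it grows from reviewed texts.
ABNORMAL_IDENTIFIERS = [
    re.compile(r"(?:&#x?[0-9A-Fa-f]+;){3,}"),  # runs of character entities
    re.compile(r"[#$%^*~|\\]{5,}"),            # runs of special characters
]

# A run of characters outside the basic CJK block that exceeds the limit.
NON_CHINESE_RUN = re.compile(r"[^\u4e00-\u9fff]{%d,}" % (MAX_NON_CHINESE_RUN + 1))


def clean_cyclically(text: str) -> str:
    """Repeat check-and-clean until no abnormal identifier is found."""
    while any(p.search(text) for p in ABNORMAL_IDENTIFIERS):
        for p in ABNORMAL_IDENTIFIERS:
            text = p.sub(PLACEHOLDER, text)
    return text


def classify(text: str) -> str:
    """Return "normal" or "abnormal" for one text."""
    # Step 1: texts that do not exceed the threshold go straight to structuring.
    if len(text) <= LENGTH_THRESHOLD:
        return "normal"
    # Step 2: cyclic cleaning of the suspected abnormal text.
    cleaned = clean_cyclically(text)
    # Step 3: look for a long continuous non-Chinese run between placeholders.
    segments = cleaned.split(PLACEHOLDER)
    if any(NON_CHINESE_RUN.search(s) for s in segments):
        return "abnormal"
    return "normal"
```

The loop terminates because the placeholder matches none of the patterns, so each pass can only reduce the number of matches. A text classified as abnormal would then be added to the abnormal text set rather than structured.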
In this application scenario, length detection, data cleaning, and related functions are added to abnormal text detection, and a centralized alarm method is used for the abnormal text alarm mechanism, which improves the user experience.
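The centralized alarm idea, one alert per batch of abnormal texts instead of one per text, might look like the sketch below. It is an illustration only: the class name `AbnormalTextSet`, the batch-size condition, the JSON file layout, and the `notify` callback (which stands in for sending an email) are assumptions that do not come from the disclosure.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


class AbnormalTextSet:
    """Collects abnormal texts and raises one alert per batch (hypothetical design)."""

    def __init__(self, out_dir: Path, notify: Callable[[str], None], batch_size: int = 3):
        self.out_dir = out_dir
        self.notify = notify          # stand-in for an email sender
        self.batch_size = batch_size  # hypothetical "preset condition"
        self.texts: List[str] = []
        self.batches_written = 0

    def add(self, text: str) -> None:
        """Add one abnormal text and alert once the batch is full."""
        self.texts.append(text)
        if len(self.texts) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Write the current batch to disk, record its path, and send one alert."""
        if not self.texts:
            return
        self.batches_written += 1
        path = self.out_dir / f"abnormal_{self.batches_written:04d}.json"
        path.write_text(json.dumps(self.texts, ensure_ascii=False), encoding="utf-8")
        # Only a short preview of each text goes into the alert message.
        preview = "; ".join(t[:20] for t in self.texts)
        self.notify(f"{len(self.texts)} abnormal texts written to {path}. Preview: {preview}")
        self.texts = []


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        collector = AbnormalTextSet(Path(tmp), print, batch_size=2)
        collector.add("a1b2" * 10)
        collector.add("#" * 40)  # second text fills the batch and triggers one alert
```

Batching keeps the responsible person from receiving one message per abnormal text. The identifiers that person extracts from the saved file would then be added to the abnormal identifier set used by the cleaning step.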
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
The above described features, structures or characteristics may be combined in any suitable manner in one or more embodiments, such as the possible, interchangeable features as discussed in connection with the various embodiments. In the above description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, materials, and so forth. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.

Claims (7)

1. A text processing method, comprising:
detecting the length of a text to be processed, and judging whether the length is larger than a preset threshold value or not;
if the length is larger than the preset threshold value, detecting whether the text to be processed contains an abnormal identifier, wherein the abnormal identifier comprises garbled characters, or a meaningless string of numbers, letters or special characters in the text;
if the text to be processed is detected to contain the abnormal identifier, performing cyclic text cleaning on the abnormal identifier until the abnormal identifier is not detected in the text to be processed; the cleaning replaces the abnormal identifier with a specific identifiable mark;
detecting abnormal characteristics of the cleaned text to be processed to judge whether the text to be processed is a normal text or an abnormal text; the abnormal characteristics comprise characteristics generated according to historical data, preset characteristics and the abnormal identifier;
if the text to be processed is judged to be a normal text, carrying out structuring processing on the text to be processed to obtain structured data;
wherein the structuring of the text to be processed to obtain structured data further includes:
if the text to be processed is judged to be the abnormal text, the abnormal text is imported into an abnormal text set;
and when the abnormal text set meets the preset condition, sending abnormal text prompt information.
2. The text processing method according to claim 1, wherein the detecting whether the text to be processed contains the abnormal identifier further comprises:
if the length is less than or equal to the preset threshold value, performing structuring processing on the text to be processed to obtain structured data.
3. The text processing method according to claim 1, wherein the performing abnormal feature detection on the cleaned text to be processed to determine whether the text to be processed is a normal text or an abnormal text includes:
detecting whether the text to be processed contains continuous non-Chinese fields or not;
if the text to be processed is detected to contain continuous non-Chinese fields, judging that the text to be processed is an abnormal text;
and if the text to be processed does not contain continuous non-Chinese fields, judging that the text to be processed is a normal text.
4. A text processing method according to claim 3, wherein after sending the abnormal text prompt, the method further comprises:
and analyzing the abnormal text in the abnormal text set, and acquiring an abnormal identifier to form an abnormal identifier set.
5. A text processing apparatus, comprising:
the detection module is configured to detect the length of the text to be processed and judge whether the length is greater than a preset threshold value; and, if the length is larger than the preset threshold value, to detect whether the text to be processed contains an abnormal identifier, wherein the abnormal identifier comprises garbled characters, or a meaningless string of numbers, letters or special characters in the text;
the cleaning module is configured to, if the text to be processed contains the abnormal identifier, perform cyclic text cleaning on the part of the text to be processed that contains the abnormal identifier until the abnormal identifier is not detected in the text to be processed; the cleaning replaces the abnormal identifier with a specific identifiable mark;
the processing module is configured to detect abnormal characteristics of the cleaned text to be processed so as to judge whether the text to be processed is a normal text or an abnormal text; the abnormal characteristics comprise characteristics generated according to historical data, preset characteristics and the abnormal identifier; if the text to be processed is judged to be a normal text, carrying out structuring processing on the text to be processed to obtain structured data; wherein the structuring of the text to be processed to obtain structured data further includes: if the text to be processed is judged to be the abnormal text, the abnormal text is imported into an abnormal text set; and when the abnormal text set meets the preset condition, sending abnormal text prompt information.
6. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the text processing method of any of claims 1-4.
7. An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the text processing method of any of claims 1-4 via execution of the executable instructions.
CN201811413346.9A 2018-11-23 2018-11-23 Text processing method and device, storage medium and electronic equipment Active CN109284483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811413346.9A CN109284483B (en) 2018-11-23 2018-11-23 Text processing method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811413346.9A CN109284483B (en) 2018-11-23 2018-11-23 Text processing method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN109284483A CN109284483A (en) 2019-01-29
CN109284483B true CN109284483B (en) 2023-06-30

Family

ID=65172631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811413346.9A Active CN109284483B (en) 2018-11-23 2018-11-23 Text processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN109284483B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112114987B (en) * 2019-06-20 2024-04-09 腾讯科技(深圳)有限公司 Abnormality detection method and device for operation environment, intelligent terminal and storage medium
CN112397159B (en) * 2019-08-19 2024-03-22 金色熊猫有限公司 Automatic entry method and device for clinical test report, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10073828B2 (en) * 2015-02-27 2018-09-11 Nuance Communications, Inc. Updating language databases using crowd-sourced input
CN105260357B (en) * 2015-10-14 2018-03-30 北京京东尚科信息技术有限公司 Sensitive word inspection method and equipment based on Hash digraph
CN106445915B (en) * 2016-09-14 2020-04-28 安徽科大讯飞医疗信息技术有限公司 New word discovery method and device
CN107657060B (en) * 2017-10-20 2020-06-30 中电科新型智慧城市研究院有限公司 Feature optimization method based on semi-structured text classification
CN108228851A (en) * 2018-01-10 2018-06-29 北京奇艺世纪科技有限公司 A kind of lists of keywords method of adjustment, device and electronic equipment

Also Published As

Publication number Publication date
CN109284483A (en) 2019-01-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant