CN111161861A - Short text data processing method and device for hospital logistics operation and maintenance - Google Patents
- Publication number
- CN111161861A (application number CN201911408217.5A)
- Authority
- CN
- China
- Prior art keywords
- word
- word segmentation
- bank
- effective
- segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H40/00—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
- G16H40/20—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the management or administration of healthcare resources or facilities, e.g. managing hospital staff or surgery rooms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Business, Economics & Management (AREA)
- Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Public Health (AREA)
- Primary Health Care (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Epidemiology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a short text data processing method and device, computer equipment and a storage medium for hospital logistics operation and maintenance. A classification word bank is determined and recombined to obtain a preliminary word segmentation word bank; the corpus is segmented, word frequency statistics are performed on each keyword after word segmentation, and the statistical results are added to the preliminary word segmentation word bank to obtain a user-defined word segmentation word bank. Full-mode word segmentation and cleaning are performed on the short text to be processed to obtain the auxiliary information word segmentation and a plurality of effective items. Each effective item is fuzzy-matched in the classification word bank to obtain its initial matching results, and the matching result with the highest word frequency is selected as its final matching result. The short text to be processed is thereby segmented accurately according to its semantics, and its effective text information is determined from the auxiliary information word segmentation and the final matching results, which improves the accuracy of the determined effective text information.
Description
Technical Field
The invention relates to the technical field of signal processing, in particular to a short text data processing method and device for hospital logistics operation and maintenance, computer equipment and a storage medium.
Background
In large and very large functional building complexes, the automated instruments, equipment and systems fail at irregular intervals, so fault reporting and repair is frequent and important daily work. To keep the complex running normally, a professional logistics team is usually set up to carry out maintenance on the whole system, and the process involves several links such as discovering, reporting, recording and evaluating maintenance points. The traditional approach records the key information of each link by hand, which is obviously cumbersome. Although software has replaced most of this work as IT has developed, the main change so far is limited to digitizing highly mechanical tasks. Even the currently popular artificial intelligence technology is far from removing the need for manual intervention in work that requires subjective judgment; natural language processing in particular is held back by the inherent difficulty of the field, and the related technology has developed slowly. Every link of hospital logistics operation and maintenance involves a considerable amount of natural language and text processing, so sufficiently mature technology would bring large efficiency gains and cost reductions to this field.
In practice, the key text information in hospital logistics operation and maintenance has inherent characteristics, such as a limited set of objects, relatively simple sentence structure and semantics, and limited sentence length. These characteristics make automatic text processing feasible in this specific scenario. For example, the phrase "hose burst under female changing room wash basin at right side of first endoscope center in east courtyard" needs to be segmented and extracted along dimensions such as "courtyard: [], area: [], department: [], object: [], failure: []". This is an extremely simple task for a human but an extremely complicated one for a computer. Existing approaches generally fall into the following types:
Regular-expression matching: a fixed pattern is built from defined characters and their combinations and is applied recursively to the text to be processed; a match succeeds when the text or one of its sub-fragments satisfies the pattern.
Keyword extraction based on statistical methods: the original text is first divided into segments, a frequency or weight is then calculated for each word in one of several ways, and the words with the highest scores are taken as the keywords. Examples include the TextRank algorithm, the RAKE algorithm and the TF-IWF algorithm.
Keyword extraction based on machine learning: by learning mode, these methods mainly comprise supervised, semi-supervised and unsupervised keyword extraction. Mathematically, they can be divided into methods based on statistical features, methods based on a word graph model, methods based on a topic model, and so on.
Keyword extraction is one of the core tasks of text processing, especially short text data processing, and is an important branch of natural language processing. However, existing keyword extraction techniques struggle to segment and extract precise keywords from short text. The main problems are as follows:
Regular-expression matching: first, the method must satisfy a fixed pattern and suits only exact-match scenarios; it is too rigid for extracting keywords from short text whose patterns vary in complex ways, so it cannot meet practical needs. Second, the method requires an exact "flag" at matching time, which further limits its application.
Statistical methods: these extract keywords well from long text but hardly work on short text. In long text, keywords usually occur far more often than other words, whereas a short text or a single sentence rarely repeats a keyword, so the factual basis for statistics is lost.
Machine learning methods: first, they need a large corpus for training, and deep learning built on neural networks is especially demanding about the volume of training data. Second, their core algorithms usually rely on some text model or language representation model, such as a grammar network graph, or compute some parameter, such as a clustering coefficient, and these features are not pronounced in very short text. Third, their accuracy in keyword classification is insufficient and cannot meet practical needs.
The aforementioned methods also share two problems. First, they cannot judge the category of an extracted keyword, for example that "operating room" belongs to "room" rather than "floor". Second, they cannot aggregate the extracted keywords back into larger units to form meaningful phrases. For example, for the sentence about the burst hose below the washbasin in the female changing room on the right side of the first endoscope center in the eastern area, conventional word segmentation yields isolated words, whereas what is needed is a grouping into phrases such as "eastern area", "right side of the first endoscope center" and "below the female washbasin". Conventional short text data processing schemes therefore often suffer from low accuracy.
Disclosure of Invention
In order to solve the problems, the invention provides a short text data processing method and device, computer equipment and a storage medium for hospital logistics operation and maintenance.
In order to achieve the purpose of the invention, the short text data processing method for hospital logistics operation and maintenance is provided, and comprises the following steps:
S10, determining a classified word bank according to the corpus; the classified word bank is used for describing the word category of each word included in the corpus;
S20, recombining the classified word bank according to a preset word segmentation mode to obtain a preliminary word segmentation word bank, performing word segmentation processing on the corpus according to the preliminary word segmentation word bank, performing word frequency statistics on each keyword after word segmentation, and adding the statistical results into the preliminary word segmentation word bank respectively to obtain a user-defined word segmentation word bank;
S30, performing full-mode word segmentation on the short text to be processed by adopting the user-defined word segmentation word bank, cleaning the word segmentation results of the full-mode word segmentation to obtain the auxiliary information word segmentation and a plurality of effective items of the short text to be processed, matching each effective item in the classification word bank in a fuzzy matching mode to obtain the initial matching results of each effective item, and selecting the matching result with the highest word frequency from the initial matching results of each effective item as the final matching result of that effective item;
and S40, determining effective text information of the short text to be processed according to the auxiliary information word segmentation and the final matching result.
In one embodiment, the determining the effective text information of the short text to be processed according to the auxiliary information word segmentation and the final matching result includes:
determining the arrangement position of each final matching result, and determining an effective word sequence according to the arrangement position;
determining the position of the auxiliary information word in the effective word sequence;
and merging the auxiliary information word segmentation and the effective word sequence according to the position of the auxiliary information word segmentation in the effective word sequence.
In one embodiment, before determining the classified lexicon according to the corpus, the method further comprises:
collecting sentences about the description object within a set time period to obtain sentence sources, and constructing the corpus from the sentence sources.
As one example, the descriptive object includes a hospital.
In one embodiment, after recombining the classified word bank according to a preset word segmentation mode to obtain a preliminary word segmentation word bank, performing word segmentation processing on the corpus according to the preliminary word segmentation word bank, performing word frequency statistics on each segmented keyword, and adding the statistical results into the preliminary word segmentation word bank respectively to obtain the user-defined word segmentation word bank, the method further includes:
acquiring a public stop word bank, and constructing a custom stop word bank according to the public stop word bank and the classification word bank; the custom stop word bank is used for cleaning the word segmentation results of the full-mode word segmentation.
As one embodiment, the cleaning the word segmentation result of the full mode word segmentation includes:
identifying stop words in the word segmentation results of the full-mode word segmentation by adopting the custom stop word bank, removing the identified stop words and then the repeated words from the word segmentation results to obtain a plurality of effective items, and determining the identified stop words as the auxiliary information word segmentation.
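The cleaning step can be pictured with a minimal Python sketch. The function name `clean_tokens`, the sample tokens and the contents of the stop word set are hypothetical illustrations; the sketch only assumes, as described above, that identified stop words are kept as auxiliary information and that repeated words are dropped from the remaining items.

```python
def clean_tokens(tokens, stop_words):
    """Split full-mode tokens into effective items and auxiliary-information tokens."""
    effective, auxiliary, seen = [], [], set()
    for tok in tokens:
        if tok in stop_words:
            # Identified stop words are kept as auxiliary information.
            auxiliary.append(tok)
        elif tok not in seen:
            # Keep only the first occurrence of each remaining token.
            seen.add(tok)
            effective.append(tok)
    return effective, auxiliary


# Hypothetical custom stop word bank: numerals and orientation words.
custom_stop_words = {"6", "2", "northeast corner"}
tokens = ["6", "building", "2", "floor", "northeast corner",
          "changing room", "changing room", "water leakage"]

effective, auxiliary = clean_tokens(tokens, custom_stop_words)
print(effective)
print(auxiliary)
```

Order is preserved in both lists so that later steps can still reason about where each auxiliary token sat relative to the effective items.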
A short text data processing device for hospital logistics operation and maintenance, comprising:
the first determining module is used for determining a classified word bank according to the corpus; the classified word bank is used for describing the word category of each word included in the corpus;
the recombination module is used for recombining the classified word bank according to a preset word segmentation mode to obtain a preliminary word segmentation word bank, performing word segmentation processing on the corpus according to the preliminary word segmentation word bank, performing word frequency statistics on each keyword after word segmentation, and adding the statistical results into the preliminary word segmentation word bank respectively to obtain a user-defined word segmentation word bank;
the word segmentation module is used for performing full-mode word segmentation on the short text to be processed by adopting the self-defined word segmentation word bank, cleaning word segmentation results of the full-mode word segmentation to obtain auxiliary information word segmentation and a plurality of effective items of the short text to be processed, matching each effective item in the classification word bank in a fuzzy matching mode to obtain an initial matching result of each effective item, and selecting a matching result with the highest word frequency from the initial matching results of each effective item as a final matching result of each effective item;
and the second determining module is used for determining effective text information of the short text to be processed according to the auxiliary information word segmentation and the final matching result.
In one embodiment, the second determination module is to:
determining the arrangement position of each final matching result, and determining an effective word sequence according to the arrangement position;
determining the position of the auxiliary information word in the effective word sequence;
and merging the auxiliary information word segmentation and the effective word sequence according to the position of the auxiliary information word segmentation in the effective word sequence.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the short text data processing method for hospital logistics operation and maintenance of any of the above embodiments when executing the computer program.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the short text data processing method for hospital logistics operation and maintenance of any of the above embodiments.
The short text data processing method, device, computer equipment and storage medium for hospital logistics operation and maintenance determine a classification word bank according to a corpus and recombine it according to a preset word segmentation mode to obtain a preliminary word segmentation word bank. The corpus is segmented according to the preliminary word segmentation word bank, word frequency statistics are performed on each keyword after word segmentation, and the statistical results are added to the preliminary word segmentation word bank to obtain a user-defined word segmentation word bank. Full-mode word segmentation is performed on the short text to be processed using the user-defined word segmentation word bank, and the word segmentation results are cleaned to obtain the auxiliary information word segmentation and a plurality of effective items. Each effective item is matched in the classification word bank by fuzzy matching to obtain its initial matching results, and the matching result with the highest word frequency is selected as its final matching result. The short text to be processed is thus segmented accurately according to its semantics, its effective text information is determined from the auxiliary information word segmentation and the final matching results, and the accuracy of the determined effective text information is improved. The method can markedly improve the precision of key information extraction, keyword extraction, classification statistics and related data analysis for short text in hospital logistics, and provides a methodological basis for making part of the logistics business intelligent and for upgrading its IT.
Drawings
FIG. 1 is a flow diagram of a short text data processing method for hospital logistics operation and maintenance, according to an embodiment;
FIG. 2 is a diagram of a custom thesaurus of segmented words, according to an embodiment;
FIG. 3 is a diagram illustrating an exemplary process for obtaining a matching result;
FIG. 4 is a diagram illustrating an arrangement of primary and secondary information according to one embodiment;
FIG. 5 is a diagram illustrating assisted information word segmentation and merging, according to an embodiment;
FIG. 6 is a diagram illustrating assisted information word segmentation and merging, according to an embodiment;
FIG. 7 is a diagram illustrating assisted information word segmentation and merging, according to an embodiment;
FIG. 8 is a schematic diagram of a short text data processing device for hospital logistics operation and maintenance according to an embodiment;
FIG. 9 is a schematic diagram of a computer device of an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The short text data processing method for the hospital logistics operation and maintenance can be applied to a short text data processing terminal for the hospital logistics operation and maintenance. The short text data processing terminal can determine a classified word bank according to a corpus, reorganize the classified word bank according to a preset word segmentation mode to obtain a preliminary word segmentation word bank, perform word segmentation processing on the corpus according to the preliminary word segmentation word bank, perform word frequency statistics on each keyword after word segmentation, add statistical results into the preliminary word segmentation word bank respectively to obtain a user-defined word segmentation word bank, perform full-mode word segmentation on a short text to be processed by adopting the user-defined word segmentation word bank, clean word segmentation results of the full-mode word segmentation to obtain auxiliary information word segmentation and a plurality of effective items of the short text to be processed, match each effective item in the classified word bank in a fuzzy matching mode to obtain an initial matching result of each effective item, and select a matching result with the highest word frequency from the initial matching results of each effective item as a final matching result of each effective item, and determining effective text information of the short text to be processed according to the auxiliary information word segmentation and the final matching result so as to improve the accuracy of the determined effective text information of the short text to be processed. The short text data processing terminal can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices.
In one embodiment, as shown in fig. 1, a short text data processing method for hospital logistics operation and maintenance is provided, which is described by taking an example that the method is applied to a short text data processing terminal, and includes the following steps:
S10, determining a classified word bank according to the corpus; the classified word bank is used for describing the word categories of all words included in the corpus.
Specifically, the classification word bank can be customized according to the description object of the corpus. The classification word bank serves two purposes: it determines the category of each extracted keyword, and it resolves possible word-order confusion in the original short text. For example, for the short text "2 lamps broken, second floor, east courtyard area", the result of word segmentation and recombination is "2 lamps broken \ second floor \ east courtyard area"; the three parts clearly belong to different dimensions, and once the category of each dimension is specified, the sentence can be rearranged into the conventional order. Building the classification word bank is in effect a labeling process. The classification word bank can be obtained in many ways: it can be compiled manually from the corpus, or keyword banks for the relevant sub-categories can be downloaded directly and combined. For example, the key information in a hospital logistics operation and maintenance work order can be divided into word categories such as region, floor, department, object, phenomenon and direction. In one example, part of the content of the classification word bank is shown in Table 1.
TABLE 1 schematic content of the thesaurus
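As a rough illustration of how such a classified word bank might be held in memory, the Python sketch below maps each word category to its words, inverts the mapping for category lookup, and uses the categories to restore a conventional word order. All entries, the category order and the names `CLASSIFIED_WORD_BANK`, `WORD_TO_CATEGORY`, `CANONICAL_ORDER` and `reorder` are hypothetical and are not taken from the patent's table.

```python
# Hypothetical entries; a real word bank would be labeled from the corpus.
CLASSIFIED_WORD_BANK = {
    "area": ["east courtyard area"],
    "floor": ["second floor"],
    "department": ["endoscope center"],
    "room": ["operating room", "changing room"],
    "object": ["lamp", "hose"],
    "phenomenon": ["broken", "burst", "water leakage"],
}

# Inverted index: word -> category, used to judge a keyword's category.
WORD_TO_CATEGORY = {
    word: category
    for category, words in CLASSIFIED_WORD_BANK.items()
    for word in words
}

# Assumed conventional order of dimensions in a work order.
CANONICAL_ORDER = ["area", "floor", "department", "room", "object", "phenomenon"]


def reorder(words):
    """Sort recognized words into the conventional dimension order (stable sort)."""
    return sorted(words, key=lambda w: CANONICAL_ORDER.index(WORD_TO_CATEGORY[w]))


reordered = reorder(["lamp", "broken", "second floor", "east courtyard area"])
print(WORD_TO_CATEGORY["operating room"])
print(reordered)
```

The inverted index answers the category question ("operating room" is a "room", not a "floor"), and the stable sort shows how labeled categories allow a scrambled work order to be rearranged.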
S20, recombining the classified word bank according to a preset word segmentation mode to obtain a preliminary word segmentation word bank, performing word segmentation processing on the corpus according to the preliminary word segmentation word bank, performing word frequency statistics on each keyword after word segmentation, and adding the statistical results into the preliminary word segmentation word bank respectively to obtain a user-defined word segmentation word bank.
The preset word segmentation mode can be set according to the description object of the corpus.
In one example, the above steps are explained with reference to the classified word bank shown in Table 1; the final word bank, shown in fig. 2, is the user-defined word segmentation word bank:
Because of the special requirements of the field in which the description object is located, the word bank of a conventional word segmentation method cannot adapt to the professional field, so a dedicated word segmentation word bank needs to be established. The word segmentation word bank is formed in two steps:
(1) the classified word bank is reorganized according to the requirements of the word segmentation method; the resulting word bank does not yet contain word frequencies, but it already provides a basic custom word segmentation capability, only with somewhat lower word segmentation precision;
(2) the whole corpus is segmented using the preliminary word segmentation word bank obtained in step (1), word frequency statistics are performed on each keyword after word segmentation, and the final statistical results are added to the word segmentation word bank obtained in step (1) to obtain the complete user-defined word segmentation word bank.
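The two steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a simple forward maximum matching segmenter stands in for whichever word segmentation tool is actually used, and the word bank entries, the corpus sentences and the names `forward_max_match` and `build_custom_word_bank` are hypothetical. Chinese strings are used because segmentation operates on unspaced text; glosses are given in comments.

```python
from collections import Counter


def forward_max_match(text, vocab):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max(map(len, vocab))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_custom_word_bank(classified_word_bank, corpus):
    # Step (1): flatten the classified word bank; no word frequencies yet.
    preliminary = {w for words in classified_word_bank.values() for w in words}
    # Step (2): segment the corpus with it and count each known keyword.
    counts = Counter(
        tok
        for sentence in corpus
        for tok in forward_max_match(sentence, preliminary)
        if tok in preliminary
    )
    return {word: counts[word] for word in preliminary}


classified = {
    "area": ["东院区"],                      # east courtyard area
    "room": ["更衣室"],                      # changing room
    "object": ["灯", "软管"],                # lamp, hose
    "phenomenon": ["坏", "漏水", "爆裂"],    # broken, water leakage, burst
}
corpus = ["东院区更衣室灯坏", "更衣室漏水", "东院区软管爆裂", "更衣室灯坏"]

custom_word_bank = build_custom_word_bank(classified, corpus)
print(custom_word_bank)
```

The resulting word-to-frequency mapping has the shape that dictionary-based segmenters commonly accept as a user dictionary, which is why the frequencies are attached to the same words that came from the classified word bank.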
S30, performing full-mode word segmentation on the short text to be processed by adopting the self-defined word segmentation word bank, cleaning word segmentation results of the full-mode word segmentation to obtain auxiliary information word segmentation and a plurality of effective items of the short text to be processed, matching each effective item in the classification word bank in a fuzzy matching mode to obtain initial matching results of each effective item, and selecting the matching result with the highest word frequency from the initial matching results of each effective item as the final matching result of each effective item.
In the full-mode word segmentation process, an open-source word segmentation tool (such as jieba, SnowNLP, THULAC or NLPIR) can be used. The auxiliary information word segmentation may include words of orientation, ordinal number, numeral and letter.
Specifically, this step performs full-mode word segmentation on the original short text to be processed. Note that the word segmentation must use the user-defined word segmentation word bank obtained in S20. Stop words and repeated words are then removed from the word segmentation results to filter out results without practical meaning and to screen the effective items. The cleaned word segments are matched in the classified word bank by fuzzy matching, the matching results are ranked by word frequency, and the highest-ranked one is taken as the final matching result.
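A minimal Python sketch of full-mode segmentation followed by fuzzy matching is given below. It is an illustration under assumptions, not the patent's implementation: `full_mode_cut` is a hand-written stand-in that lists every word bank entry found at any position, the standard library's `difflib.get_close_matches` stands in for the unspecified fuzzy matching method, and the word frequencies are invented.

```python
import difflib


def full_mode_cut(text, vocab):
    """Return every word bank entry found at any position (overlaps allowed)."""
    max_len = max(map(len, vocab))
    return [
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, min(len(text), i + max_len) + 1)
        if text[i:j] in vocab
    ]


def best_match(item, word_freq, cutoff=0.6):
    """Fuzzy-match an effective item, then keep the candidate with the highest frequency."""
    candidates = difflib.get_close_matches(item, list(word_freq), n=5, cutoff=cutoff)
    if not candidates:
        return None
    return max(candidates, key=lambda w: word_freq[w])


# Hypothetical user-defined word bank: word -> word frequency.
word_freq = {
    "更衣室": 3,   # changing room
    "更衣": 1,     # changing (clothes)
    "漏水": 5,     # water leakage
    "水管": 2,     # water pipe
}

tokens = full_mode_cut("更衣室漏水", word_freq)
print(tokens)
print(best_match("更衣", word_freq))
```

Here "更衣" fuzzily matches both itself and "更衣室"; selecting by word frequency resolves it to the more common "更衣室", which mirrors the selection rule described in S30.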
In one example, the process of obtaining the matching result (e.g., the final matching result) can be as shown in fig. 3.
Further, in the field of hospital logistics operation and maintenance, the main content of a work order can generally be described by the pattern "[area, building, floor, department, room, object, phenomenon]". Here "area, building, floor, department, room" describe location information, "object" describes the specific thing to be handled, and "phenomenon" describes the specific condition of the "object"; together these are referred to as main information. The main information items can be denoted by "1, 2, 3, 4, 5, 6, 7" respectively. A complete work order should follow the pattern "[1, 2, 3, 4, 5, 6, 7]" and must at least follow the pattern "[6, 7]", that is, it must include at least an object and a phenomenon; otherwise the work order is meaningless. "Orientation, ordinal number, numeral, letter" are denoted by "a, b, c, d" respectively and are referred to as auxiliary information. Optionally, the arrangement of the main information (e.g. the effective items) and the auxiliary information can be seen in fig. 4.
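The coding convention above can be sketched in Python as a lookup from category to code plus a check for the minimum "[6, 7]" pattern. The names `CODES`, `to_pattern` and `is_meaningful` are hypothetical helpers introduced only for illustration.

```python
# Main information codes (1-7) and auxiliary information codes (a-d).
CODES = {
    "area": 1, "building": 2, "floor": 3, "department": 4,
    "room": 5, "object": 6, "phenomenon": 7,
    "orientation": "a", "ordinal": "b", "numeral": "c", "letter": "d",
}


def to_pattern(categories):
    """Translate a sequence of token categories into the pattern vector."""
    return [CODES[c] for c in categories]


def is_meaningful(pattern):
    """A work order must contain at least an object (6) and a phenomenon (7)."""
    main_codes = {code for code in pattern if isinstance(code, int)}
    return {6, 7} <= main_codes


pattern = to_pattern([
    "numeral", "building", "ordinal", "floor", "orientation",
    "numeral", "room", "object", "phenomenon",
])
print(pattern)
print(is_meaningful(pattern))
```

Keeping main codes as integers and auxiliary codes as letters makes the two kinds of information easy to tell apart when the vector is processed further.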
And S40, determining effective text information of the short text to be processed according to the auxiliary information word segmentation and the final matching result.
The above step can use the characteristics of the auxiliary information word segmentation to merge it with the final matching results, so as to determine the effective text information of the short text to be processed and ensure the accuracy of the obtained effective text information.
The short text data processing method for the hospital logistics operation and maintenance comprises the steps of determining a classification word bank according to a corpus, recombining the classification word bank according to a preset word segmentation mode to obtain a preliminary word segmentation word bank, performing word segmentation processing on the corpus according to the preliminary word segmentation word bank, performing word frequency statistics on each keyword after word segmentation, adding statistical results into the preliminary word segmentation word bank respectively to obtain a user-defined word segmentation word bank, performing full-mode word segmentation on a short text to be processed by adopting the user-defined word segmentation word bank, cleaning word segmentation results of the full-mode word segmentation to obtain auxiliary information word segmentation and a plurality of effective items of the short text to be processed, matching each effective item in the classification word bank in a fuzzy matching mode to obtain an initial matching result of each effective item, and selecting a matching result with the highest word frequency from the initial matching results of each effective item as a final matching result of each effective item, the short text to be processed is accurately segmented according to semantics to obtain auxiliary information word segmentation and a final matching result, so that effective text information of the short text to be processed is determined, and the accuracy of the determined effective text information is improved.
In one embodiment, the determining the effective text information of the short text to be processed according to the auxiliary information word segmentation and the final matching result includes:
determining the arrangement position of each final matching result, and determining an effective word sequence according to the arrangement position;
determining the position of the auxiliary information word in the effective word sequence;
and merging the auxiliary information word segmentation and the effective word sequence according to the position of the auxiliary information word segmentation in the effective word sequence.
The embodiment can accurately merge the auxiliary information word segmentation and the final matching result so as to further improve the accuracy of the determined effective text information.
In one example, referring to fig. 4, after an actual short text work order (e.g., a short text to be processed) is processed by the above short text data processing method for hospital logistics operation and maintenance, a vector composed of elements from "1, 2, 3, 4, 5, 6, a, b, c, d" is output (the number and order of the elements are determined by the actual short text content). This vector is not yet usable in practice: the auxiliary information must be merged with the main information (e.g., the effective items in the final matching results), so that a vector containing only numbers is finally output. For example, the custom word segmentation result of "water leakage from northeast corner 3 changing room in floor 2 of building No. 6" is "6 \ building \ 2 \ floor \ northeast corner \ 3 \ changing room \ water leakage", the corresponding mode vector is [c, 2, b, 3, a, c, 5, 6, 7], the final output result is [c2, b3a, c5, 6, 7], and the corresponding result is "6 building \ northeast corner \ 3 changing room \ water leakage". The difficulty lies in how a computer (such as a short text data processing terminal) can accurately merge the auxiliary information word segments with the main information word segments. This example addresses the problem using the basic patterns of Chinese expression, discussed case by case below:
(1) Auxiliary information word segments at the beginning or end of a sentence
Several consecutive auxiliary information word segments at the beginning of a sentence (in most cases there is only one) are merged directly with one another, regardless of their code categories, and then merged with the nearest main information word segment that follows. The code of the merged content is the original code of the main information word segment it was merged with (as shown in fig. 5). When an auxiliary information word segment is located at the end of a sentence, it is merged directly with the nearest main information word segment, in the same way as at the beginning of a sentence.
(2) Auxiliary information word segments within a sentence
For orientation words
The term "orientation word" refers to position terms such as east, west, south, north, inside, outside, up, down, and middle, together with their combinations and derivatives, such as "southeast" and "inner side". Their main characteristic is that they have no practical meaning on their own and need another content word as a reference. In most cases they form a modifier-head structure such as "the southeast side of X". Therefore, this example merges an orientation word within a sentence with the keyword that precedes it, as shown in fig. 6.
For ordinal words
Ordinal words such as "first" and "second" may be used as subjects. Considering further the expressions directly related to ordinals, such as "the second row" and "the second and the third", the general conclusion is that an ordinal word directly modifies the content that follows it. Therefore, this example provides that when an ordinal word is located within a sentence, it is merged with the keyword that follows it.
For numbers and letters
Individual numbers and letters have no practical meaning in a sentence and must be combined with other content. In habitual Chinese expression, the content that each of them directly modifies follows it. Therefore, this example provides that numeric and alphabetic auxiliary keywords are merged with the keyword that follows them, as shown in fig. 7.
It should be noted that merging letters and numbers with the following keyword gives the correct result in most scenarios, but if the actual content is semantically ambiguous, the result may be wrong. For example, in "changing room 3 lamp tube broken", "3" can be read as modifying either the changing room or the lamp tube.
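The case-by-case rules above can be sketched as follows. This is only an illustration under stated assumptions: the `Token` type, the `kind` labels, and the function name `merge_auxiliary` are hypothetical, and merged words are joined with a space for readability in English.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    kind: str  # "main", "orientation", "ordinal", "number" or "letter"


def merge_auxiliary(tokens):
    """Attach every auxiliary token to a neighbouring main token.

    Rules taken from the text: auxiliaries before the first main token join it,
    auxiliaries after the last main token join it, an orientation word inside
    the sentence joins the preceding main token, and ordinals, numbers and
    letters inside the sentence join the following main token.
    """
    main_idx = [i for i, tok in enumerate(tokens) if tok.kind == "main"]
    if not main_idx:
        # No main information at all: keep the auxiliaries as one group.
        return [" ".join(tok.text for tok in tokens)] if tokens else []
    first, last = main_idx[0], main_idx[-1]
    groups = {i: [] for i in main_idx}  # main index -> member token indices
    for i, tok in enumerate(tokens):
        if tok.kind == "main":
            target = i
        elif i < first:
            target = first
        elif i > last:
            target = last
        elif tok.kind == "orientation":
            target = max(m for m in main_idx if m < i)
        else:
            target = min(m for m in main_idx if m > i)
        groups[target].append(i)
    return [" ".join(tokens[j].text for j in groups[m]) for m in main_idx]


tokens = [
    Token("3", "number"),
    Token("building", "main"),
    Token("east side", "orientation"),
    Token("second", "ordinal"),
    Token("changing room", "main"),
    Token("water leakage", "main"),
]
print(merge_auxiliary(tokens))
```

Each merged group keeps the position of its main token, which matches the rule that the merged content takes the code of the main information it was merged with. The ambiguity noted above is not resolved by this sketch: a number inside the sentence is always attached to the following keyword.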
In this example, keyword extraction from the short text achieves accurate segmentation, and the corresponding short text data processing method for hospital logistics operation and maintenance has the following technical advantages:
(1) Processing flow
The technical flow provided by this example is suited to segmenting a short text precisely according to meaning: it is neither minimum-granularity segmentation nor paragraph segmentation, but phrase segmentation based on practical meaning. The phrases can also be classified according to a fixed mode, which facilitates subsequent statistical analysis and sentence pattern recombination.
(2) Keyword classification
Conventional word segmentation methods record the part of speech of a keyword but cannot classify the keyword. When the custom word segmentation is formulated, this scheme adds a classification dimension on top of the conventional attributes. Phrases are divided into two main categories, "main information" and "auxiliary information", and each category is further divided into subclasses. After the text to be processed is segmented, the complete result and its corresponding type are found in the user-defined classification word bank by fuzzy matching.
(3) Keyword coding
The main information and the auxiliary information are encoded with numbers and letters respectively, so that textual keywords are conveniently converted into numbers and letters, and the mode of an input short text can be judged from the way the letters and numbers combine. This also addresses problems that may exist in input short texts, such as a disordered keyword sequence and out-of-sequence narration.
(4) Pattern processing
Starting from the habits of Chinese expression, auxiliary phrases such as orientation words, ordinal words, numbers, and letters are merged with the preceding or following keyword according to their own characteristics. This reduces the likelihood of meaningless word segments after direct segmentation, improves the semantic integrity of the effective word segments, and makes subsequent processing more convenient.
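The keyword coding described in advantage (3) can be pictured with a small sketch. The code tables below are hypothetical: the text states only that main information is coded with numbers and auxiliary information with letters, not which category receives which code. The helper `reorder_main` shows one possible way such codes could be used to normalise keyword order and is likewise an assumption.

```python
# Hypothetical code tables: numbers for main information, letters for auxiliary.
MAIN_CODES = {"building": 1, "floor": 2, "room": 3, "fault": 4}
AUX_CODES = {"orientation": "a", "ordinal": "b", "number": "c", "letter": "d"}


def encode(tokens):
    """Map (text, category) pairs to a mode vector of numbers and letters."""
    vector = []
    for _text, category in tokens:
        if category in MAIN_CODES:
            vector.append(str(MAIN_CODES[category]))
        else:
            vector.append(AUX_CODES[category])
    return vector


def reorder_main(tokens):
    """Return the main-information texts sorted by numeric code (auxiliaries dropped)."""
    main = [(MAIN_CODES[cat], text) for text, cat in tokens if cat in MAIN_CODES]
    return [text for _code, text in sorted(main)]


tokens = [
    ("water leakage", "fault"),
    ("6", "number"),
    ("building", "building"),
    ("east side", "orientation"),
    ("changing room", "room"),
]
print(encode(tokens))        # mode vector of the input as written
print(reorder_main(tokens))  # main information in code order
```

Here the fault is mentioned first in the input, yet sorting by code places the location keywords before it, which is the kind of order normalisation the coding is said to support.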
In one embodiment, before determining the classified lexicon according to the corpus, the method further comprises:
collecting sentences about a description object within a set time period to obtain a sentence source, and constructing the corpus according to the sentence source.
The set time period may be a relatively long period, such as the last 3 years.
As one example, the descriptive object includes a hospital.
This embodiment helps ensure the accuracy of the constructed corpus.
In one example, taking a certain large hospital as the description object, the corpus can be constructed by collecting the data accumulated in recent years (i.e., sentences describing the object) in a specific field of that hospital.
In one embodiment, after the classification word bank is recombined according to the preset word segmentation mode to obtain the preliminary word segmentation word bank, word segmentation is performed on the corpus according to the preliminary word segmentation word bank, word frequency statistics are computed for each keyword after word segmentation, and the statistical results are added to the preliminary word segmentation word bank to obtain the user-defined word segmentation word bank, the method further includes:
acquiring a public stop-word bank, and constructing a custom stop-word bank according to the public stop-word bank and the classification word bank; the custom stop-word bank is used for cleaning the word segmentation results of the full-mode word segmentation.
Specifically, an original short text usually contains content irrelevant to text processing, collectively called stop words. Stop words not only reduce text processing efficiency but also interfere with the judgment of word segment categories and the determination of the mode. There are about 1,800 public stop words, which are judged to be stop words in general usage; if they are applied without filtering, semantic loss and errors result in a specific field. Stop words such as "lower", "upper", and "one" have definite practical meanings in the field of hospital logistics operation and maintenance and must be retained. This embodiment may obtain the custom stop-word bank as follows:
custom stop-word bank = public stop-word bank ∩ custom classification word bank
In the formula, "∩" denotes intersection.
In one embodiment, the cleaning the word segmentation result of the full-mode word segmentation includes:
identifying stop words in the word segmentation results of the full-mode word segmentation by using the custom stop-word bank, removing the identified stop words and then removing repeated words from the remaining word segmentation results to obtain a plurality of effective items, and determining the identified stop words as the auxiliary information word segmentation.
In the word segmentation results of the full-mode word segmentation, the words that appear in the custom stop-word bank are the stop words.
The cleaning performed by the embodiment has better cleaning effect.
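The cleaning step can be sketched as below, reading the formula above literally as a set intersection. The word lists and the function names `build_custom_stop_bank` and `clean` are hypothetical, chosen only for illustration.

```python
def build_custom_stop_bank(public_stop_words, classification_bank):
    """Literal reading of the formula: words present in both word banks."""
    return set(public_stop_words) & set(classification_bank)


def clean(segments, custom_stop_bank):
    """Split full-mode segmentation results into auxiliary words and effective items.

    Words found in the custom stop-word bank become auxiliary information;
    the remaining words are de-duplicated, keeping their first occurrence.
    """
    auxiliary, effective, seen = [], [], set()
    for word in segments:
        if word in custom_stop_bank:
            auxiliary.append(word)
        elif word not in seen:
            seen.add(word)
            effective.append(word)
    return auxiliary, effective


# Hypothetical word banks and a hypothetical full-mode segmentation result.
public_stop_words = {"upper", "lower", "one", "the"}
classification_bank = {"upper", "lower", "one", "changing room", "water leakage"}
stop_bank = build_custom_stop_bank(public_stop_words, classification_bank)

segments = ["upper", "changing room", "changing room", "water leakage", "one"]
auxiliary, effective = clean(segments, stop_bank)
print(auxiliary, effective)
```

Full-mode segmentation tends to emit overlapping or repeated words, which is why the sketch de-duplicates the effective items while preserving their original order.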
Referring to fig. 8, fig. 8 is a schematic structural diagram of a short text data processing apparatus for hospital logistics operation and maintenance according to an embodiment, including:
a first determining module 10, configured to determine a classified lexicon according to the corpus; the classified word bank is used for describing the word category of each word included in the corpus;
a regrouping module 20, configured to regroup the classified lexicon according to a preset word segmentation mode to obtain a preliminary word segmentation lexicon, perform word segmentation processing on the corpus according to the preliminary word segmentation lexicon, perform word frequency statistics on each keyword after word segmentation, add the statistical results to the preliminary word segmentation lexicon respectively, and obtain a user-defined word segmentation lexicon;
the word segmentation module 30 is configured to perform full-mode word segmentation on the short text to be processed by using the custom word segmentation word bank, clean word segmentation results of the full-mode word segmentation to obtain auxiliary information word segmentation and a plurality of effective items of the short text to be processed, match each effective item in the classification word bank in a fuzzy matching manner to obtain an initial matching result of each effective item, and select a matching result with the highest word frequency from the initial matching results of each effective item as a final matching result of each effective item;
and the second determining module 40 is configured to determine effective text information of the short text to be processed according to the auxiliary information word segmentation and the final matching result.
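As an illustration of the word-frequency step performed by the recombination module, the sketch below segments a small corpus with the preliminary word bank and records how often each entry occurs. The text does not name a segmentation algorithm, so forward maximum matching is assumed here; the corpus, the word bank entries, and the function names are hypothetical.

```python
from collections import Counter


def segment(text, word_bank):
    """Forward maximum matching: take the longest word bank entry at each position.

    Characters that start no entry are emitted one at a time.
    """
    max_len = max(len(word) for word in word_bank)
    pieces, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in word_bank:
                pieces.append(piece)
                i += size
                break
    return pieces


def build_custom_word_bank(corpus, preliminary_bank):
    """Attach a corpus word frequency to every entry of the preliminary word bank."""
    counts = Counter()
    for sentence in corpus:
        counts.update(w for w in segment(sentence, preliminary_bank) if w in preliminary_bank)
    return {word: counts[word] for word in preliminary_bank}


# Hypothetical data: 更衣室 = changing room, 更衣 = change clothes,
# 漏水 = water leakage, 灯管 = lamp tube.
preliminary_bank = {"更衣室", "更衣", "漏水", "灯管"}
corpus = ["更衣室漏水", "更衣室灯管坏了", "3楼漏水"]
custom_bank = build_custom_word_bank(corpus, preliminary_bank)
print(custom_bank)
```

The resulting frequencies are what a later matching step can use to prefer the more common entry when several candidates fit.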
In one embodiment, the second determination module is to:
determining the arrangement position of each final matching result, and determining an effective word sequence according to the arrangement position;
determining the position of the auxiliary information word in the effective word sequence;
and merging the auxiliary information word segmentation and the effective word sequence according to the position of the auxiliary information word segmentation in the effective word sequence.
For specific limitations of the short text data processing device for hospital logistics operation and maintenance, reference may be made to the above limitations of the short text data processing method for hospital logistics operation and maintenance, and details are not repeated here. The modules in the short text data processing device for hospital logistics operation and maintenance may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a short text data processing method for hospital logistics operation and maintenance. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
Based on the examples described above, there is also provided in one embodiment a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the short text data processing method for hospital logistics operation and maintenance as in any of the embodiments described above.
It will be understood by those skilled in the art that all or part of the processes in the methods of the embodiments described above may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium, and in the embodiments of the present invention, the program may be stored in the storage medium of a computer system and executed by at least one processor in the computer system, so as to implement the processes of the embodiments including the short text data processing method for hospital logistics operation and maintenance described above. The storage medium may be a magnetic disk, an optical disk, a Read-only Memory (ROM), a Random Access Memory (RAM), or the like.
Accordingly, in one embodiment, a computer-readable storage medium is also provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the short text data processing method for hospital logistics operation and maintenance as in any of the above embodiments.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described, but any combination of these technical features should be considered to fall within the scope of this specification as long as it involves no contradiction.
It should be noted that the terms "first \ second \ third" in the embodiments of the present application merely distinguish similar objects and do not represent a specific ordering of the objects. It should be understood that the objects distinguished by "first \ second \ third" may be interchanged in order or sequence where permitted, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
The terms "comprising" and "having" and any variations thereof in the embodiments of the present application are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or device that comprises a list of steps or modules is not limited to the listed steps or modules but may alternatively include other steps or modules not listed or inherent to such process, method, product, or device.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A short text data processing method for hospital logistics operation and maintenance is characterized by comprising the following steps:
S10, determining a classified word bank according to the corpus; the classified word bank is used for describing the word category of each word included in the corpus;
S20, recombining the classified word bank according to a preset word segmentation mode to obtain a preliminary word segmentation word bank, performing word segmentation processing on the corpus according to the preliminary word segmentation word bank, performing word frequency statistics on each keyword after word segmentation, and adding the statistical results into the preliminary word segmentation word bank respectively to obtain a user-defined word segmentation word bank;
S30, performing full-mode word segmentation on the short text to be processed by adopting the user-defined word segmentation word bank, cleaning word segmentation results of the full-mode word segmentation to obtain auxiliary information word segmentation and a plurality of effective items of the short text to be processed, matching each effective item in the classification word bank in a fuzzy matching mode to obtain initial matching results of each effective item, and selecting the matching result with the highest word frequency from the initial matching results of each effective item as the final matching result of each effective item;
and S40, determining effective text information of the short text to be processed according to the auxiliary information word segmentation and the final matching result.
2. The short text data processing method for hospital logistics operation and maintenance according to claim 1, wherein the determining valid text information of the short text to be processed according to the auxiliary information word segmentation and the final matching result comprises:
determining the arrangement position of each final matching result, and determining an effective word sequence according to the arrangement position;
determining the position of the auxiliary information word in the effective word sequence;
and merging the auxiliary information word segmentation and the effective word sequence according to the position of the auxiliary information word segmentation in the effective word sequence.
3. The short text data processing method for hospital logistics operation and maintenance according to claim 1, wherein before determining the classified lexicon according to the corpus, further comprising:
and collecting the sentences aiming at the description objects in a set time period to obtain sentence sources, and constructing a corpus according to the sentence sources.
4. The short text data processing method for hospital logistics operation and maintenance according to claim 3, characterized in that the description object comprises a hospital.
5. The short text data processing method for hospital logistics operation and maintenance according to any one of claims 1 to 4, wherein, after recombining the classified word bank according to a preset word segmentation mode to obtain a preliminary word segmentation word bank, performing word segmentation processing on the corpus according to the preliminary word segmentation word bank, performing word frequency statistics on each keyword after word segmentation, and adding the statistical results into the preliminary word segmentation word bank respectively to obtain a user-defined word segmentation word bank, the method further comprises:
acquiring a public stop-word bank, and constructing a custom stop-word bank according to the public stop-word bank and the classification word bank; the custom stop-word bank is used for cleaning word segmentation results of the full-mode word segmentation.
6. The short text data processing method for hospital logistics operation and maintenance according to claim 5, wherein the cleaning of the word segmentation results of full-mode word segmentation comprises:
identifying stop words in the word segmentation results of the full-mode word segmentation by adopting the custom stop-word bank, removing the identified stop words and then removing repeated words from the remaining word segmentation results to obtain a plurality of effective items, and determining the identified stop words as the auxiliary information word segmentation.
7. A short text data processing device for hospital logistics operation and maintenance, comprising:
the first determining module is used for determining a classified word bank according to the corpus; the classified word bank is used for describing the word category of each word included in the corpus;
the recombination module is used for recombining the classified word bank according to a preset word segmentation mode to obtain a preliminary word segmentation word bank, performing word segmentation processing on the corpus according to the preliminary word segmentation word bank, performing word frequency statistics on each keyword after word segmentation, and adding the statistical results into the preliminary word segmentation word bank respectively to obtain a user-defined word segmentation word bank;
the word segmentation module is used for performing full-mode word segmentation on the short text to be processed by adopting the user-defined word segmentation word bank, cleaning word segmentation results of the full-mode word segmentation to obtain auxiliary information word segmentation and a plurality of effective items of the short text to be processed, matching each effective item in the classification word bank in a fuzzy matching mode to obtain an initial matching result of each effective item, and selecting a matching result with the highest word frequency from the initial matching results of each effective item as a final matching result of each effective item;
and the second determining module is used for determining effective text information of the short text to be processed according to the auxiliary information word segmentation and the final matching result.
8. The short text data processing device for hospital logistics operation and maintenance according to claim 7, characterized in that the second determination module is configured to:
determining the arrangement position of each final matching result, and determining an effective word sequence according to the arrangement position;
determining the position of the auxiliary information word in the effective word sequence;
and merging the auxiliary information word segmentation and the effective word sequence according to the position of the auxiliary information word segmentation in the effective word sequence.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 6 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911408217.5A CN111161861A (en) | 2019-12-31 | 2019-12-31 | Short text data processing method and device for hospital logistics operation and maintenance |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111161861A true CN111161861A (en) | 2020-05-15 |
Family
ID=70559682
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911408217.5A Pending CN111161861A (en) | 2019-12-31 | 2019-12-31 | Short text data processing method and device for hospital logistics operation and maintenance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111161861A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104750665A (en) * | 2013-12-30 | 2015-07-01 | 腾讯科技(深圳)有限公司 | Text message processing method and text message processing device |
WO2017088126A1 (en) * | 2015-11-25 | 2017-06-01 | 华为技术有限公司 | Method and device for obtaining out-of-vocabulary word |
CN106897290A (en) * | 2015-12-17 | 2017-06-27 | 中国移动通信集团上海有限公司 | A kind of method and device for setting up keyword models |
US10146751B1 (en) * | 2014-12-31 | 2018-12-04 | Guangsheng Zhang | Methods for information extraction, search, and structured representation of text data |
CN110377724A (en) * | 2019-07-01 | 2019-10-25 | 厦门美域中央信息科技有限公司 | A kind of corpus keyword Automatic algorithm based on data mining |
CN110399385A (en) * | 2019-06-24 | 2019-11-01 | 厦门市美亚柏科信息股份有限公司 | A kind of semantic analysis and system for small data set |
CN110543637A (en) * | 2019-09-06 | 2019-12-06 | 知者信息技术服务成都有限公司 | Chinese word segmentation method and device |
CN110597988A (en) * | 2019-08-28 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Text classification method, device, equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Chen Dehua (陈德华) et al.: "Structured processing method for pathological microscopy text data", Computer and Modernization (《计算机与现代化》) * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112148750A (en) * | 2020-10-20 | 2020-12-29 | 成都中科大旗软件股份有限公司 | Data integration method and system |
CN112148750B (en) * | 2020-10-20 | 2023-04-25 | 成都中科大旗软件股份有限公司 | Data integration method and system |
CN112989761A (en) * | 2021-05-20 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Text classification method and device |
CN112989761B (en) * | 2021-05-20 | 2021-08-24 | 腾讯科技(深圳)有限公司 | Text classification method and device |
CN114580398A (en) * | 2022-03-15 | 2022-06-03 | 中国工商银行股份有限公司 | Text information extraction model generation method, text information extraction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||