CN116226362A

CN116226362A - Word segmentation method for improving accuracy of searching hospital names

Info

Publication number: CN116226362A
Application number: CN202310500980.0A
Authority: CN
Inventors: 罗方义; 吴红曼; 刘雨鑫
Original assignee: Hunan Deya Manda Technology Co ltd
Current assignee: Hunan Deya Manda Technology Co ltd
Priority date: 2023-05-06
Filing date: 2023-05-06
Publication date: 2023-06-06
Anticipated expiration: 2043-05-06
Also published as: CN116226362B

Abstract

The invention discloses a word segmentation method for improving the accuracy of searching hospital names, belonging to the technical field of hospital information. The method comprises: decomposing the characters in a text set one by one, based on a set of common names of target hospitals, to form a text set of single characters; combining the characters in the text set with their neighboring characters to form segmented phrases, matching the segmented phrases against a dictionary in a database, and outputting the successfully matched word segmentation results; and displaying the matching results in order of the matching degree of the word segmentation results. The invention checks and matches the characters entered by the user one by one and eliminates ambiguity in word segmentation, thereby greatly improving the accuracy and efficiency of searching and the user experience.

Description

Word segmentation method for improving accuracy of searching hospital names

Technical Field

The invention belongs to the technical field of hospital information, and particularly relates to a word segmentation method for improving the accuracy of searching hospital names.

Background

With the spread of intelligent and information technology, users can learn about the outside world without leaving home and can obtain many kinds of information by searching on internet-connected devices, so that people's information stays synchronized. With the arrival of the information age, the internet plays a growing role in every aspect of production and daily life, and for a country whose native language is Chinese, Chinese information processing technology occupies a very important position in national informatization.

When a user searches for a hospital name in daily life, the name is usually long. If the full name of the hospital cannot be identified, several different hospital names may appear in the search box, and several hospitals may also exist in the same city, so the user cannot be sure which hospital is the right one, which degrades the user experience.

Chinese patent publication No. CN112199494A discloses a medical information searching method, apparatus, electronic device and storage medium. The method determines a medical query sentence and preprocesses it to obtain a word segmentation sequence comprising a plurality of medical words; obtains a pre-built inverted index table and determines an initial text field for each medical word; determines the medical words in the initial text fields to be boundary words; determines target text fields from the initial text fields, each target text field corresponding to one query dimension; determines the search library corresponding to the search request according to the query dimension; and searches the medical words in the search library to obtain the search result of the search request.

Chinese patent publication No. CN109543178A discloses a method and system for constructing a judicial text label system: judicial vocabulary texts are obtained with a word segmentation tool; a primary tag system is constructed from word frequency statistics; tags with similar semantics in the primary tag system are merged and harsh tags are expanded to obtain an expanded tag system; the accuracy with which the expanded tag system retrieves texts is measured on a text test set to verify whether construction of the current expanded tag system is complete; otherwise the tag system is further optimized.

Chinese patent publication No. CN111950283A discloses a Chinese word segmentation and named entity recognition system for large-scale medical text mining: word vectors are obtained based on word2vec and the segmented text and are input into a stacked BiLSTM-CRF model; the first layer of the stacked BiLSTM-CRF model performs entity labeling on the word vectors; part-of-speech features are added to the labeled word vectors to form an input feature set; and the second layer of the stacked BiLSTM-CRF model performs complex named entity recognition on the input feature set.

The prior art has the following problems: when the target information is segmented, it is not decomposed into single characters and the characters are not recombined, so information is missed and searching and matching are not accurate enough; homophone replacement search is not performed, so tolerance of mistyped words in the search input is insufficient; words are not disambiguated; and word segmentation based on a semantic model is computationally complex and demanding, so the system comes under heavy computing and operating pressure when faced with internet-scale search loads.

Disclosure of Invention

The invention aims to provide a word segmentation method for improving the accuracy of searching hospital names, and solves the defects in the background technology.

To achieve the above purpose, the invention provides the following technical solution: a word segmentation method for improving the accuracy of searching hospital names, comprising the following steps:

S1, establishing a single-character set based on a set of common names of target hospitals, which specifically comprises the following sub-steps:

S11, establishing a common name set from the input common names of the target hospitals;

S12, decomposing the words and phrases in the common name set one by one to form a single-character set, each element of which is a single character;

S2, combining the single characters in the single-character set with their neighboring characters to form segmented phrases, and matching the segmented phrases against the dictionary in the database, which comprises the following sub-steps:

S21, combining all the single characters of the single-character set in forward order and in reverse order to obtain a word segmentation set; the word segmentation set comprises a two-character phrase set, a three-character phrase set and a four-character phrase set, and each segmented phrase consists of an initial character and the characters that follow it;

S22, matching the search field entered by the searcher against the word segmentation set:

S221, if the matching succeeds, removing the matched phrase from the word segmentation set, and using the remaining part as a new word segmentation set for repeated combination and matching;

S222, if the matching fails, selecting one or more single characters from the word segmentation set, taken in the forward or reverse direction, to form a character string to be matched, and matching the character string against the search field until the phrases in the word segmentation set have all been matched or the string has been truncated to the last single character;

S3, outputting the word segmentation results that are successfully matched;

S4, displaying the matching results in order of the matching degree of the word segmentation results.
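Steps S1 to S4 can be pictured with a minimal Python sketch. It rests on several assumptions that are not stated in the text: phrases are taken as contiguous two- to four-character substrings of each common name, and the "matching degree" is approximated by the length of the longest phrase found in the search field, then by the number of phrases found. The function names `build_phrases` and `rank_hospitals` are hypothetical.

```python
def build_phrases(name: str, sizes=(2, 3, 4)) -> set[str]:
    """S12 + S21 (assumed reading): every contiguous 2-, 3- and 4-character
    phrase, i.e. an initial character plus the characters that follow it."""
    return {name[i:i + n] for n in sizes for i in range(len(name) - n + 1)}


def rank_hospitals(search_field: str, common_names: list[str]) -> list[str]:
    """S22 to S4 (assumed scoring): keep names with at least one phrase that
    occurs in the search field, ordered by an illustrative matching degree."""
    scored = []
    for name in common_names:
        hits = [p for p in build_phrases(name) if p in search_field]
        if hits:
            longest = max(len(p) for p in hits)
            scored.append((longest, len(hits), name))
    # Longest matched phrase first, then number of matched phrases, then name.
    scored.sort(key=lambda t: (-t[0], -t[1], t[2]))
    return [name for _, _, name in scored]


names = ["湘雅医院", "湘雅二医院", "人民医院"]
print(rank_hospitals("湘雅二", names))  # ['湘雅二医院', '湘雅医院']
```

Under these assumptions a partial query ranks the name that shares the longest phrase first, which is the behavior step S4 describes in general terms.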

Further, the combined text that cannot be successfully matched is segmented and its ambiguity is eliminated, with the following specific steps:

S5, taking the text that cannot be successfully matched as the Chinese text Y to be segmented, and segmenting it by the forward maximum matching method, the reverse maximum matching method and an HMM (hidden Markov model), to obtain three word segmentation results, one from each method;

S6, comparing the three word segmentation results and marking the parts on which they are not identical as the ambiguous part;

S7, judging which kind of ambiguity result the ambiguous part belongs to, and disambiguating accordingly:

S71, first result: if any two of the three word segmentation results are identical, the segmentation result specified for this case is taken as the final segmentation;

S72, second result: if the three word segmentation results all differ from one another, the segmentation result specified for this case is taken as the final segmentation;

when the ambiguity result is the second result, a second disambiguation is performed on the basis of the first: the parts of speech of the three word segmentation results are tagged using the HMM, the ambiguous parts that differ in each word segmentation result are screened out, the segmentation that maximizes an evaluation function is obtained, and this segmentation is taken as the final segmentation.
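The comparison in steps S5 to S71 can be sketched as follows. This is a minimal sketch under stated assumptions: the dictionary and the four-character window are illustrative, the HMM segmentation is passed in as a precomputed list rather than computed here, and choosing the segmentation on which two methods agree is an assumed reading of the first result. The names `forward_max_match`, `reverse_max_match` and `agreed_segmentation` are hypothetical.

```python
def forward_max_match(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Scan left to right, always taking the longest dictionary word;
    a single character is accepted when nothing longer matches."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in dictionary:
                words.append(piece)
                i += n
                break
    return words


def reverse_max_match(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Same idea, scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            piece = text[j - n:j]
            if n == 1 or piece in dictionary:
                words.append(piece)
                j -= n
                break
    return words[::-1]


def agreed_segmentation(fmm, rmm, hmm):
    """First result (assumed): return the segmentation shared by at least two
    methods; return None when all three differ (the second result)."""
    if fmm == rmm or fmm == hmm:
        return fmm
    if rmm == hmm:
        return rmm
    return None


dictionary = {"研究", "研究生", "生命", "起源"}   # illustrative dictionary
text = "研究生命起源"
fmm = forward_max_match(text, dictionary)        # ['研究生', '命', '起源']
rmm = reverse_max_match(text, dictionary)        # ['研究', '生命', '起源']
hmm = ["研究", "生命", "起源"]                    # assumed output of an HMM segmenter
print(agreed_segmentation(fmm, rmm, hmm))        # ['研究', '生命', '起源']
```

When the function returns `None`, the text calls for a second disambiguation with part-of-speech tagging and an evaluation function; that function is not specified in the text, so it is left out of the sketch.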

Further, before the common name set is segmented, the common name set is preprocessed: Chinese and English numerals, domain names and other items with obvious features are recognized; the text of the common name set is filtered, word frequencies are counted and candidate words are selected; and the Chinese and English numerals and domain names are screened out, the filtering being repeated until no Chinese and English numerals or domain names remain to be selected.
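One way to picture this preprocessing is a regular-expression pass that pulls out the obviously featured tokens before segmentation. The sketch below is an assumption-laden illustration: it treats domain names and runs of Latin letters or digits as the featured items, and the patterns, labels and the function name `preprocess` are hypothetical rather than taken from the text.

```python
import re

# Assumed patterns: a dotted domain name, and any run of Latin letters or digits.
DOMAIN = re.compile(r"(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")
ALNUM = re.compile(r"[A-Za-z0-9]+")


def preprocess(name: str) -> tuple[str, list[tuple[str, str]]]:
    """Return the name with featured tokens removed, plus the tokens found."""
    tokens: list[tuple[str, str]] = []

    def extract(pattern: re.Pattern, label: str, text: str) -> str:
        def record(match: re.Match) -> str:
            tokens.append((label, match.group(0)))
            return " "  # leave a gap so neighboring text is not fused into a new token
        return pattern.sub(record, text)

    text = extract(DOMAIN, "domain", name)   # domains first, so their letters are not split
    text = extract(ALNUM, "alnum", text)
    return "".join(text.split()), tokens


print(preprocess("长沙市第1医院www.example.com"))
# ('长沙市第医院', [('domain', 'www.example.com'), ('alnum', '1')])
```

A single `re.sub` call already removes every non-overlapping match, so the repeated filtering described in the text reduces here to one pass per pattern.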

Further, when the search field is matched against the word segmentation set, the characters of the word segmentation set are inserted, indexed and stored;

wherein the word segmentation set comprises an initial node, a plurality of intermediate nodes and an end node; the initial node is located at the successfully matched phrase with the smallest sequence number in the history record, the intermediate nodes are located at the phrases successfully matched each time in the history record, and the end node is located at the successfully matched phrase with the largest sequence number in the history record; each match has a path from the initial node to the end node, and a plurality of intermediate nodes exist on each path;

when a word stored in the word segmentation set is looked up, the query starts from the initial node and traverses along a branch until the last character of the word is reached, at which point the query is complete.

Further, the word segmentation set is matched as follows:

acquiring the first character of the search field, finding the initial node corresponding to the first character, and jumping to the intermediate node of the next character to wait for the next query;

acquiring the second character of the character string to be queried at that intermediate node, and jumping again to the intermediate node of the next character to wait for the next query;

repeating this operation until the last character of the word is reached, which serves as the end node;

reading the information of the last character's node and returning all the characters on the path traversed, which completes the query.
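The insert, index and character-by-character lookup described above resemble a prefix tree (trie). The following minimal sketch uses a plain trie as a stand-in; it does not model the history-record sequence numbers mentioned in the text, and the class and method names are hypothetical.

```python
class TrieNode:
    def __init__(self) -> None:
        self.children: dict[str, "TrieNode"] = {}
        self.is_end = False  # True when a stored word ends at this node


class PhraseTrie:
    """Plain prefix tree used as a stand-in for the stored word segmentation set."""

    def __init__(self) -> None:
        self.root = TrieNode()  # plays the role of the initial node

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())  # intermediate nodes
        node.is_end = True  # end node

    def lookup(self, query: str):
        """Walk one character at a time; return the characters on the path
        if the walk finishes on an end node, otherwise None."""
        node, path = self.root, []
        for ch in query:
            node = node.children.get(ch)
            if node is None:
                return None
            path.append(ch)
        return "".join(path) if node.is_end else None


trie = PhraseTrie()
for word in ["湘雅医院", "湘雅二医院"]:
    trie.insert(word)
print(trie.lookup("湘雅医院"))  # 湘雅医院
print(trie.lookup("湘雅"))      # None: a prefix only, no stored word ends here
```

Each lookup costs one dictionary access per character of the query, independent of how many phrases are stored.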

Further, when a segmented phrase still cannot be successfully matched, pinyin matching is performed on all characters in the phrase: the pinyin of each character is obtained, and its initials and finals are combined and matched against the initials and finals of the pinyin in the search field.
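The pinyin fallback can be illustrated with a small sketch. Everything specific in it is an assumption: the tiny `PINYIN` table stands in for the database lookup, tones are ignored, `y` and `w` are treated as initials for simplicity, and the function names are hypothetical.

```python
# Hypothetical toneless pinyin table standing in for the database lookup.
PINYIN = {"湘": "xiang", "香": "xiang", "雅": "ya", "亚": "ya", "医": "yi", "院": "yuan"}

# Two-letter initials come first so that "zh" is tried before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable: str) -> tuple[str, str]:
    """Split a pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllables such as "an"


def pinyin_key(text: str) -> list[tuple[str, str]]:
    return [split_syllable(PINYIN[ch]) for ch in text]


def same_pronunciation(phrase: str, search_field: str) -> bool:
    """True when both strings have identical initials and finals, character by character."""
    try:
        return pinyin_key(phrase) == pinyin_key(search_field)
    except KeyError:  # a character missing from the illustrative table
        return False


print(same_pronunciation("湘雅医院", "香亚医院"))  # True: homophones of the stored name
```

With a complete pinyin table, a query typed with the wrong homophone characters would still reach the intended hospital name.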

Beneficial effects: the word segmentation method of the invention, which belongs to the technical field of hospital information, combines the characters in the text set with their neighboring characters to form words, matches the words against the vocabulary in a database, and outputs the successfully matched word segmentation results; the matching results are displayed in order of the matching degree of the word segmentation results; the invention checks and matches the characters entered by the user one by one and eliminates ambiguity in word segmentation, thereby greatly improving the accuracy and efficiency of searching and the user experience.

Drawings

Fig. 1 is a schematic diagram of the operation of the present invention.

Fig. 2 is a flow chart of the operation of the present invention.

Fig. 3 is a diagram of the disambiguation step of the present invention.

FIG. 4 is a word segmentation matching flow diagram of the present invention.

FIG. 5 is a schematic diagram of word segmentation matching of the present invention.

Detailed Description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings; the embodiments described are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the invention.

A word segmentation method for improving the accuracy of searching hospital names comprises the following steps:

establishing a corresponding text set according to the input target text, and decomposing the characters in the text set one by one to form a text set of single characters;

combining the characters in the text set with their neighboring characters to form segmented phrases, matching the segmented phrases against a dictionary in a database, and outputting the successfully matched word segmentation results;

displaying the matching results in order of the matching degree of the word segmentation results.

In one embodiment, a common name set is established from the input common names of the target hospitals; the words and phrases in the common name set are decomposed one by one to form a single-character set, each element of which is a single character.

In one embodiment, combining the single characters in the single-character set with their neighboring characters to form segmented phrases, and matching the segmented phrases against the dictionary in the database, comprises the following steps:

combining all the single characters of the single-character set in forward order and in reverse order to obtain a word segmentation set; the word segmentation set comprises a two-character phrase set, a three-character phrase set and a four-character phrase set, and each segmented phrase consists of an initial character and the characters that follow it;

matching the search field entered by the searcher against the word segmentation set:

if the matching succeeds, removing the matched phrase from the word segmentation set, and using the remaining part as a new word segmentation set for repeated combination and matching;

if the matching fails, selecting one or more single characters from the word segmentation set, taken in the forward or reverse direction, to form a character string to be matched, and matching the character string against the search field until the phrases in the word segmentation set have all been matched or the string has been truncated to the last single character;

outputting the word segmentation results that are successfully matched.
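The match-then-retry loop of this embodiment can be sketched as below. It is a minimal sketch of one possible reading: a piece of up to four characters is cut from the front (forward) or the back (reverse) of the remaining name and tested against the search field; a hit is recorded and removed, and when nothing matches, one leading character is discarded. The function name `match_segments` and the four-character window are assumptions.

```python
def match_segments(search_field: str, name: str, max_len: int = 4) -> list[str]:
    """Repeatedly cut pieces off the remaining name and keep those found in the search field."""
    matched: list[str] = []
    rest = name
    while rest:
        for n in range(min(max_len, len(rest)), 0, -1):
            head, tail = rest[:n], rest[-n:]
            if head in search_field:      # forward cut matched
                matched.append(head)
                rest = rest[n:]
                break
            if tail in search_field:      # reverse cut matched
                matched.append(tail)
                rest = rest[:-n]
                break
        else:
            # Nothing matched, not even a single character: drop the leading
            # character and continue with the remainder.
            rest = rest[1:]
    return matched


print(match_segments("湘雅二院", "湘雅二医院"))  # ['湘雅二', '院']
```

The loop always terminates because every pass shortens the remaining string by at least one character.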

In one embodiment, combined text that cannot be successfully matched needs to be segmented to eliminate ambiguity, with the following specific steps:

taking the text that cannot be successfully matched as the Chinese text Y to be segmented, and segmenting it by the forward maximum matching method, the reverse maximum matching method and an HMM to obtain three word segmentation results;

comparing the three word segmentation results and marking the parts on which they are not identical as the ambiguous part;

judging which kind of ambiguity result the ambiguous part belongs to, and disambiguating accordingly:

first result: if any two of the three word segmentation results are identical, the segmentation result specified for this case is taken as the final segmentation;

second result: if the three word segmentation results all differ from one another, the segmentation result specified for this case is taken as the final segmentation.
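Marking the ambiguous part can be illustrated by comparing word boundaries. In this sketch, which is one possible way to do the comparison rather than the method fixed by the text, each segmentation is turned into a set of (start, end) spans; spans shared by all segmentations are treated as unambiguous, and the remaining stretches of text are returned as ambiguous parts. The function names are hypothetical.

```python
def word_spans(words: list[str]) -> set[tuple[int, int]]:
    """Convert a segmentation into the set of (start, end) offsets of its words."""
    spans, pos = set(), 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def ambiguous_parts(text: str, *segmentations: list[str]) -> list[str]:
    """Return the stretches of text not covered by words that every segmentation agrees on."""
    agreed = set.intersection(*(word_spans(s) for s in segmentations))
    covered = [False] * len(text)
    for start, end in agreed:
        for i in range(start, end):
            covered[i] = True

    parts, begin = [], None
    for i, ok in enumerate(covered + [True]):   # sentinel closes a trailing run
        if not ok and begin is None:
            begin = i
        elif ok and begin is not None:
            parts.append(text[begin:i])
            begin = None
    return parts


text = "研究生命起源"
fmm = ["研究生", "命", "起源"]   # e.g. a forward maximum matching result
rmm = ["研究", "生命", "起源"]   # e.g. a reverse maximum matching result
hmm = ["研究", "生命", "起源"]   # assumed HMM result
print(ambiguous_parts(text, fmm, rmm, hmm))  # ['研究生命']
```

Only the marked stretch then needs the first or second disambiguation; the agreed words can be kept as they are.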

In one embodiment, before the common name set is segmented, the text of the common name set needs to be filtered, word frequencies counted and candidate words selected; Chinese and English numerals, domain names and the like are screened out, and the screening and filtering are repeated until no Chinese and English numerals or domain names remain to be selected; because domain names can be told apart in this way, accuracy and recognition efficiency are greatly improved.
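The word-frequency step can be pictured with a short counting sketch. The assumptions are that candidate words are contiguous two- to four-character substrings and that a candidate must occur at least `min_count` times across the common name set; both the threshold and the function name `candidate_words` are hypothetical.

```python
from collections import Counter


def candidate_words(names: list[str], sizes=(2, 3, 4), min_count: int = 2) -> list[str]:
    """Count every 2- to 4-character substring and keep the frequent ones."""
    counts = Counter(
        name[i:i + n]
        for name in names
        for n in sizes
        for i in range(len(name) - n + 1)
    )
    # most_common() lists substrings from most to least frequent.
    return [word for word, count in counts.most_common() if count >= min_count]


print(candidate_words(["湘雅医院", "人民医院", "中心医院"]))  # ['医院']
```

Frequent substrings such as the generic suffix shared by many hospital names surface first, which makes them natural dictionary candidates.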

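In one embodiment, when the search field is matched against the word segmentation set, the characters of the word segmentation set are inserted, indexed and stored;

wherein the word segmentation set comprises an initial node, a plurality of intermediate nodes and an end node; the initial node is located at the successfully matched phrase with the smallest sequence number in the history record, the intermediate nodes are located at the phrases successfully matched each time in the history record, and the end node is located at the successfully matched phrase with the largest sequence number in the history record; each match has a path from the initial node to the end node, and a plurality of intermediate nodes exist on each path;

when a word stored in the word segmentation set is looked up, the query starts from the initial node and traverses along a branch until the last character of the word is reached, at which point the query is complete.

In one embodiment, the word segmentation set is matched as follows: the first character of the search field is acquired, the initial node corresponding to it is found, and the query jumps to the intermediate node of the next character; each following character is handled in the same way until the last character of the word is reached, which serves as the end node; the information of the last character's node is then read and all the characters on the path traversed are returned, which completes the query.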

In one embodiment, when a segmented phrase still cannot be successfully matched, pinyin matching needs to be performed on all characters in the phrase: the database is used to look up the pinyin of each character of the text set so that characters with the same pinyin are matched; the pinyin of each character is obtained, and its initials and finals are combined and matched against the initials and finals of the pinyin in the search field.

Evidently, the above embodiments are given only for clarity of illustration and do not limit the implementations. Those of ordinary skill in the art can make other variations or modifications on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Obvious variations or modifications derived from them remain within the scope of protection of the invention.

Claims

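1. A word segmentation method for improving the accuracy of searching hospital names, characterized by comprising the following steps:

S1, establishing a single-character set based on a set of common names of target hospitals, which specifically comprises the following sub-steps:

S11, establishing a common name set from the input common names of the target hospitals;

S12, decomposing the words and phrases in the common name set one by one to form a single-character set, each element of which is a single character;

S2, combining the single characters in the single-character set with their neighboring characters to form segmented phrases, and matching the segmented phrases against the dictionary in the database, which comprises the following sub-steps:

S21, combining all the single characters of the single-character set in forward order and in reverse order to obtain a word segmentation set; the word segmentation set comprises a two-character phrase set, a three-character phrase set and a four-character phrase set, and each segmented phrase consists of an initial character and the characters that follow it;

S22, matching the search field entered by the searcher against the word segmentation set:

S221, if the matching succeeds, removing the matched phrase from the word segmentation set, and using the remaining part as a new word segmentation set for repeated combination and matching;

S222, if the matching fails, selecting one or more single characters from the word segmentation set, taken in the forward or reverse direction, to form a character string to be matched, and matching the character string against the search field until the phrases in the word segmentation set have all been matched or the string has been truncated to the last single character;

S3, outputting the word segmentation results that are successfully matched;

S4, displaying the matching results in order of the matching degree of the word segmentation results.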

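2. The word segmentation method for improving the accuracy of searching hospital names according to claim 1, characterized in that the combined text that cannot be successfully matched is segmented and its ambiguity is eliminated, with the following specific steps:

S5, taking the text that cannot be successfully matched as the Chinese text Y to be segmented, and segmenting it by the forward maximum matching method, the reverse maximum matching method and an HMM to obtain three word segmentation results;

S6, comparing the three word segmentation results and marking the parts on which they are not identical as the ambiguous part;

S7, judging which kind of ambiguity result the ambiguous part belongs to, and disambiguating accordingly:

S71, first result: if any two of the three word segmentation results are identical, the segmentation result specified for this case is taken as the final segmentation;

S72, second result: if the three word segmentation results all differ from one another, the segmentation result specified for this case is taken as the final segmentation.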

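3. The word segmentation method for improving the accuracy of searching hospital names according to claim 2, characterized in that, before the common name set is segmented, the common name set is preprocessed: Chinese and English numerals, domain names and other items with obvious features are recognized; the text of the common name set is filtered, word frequencies are counted and candidate words are selected; and the Chinese and English numerals and domain names are screened out, the filtering being repeated until no Chinese and English numerals or domain names remain to be selected.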

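4. The word segmentation method for improving the accuracy of searching hospital names according to claim 3, characterized in that, when the search field is matched against the word segmentation set, the characters of the word segmentation set are inserted, indexed and stored;

wherein the word segmentation set comprises an initial node, a plurality of intermediate nodes and an end node; the initial node is located at the successfully matched phrase with the smallest sequence number in the history record, the intermediate nodes are located at the phrases successfully matched each time in the history record, and the end node is located at the successfully matched phrase with the largest sequence number in the history record; each match has a path from the initial node to the end node, and a plurality of intermediate nodes exist on each path;

when a word stored in the word segmentation set is looked up, the query starts from the initial node and traverses along a branch until the last character of the word is reached, at which point the query is complete.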

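5. The word segmentation method for improving the accuracy of searching hospital names according to claim 4, characterized in that the word segmentation set is matched as follows:

acquiring the first character of the search field, finding the initial node corresponding to the first character, and jumping to the intermediate node of the next character to wait for the next query;

acquiring the second character of the character string to be queried at that intermediate node, and jumping again to the intermediate node of the next character to wait for the next query;

repeating this operation until the last character of the word is reached, which serves as the end node;

reading the information of the last character's node and returning all the characters on the path traversed, which completes the query.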

6. The word segmentation method for improving the accuracy of searching hospital names according to claim 5, characterized in that, when a segmented phrase still cannot be successfully matched, pinyin matching is performed on all characters in the phrase: the pinyin of each character is obtained, and its initials and finals are combined and matched against the initials and finals of the pinyin in the search field.