CN116226362A - Word segmentation method for improving accuracy of searching hospital names - Google Patents


Publication number
CN116226362A
Authority
CN
China
Prior art keywords
word segmentation
word
matching
segmentation
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310500980.0A
Other languages
Chinese (zh)
Other versions
CN116226362B (en)
Inventor
罗方义
吴红曼
刘雨鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Deya Manda Technology Co ltd
Original Assignee
Hunan Deya Manda Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Deya Manda Technology Co ltd
Priority to CN202310500980.0A
Publication of CN116226362A
Application granted
Publication of CN116226362B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a word segmentation method for improving the accuracy of searching hospital names, belonging to the technical field of hospital information. The method comprises: decomposing the text of a set of common names of target hospitals character by character to form a set of single characters; combining the characters in the set forward and backward to form word segments, matching the word segments against a dictionary in a database, and outputting the word segmentation results that match successfully; and displaying the matching results in order of the matching degree of the word segmentation results. The method checks and matches the characters entered by the user one by one and eliminates ambiguity in word segmentation, thereby greatly improving the accuracy and efficiency of searching and improving the user experience.

Description

Word segmentation method for improving accuracy of searching hospital names
Technical Field
The invention relates to a word segmentation method in the technical field of hospital information, and in particular to a word segmentation method for improving the accuracy of searching hospital names.
Background
With the spread of intelligent and information technology, users can learn about the outside world without leaving home and can obtain different types of information by searching on internet devices, so that people's information can be kept in sync. With the advent of the information age, the internet plays an increasingly important role in all aspects of production and daily life, and for China, where Chinese is the native language, Chinese information processing technology plays a very important role in the country's informatization.
When a user searches for a hospital name in daily life, the name is usually long. If the full name of the hospital is not entered, several different hospital names may appear in the search box, and several hospitals may exist in the same city, so the user cannot be sure which hospital is the right one, which degrades the user experience.
Chinese patent publication No. CN112199494A discloses a medical information searching method, apparatus, electronic device and storage medium. The method determines a medical query sentence and preprocesses it to obtain a word segmentation sequence comprising a plurality of medical words; obtains a pre-built inverted index table; determines an initial text field for each medical word; determines the medical words in the initial text fields as boundary words; determines target text fields from the initial text fields, each target text field corresponding to one query dimension; determines the search library corresponding to the search request according to the query dimension; and searches for the medical words in that library to obtain the search result of the search request.
Chinese patent publication No. CN109543178A discloses a method and system for constructing judicial text label system. Obtaining judicial vocabulary texts through a word segmentation tool, constructing a primary tag system according to word frequency statistics, merging tags with similar semantics in the primary tag system, expanding a harsh tag to obtain an expanded tag system, counting the accuracy of searching the texts by the expanded tag system by utilizing a text test set, verifying whether the current expanded tag system is constructed, and otherwise, further optimizing the tag system.
Chinese patent publication No. CN111950283A discloses a Chinese word segmentation and named entity recognition system for large-scale medical text mining: word vectors are obtained from word2vec and the segmented text and input into a stacked BiLSTM-CRF model; the first layer of the model labels entities on the word vectors; part-of-speech features are added to the entity-labeled word vectors to form an input feature set; and the second layer of the model performs complex named entity recognition on that feature set.
The prior art has the following problems: when the target information is segmented, it is not decomposed into single characters and the characters are not recombined, so information is missed and searching and matching are not accurate enough; homophone replacement search is not performed, so tolerance of mistyped search terms is insufficient; words are not disambiguated; and word segmentation based on a semantic model is computationally complex and demanding, so the system faces heavy computing and operating pressure under internet-scale search loads.
Disclosure of Invention
The invention aims to provide a word segmentation method for improving the accuracy of searching hospital names, and solves the defects in the background technology.
In order to achieve the above purpose, the present invention provides the following technical solutions: a word segmentation method for improving accuracy of searching for hospital names, the word segmentation method comprising the following steps:
S1, establishing a word segmentation set formed from a single-character set, based on a set of common names of target hospitals, specifically comprising the following sub-steps:
S11, establishing a common name set [Figure SMS_1] according to the input common names of the target hospitals;
S12, decomposing the words and phrases in the common name set [Figure SMS_2] one by one to form a single-character set [Figure SMS_3]; the single-character set is [Figure SMS_4], wherein [Figure SMS_5] to [Figure SMS_6] are single characters;
S2, combining the single characters in the single-character set [Figure SMS_7] forward and backward to form word segments, and matching the word segments against a dictionary in a database, comprising the following sub-steps:
S21, combining all the single characters in the single-character set [Figure SMS_8] in forward order and in reverse order to obtain a word segmentation set [Figure SMS_9]; the word segmentation set is [Figure SMS_10], wherein [Figure SMS_11]; [Figure SMS_12] is a two-character phrase set, [Figure SMS_13] is a three-character phrase set, and [Figure SMS_14] is a four-character phrase set, satisfying the following conditions:
[Figure SMS_17]
[Figure SMS_19]
[Figure SMS_21]
wherein [Figure SMS_15]; [Figure SMS_18] is the initial character, [Figure SMS_20], [Figure SMS_22], [Figure SMS_16]; each word segment consists of an initial character and following characters;
S22, matching the search field entered by the searcher against the word segmentation set [Figure SMS_23]:
S221, if the matching succeeds, removing the matched phrase from the word segmentation set [Figure SMS_24], and using the remaining part as a new word segmentation set for repeated combination and matching;
S222, if the matching fails, intercepting one or more single characters from the word segmentation set [Figure SMS_25] in the forward or reverse direction to form a character string to be matched, and matching the character string against the search field, until phrase matching in the word segmentation set [Figure SMS_26] is complete or the interception reaches the last character [Figure SMS_27];
S3, outputting the word segmentation results that matched successfully;
S4, displaying the matching results in order of the matching degree of the word segmentation results.
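Steps S1 to S4 can be sketched in Python as follows. This is only an illustrative reading, not the patented implementation: the formulas defining the phrase sets are images in the source, so the sketch assumes that the two-, three- and four-character phrases are contiguous runs of characters read in forward and in reverse order, and it assumes a hypothetical matching degree equal to the number of phrases of a name that occur in the search field. The function names and the sample names are made up for illustration.

```python
def single_characters(names):
    """S12: decompose every common name into single characters, in order."""
    return [ch for name in names for ch in name]


def phrase_set(name, sizes=(2, 3, 4)):
    """S21 (assumed reading): contiguous 2-, 3- and 4-character phrases,
    taken from the name in forward order and in reverse order."""
    phrases = set()
    for text in (name, name[::-1]):
        for n in sizes:
            for i in range(len(text) - n + 1):
                phrases.add(text[i:i + n])
    return phrases


def rank(search_field, names):
    """S22 to S4: match the search field against each name's phrase set and
    return the matching names, highest (assumed) matching degree first."""
    scored = []
    for name in names:
        # Hypothetical matching degree: how many phrases occur in the query.
        degree = sum(1 for p in phrase_set(name) if p in search_field)
        if degree > 0:
            scored.append((degree, name))
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [name for _, name in scored]


if __name__ == "__main__":
    sample = ["湘雅医院", "湘雅二医院", "中心医院"]  # illustrative names only
    print(rank("湘雅二", sample))
```

Counting matched phrases is just one possible scoring rule; the source does not define how the matching degree is computed.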
Further, the combined text that cannot be matched successfully is segmented and disambiguated, with the following specific steps:
S5, determining the text that cannot be matched successfully as the Chinese text Y to be segmented, and segmenting it by the forward maximum matching method, the reverse maximum matching method and an HMM to obtain the word segmentation results [Figure SMS_28]; the segmentation results of the forward maximum matching method, the reverse maximum matching method and the HMM word segmentation method are denoted respectively as [Figure SMS_29];
S6, comparing the three word segmentation results and marking the parts that are not identical across them as the ambiguous parts;
S7, determining which ambiguity result the ambiguous part belongs to and disambiguating:
S71, first result: if the result is [Figure SMS_30] or [Figure SMS_31] or [Figure SMS_32], that is, any two of the three word segmentation results are identical, the word segmentation result [Figure SMS_33] is taken as the final segmentation;
S72, second result: if the result is [Figure SMS_34], that is, the three word segmentation results all differ from one another, the word segmentation result [Figure SMS_35] is taken as the final segmentation;
when the ambiguity result is the second result, a second disambiguation is performed on the basis of the first: the parts of speech of the three word segmentation results are tagged using the HMM, the ambiguous parts that differ in each word segmentation result are screened out, the segmentation that maximizes the evaluation function is obtained, and that segmentation is taken as the final segmentation.
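The first disambiguation in steps S5 to S7 can be sketched as below. The sketch implements forward and reverse maximum matching against a small made-up vocabulary and takes the HMM segmentation as an input rather than computing it. It assumes that, when two results agree, the agreed segmentation is the one kept (the source shows the chosen result only as a formula image), and it returns `None` for the second result, where the part-of-speech tagging and evaluation function, which the source does not specify, would be needed.

```python
def forward_max_match(text, vocab, max_len=4):
    """Forward maximum matching: scan left to right, longest dictionary hit first."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:  # a single character is always accepted
                out.append(piece)
                i += n
                break
    return out


def reverse_max_match(text, vocab, max_len=4):
    """Reverse maximum matching: scan right to left, longest dictionary hit first."""
    out, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            piece = text[j - n:j]
            if n == 1 or piece in vocab:
                out.append(piece)
                j -= n
                break
    return out[::-1]


def first_disambiguation(fmm, rmm, hmm):
    """S71/S72 (assumed reading): keep the segmentation on which at least two
    methods agree; return None when all three differ (second result)."""
    if fmm == rmm or fmm == hmm:
        return fmm
    if rmm == hmm:
        return rmm
    return None


if __name__ == "__main__":
    vocab = {"市中心", "中心", "医院", "中心医院"}  # hypothetical dictionary
    text = "市中心医院"
    fmm = forward_max_match(text, vocab)   # ['市中心', '医院']
    rmm = reverse_max_match(text, vocab)   # ['市', '中心医院']
    hmm = ["市", "中心医院"]                # stand-in for an HMM segmenter's output
    print(first_disambiguation(fmm, rmm, hmm))
```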
Further, before word segmentation is performed on the common name set [Figure SMS_36], the common name set [Figure SMS_37] is preprocessed: Chinese and English numbers, domain names and the like with obvious features are recognized, the text of the common name set [Figure SMS_38] is filtered, word frequencies are counted and candidate words are selected, and Chinese and English numbers and domain names are screened and filtered repeatedly until none remain to be selected.
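A minimal sketch of this preprocessing pass is given below. It assumes one reading of the step: domain names, Latin letters and Arabic digits are recognized with regular expressions, pulled out of the name, and the pass repeats until nothing more is found. The patterns and the function name are illustrative assumptions, and word-frequency counting and candidate selection are not covered.

```python
import re

# Assumed patterns: a simple dotted domain name, and runs of Latin letters or digits.
DOMAIN = re.compile(r"(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")
LATIN_OR_DIGIT = re.compile(r"[A-Za-z0-9]+")


def preprocess(name):
    """Strip recognizable tokens repeatedly; return (remaining text, tokens found)."""
    tokens = []
    changed = True
    while changed:  # filter again until no pattern matches any more
        changed = False
        for pattern in (DOMAIN, LATIN_OR_DIGIT):
            found = pattern.findall(name)
            if found:
                tokens.extend(found)
                name = pattern.sub("", name)
                changed = True
    return name.strip(), tokens


if __name__ == "__main__":
    print(preprocess("示例医院 www.example.com"))
```

Returning the recognized tokens alongside the cleaned text keeps them available in case they are needed later for matching.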
Further, when the search field is matched against the word segmentation set [Figure SMS_39], character insertion, indexing and storage are performed on the word segmentation set [Figure SMS_40];
wherein the word segmentation set [Figure SMS_41] comprises an initial node, a plurality of intermediate nodes and an end node; the initial node is located at the successfully matched phrase with the smallest sequence number in the history record, the intermediate nodes are located at the phrases successfully matched each time in the history record, and the end node is located at the successfully matched phrase with the largest sequence number in the history record; each match has paths from the initial node to the end node, and a plurality of intermediate nodes exist on each path;
when searching for a word stored in the word segmentation set [Figure SMS_42], the search starts from the initial node and traverses along a branch until the last character of the word segment is reached, completing the query.
Further, the matching method for the word segmentation set [Figure SMS_43] is as follows:
acquiring the first character of the search field, finding the initial node corresponding to the first character, and jumping to the intermediate node of the next character to await the next query;
acquiring the second character of the character string to be queried from the intermediate node, and jumping again to the intermediate node of the next character to await the next query;
repeating this operation until the last character of the word segment is reached, which serves as the end node;
reading the information of the last character node, returning all characters on the path traversed, and completing the query.
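The node structure and character-by-character traversal described above resemble a prefix tree (trie). The source does not name the data structure, so the following is a sketch under that assumption, with hypothetical class and method names; it does not model the history-record numbering of nodes.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> child node
        self.is_end = False  # True if a stored word segment ends here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Insert a word segment character by character."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def lookup(self, query):
        """Follow one branch per character; return the characters on the path
        if the walk ends on an end node, otherwise None."""
        node, path = self.root, []
        for ch in query:
            node = node.children.get(ch)
            if node is None:
                return None
            path.append(ch)
        return "".join(path) if node.is_end else None


if __name__ == "__main__":
    trie = Trie()
    for segment in ("医院", "医学院"):
        trie.insert(segment)
    print(trie.lookup("医院"), trie.lookup("医学"))
```

Each lookup costs time proportional to the length of the query rather than to the number of stored segments, which fits the efficiency goal stated in the text.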
Further, when a word segment still cannot be matched successfully, pinyin matching is performed on all characters in the word segment: for each character [Figure SMS_44], its pinyin [Figure SMS_45] is obtained and matched in combination against the initials and finals of the pinyin in the search field.
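The pinyin fallback can be illustrated with the sketch below. It uses a tiny hand-written pinyin table in place of the database lookup, ignores tones, and treats two characters as matching when their initials and finals are both equal. The table, the list of initials (which includes y and w for convenience) and the function names are assumptions made for this example.

```python
# Hypothetical toneless pinyin table; a real system would query a full dictionary.
PINYIN = {"湘": "xiang", "香": "xiang", "雅": "ya", "亚": "ya", "医": "yi", "院": "yuan"}

# Two-letter initials come first so that "zh" is not split as "z" + "h...".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable):
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # syllable with no initial, e.g. "an"


def same_sound(a, b):
    """True if both characters are in the table with equal initial and final."""
    pa, pb = PINYIN.get(a), PINYIN.get(b)
    if pa is None or pb is None:
        return False
    return split_syllable(pa) == split_syllable(pb)


def pinyin_match(search_field, segment):
    """Character-by-character homophone comparison of two equal-length strings."""
    return len(search_field) == len(segment) and all(
        same_sound(q, s) for q, s in zip(search_field, segment)
    )


if __name__ == "__main__":
    print(pinyin_match("香亚", "湘雅"))  # mistyped homophones of the same syllables
```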
Beneficial effects: the invention discloses a word segmentation method in the technical field of hospital information, and in particular a word segmentation method for improving the accuracy of searching hospital names. Characters in the text set are combined forward and backward to form words, the words are matched against the vocabulary in a database, and the word segmentation results that match successfully are output; the matching results are displayed in order of the matching degree of the word segmentation results. The method checks and matches the characters entered by the user one by one and eliminates ambiguity in word segmentation, thereby greatly improving the accuracy and efficiency of searching and improving the user experience.
Drawings
Fig. 1 is a schematic diagram of the operation of the present invention.
Fig. 2 is a flow chart of the operation of the present invention.
Fig. 3 is a diagram of the disambiguation step of the present invention.
FIG. 4 is a word segmentation matching flow diagram of the present invention.
FIG. 5 is a schematic diagram of word segmentation matching of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.
A word segmentation method for improving accuracy of searching hospital names comprises the following steps:
establishing a corresponding text set according to the input target text, and decomposing the text in the set character by character to form a set of single characters;
combining the characters in the set forward and backward to form word segments, matching the word segments against a dictionary in a database, and outputting the word segmentation results that match successfully;
and displaying the matching results in order of the matching degree of the word segmentation results.
In one embodiment, a common name set [Figure SMS_46] is established according to the input common names of the target hospitals; the words and phrases in the common name set [Figure SMS_47] are decomposed one by one to form a single-character set [Figure SMS_48]; the single-character set is [Figure SMS_49], wherein [Figure SMS_50] to [Figure SMS_51] are single characters.
In one embodiment, combining the single characters in the single-character set [Figure SMS_52] forward and backward to form word segments and matching the word segments against a dictionary in a database comprises the following steps:
combining all the single characters in the single-character set [Figure SMS_53] in forward order and in reverse order to obtain a word segmentation set [Figure SMS_54]; the word segmentation set is [Figure SMS_55], wherein [Figure SMS_56]; [Figure SMS_57] is a two-character phrase set, [Figure SMS_58] is a three-character phrase set, and [Figure SMS_59] is a four-character phrase set, satisfying the following conditions:
[Figure SMS_61]
[Figure SMS_63]
[Figure SMS_65]
wherein [Figure SMS_62]; [Figure SMS_64] is the initial character, [Figure SMS_66], [Figure SMS_67], [Figure SMS_60]; each word segment consists of an initial character and following characters;
matching the search field entered by the searcher against the word segmentation set [Figure SMS_68]:
if the matching succeeds, removing the matched phrase from the word segmentation set [Figure SMS_69], and using the remaining part as a new word segmentation set for repeated combination and matching;
if the matching fails, intercepting one or more single characters from the word segmentation set [Figure SMS_70] in the forward or reverse direction to form a character string to be matched, and matching the character string against the search field, until phrase matching in the word segmentation set [Figure SMS_71] is complete or the interception reaches the last character [Figure SMS_72];
outputting the word segmentation results that matched successfully;
and displaying the matching results in order of the matching degree of the word segmentation results.
In one embodiment, combined text that cannot be matched successfully needs to be segmented and disambiguated, with the following specific steps:
determining the text that cannot be matched successfully as the Chinese text Y to be segmented, and segmenting it by the forward maximum matching method, the reverse maximum matching method and an HMM to obtain the word segmentation results [Figure SMS_73]; the segmentation results of the forward maximum matching method, the reverse maximum matching method and the HMM word segmentation method are denoted respectively as [Figure SMS_74];
comparing the three word segmentation results and marking the parts that are not identical across them as the ambiguous parts;
determining which ambiguity result the ambiguous part belongs to and disambiguating:
first result: if the result is [Figure SMS_75] or [Figure SMS_76] or [Figure SMS_77], that is, any two of the three word segmentation results are identical, the word segmentation result [Figure SMS_78] is taken as the final segmentation;
second result: if the result is [Figure SMS_79], that is, the three word segmentation results all differ from one another, the word segmentation result [Figure SMS_80] is taken as the final segmentation;
when the ambiguity result is the second result, a second disambiguation is performed on the basis of the first: the parts of speech of the three word segmentation results are tagged using the HMM, the ambiguous parts that differ in each word segmentation result are screened out, the segmentation that maximizes the evaluation function is obtained, and that segmentation is taken as the final segmentation.
In one embodiment, before word segmentation is performed on the common name set [Figure SMS_81], the common name set [Figure SMS_82] needs to be preprocessed: Chinese and English numbers, domain names and the like with obvious features are recognized, the text of the common name set [Figure SMS_83] is filtered, word frequencies are counted and candidate words are selected, and Chinese and English numbers, domain names and the like are screened and filtered repeatedly until none remain to be selected; in this way domain names can be distinguished, and accuracy and recognition efficiency can be greatly improved.
In one embodiment, when the search field is matched against the word segmentation set [Figure SMS_84], character insertion, indexing and storage are performed on the word segmentation set [Figure SMS_85];
wherein the word segmentation set [Figure SMS_86] comprises an initial node, a plurality of intermediate nodes and an end node; the initial node is located at the successfully matched phrase with the smallest sequence number in the history record, the intermediate nodes are located at the phrases successfully matched each time in the history record, and the end node is located at the successfully matched phrase with the largest sequence number in the history record; each match has paths from the initial node to the end node, and a plurality of intermediate nodes exist on each path;
when searching for a word stored in the word segmentation set [Figure SMS_87], the search starts from the initial node and traverses along a branch until the last character of the word segment is reached, completing the query.
In one embodiment, the matching method for the word segmentation set [Figure SMS_88] is as follows:
acquiring the first character of the search field, finding the initial node corresponding to the first character, and jumping to the intermediate node of the next character to await the next query;
acquiring the second character of the character string to be queried from the intermediate node, and jumping again to the intermediate node of the next character to await the next query;
repeating this operation until the last character of the word segment is reached, which serves as the end node;
reading the information of the last character node, returning all characters on the path traversed, and completing the query.
In one embodiment, when a word segment still cannot be matched successfully, pinyin matching is performed on all characters in the word segment: the database is used to look up the pinyin of each character in the text set so that characters with the same pinyin are matched; for each character [Figure SMS_89], its pinyin [Figure SMS_90] is obtained and matched in combination against the initials and finals of the pinyin in the search field.
Obviously, the above embodiments are given only for clarity of illustration and do not limit the implementations. Those of ordinary skill in the art can make other variations or modifications on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here, and obvious variations or modifications derived from them remain within the scope of protection of the invention.

Claims (6)

1. A word segmentation method for improving the accuracy of searching hospital names, characterized by comprising the following steps:
S1, establishing a word segmentation set formed from a single-character set, based on a set of common names of target hospitals, specifically comprising the following sub-steps:
S11, establishing a common name set [Figure QLYQS_1] according to the input common names of the target hospitals;
S12, decomposing the words and phrases in the common name set [Figure QLYQS_2] one by one to form a single-character set [Figure QLYQS_3]; the single-character set is [Figure QLYQS_4], wherein [Figure QLYQS_5] to [Figure QLYQS_6] are single characters;
S2, combining the single characters in the single-character set [Figure QLYQS_7] forward and backward to form word segments, and matching the word segments against a dictionary in a database, comprising the following sub-steps:
S21, combining all the single characters in the single-character set [Figure QLYQS_8] in forward order and in reverse order to obtain a word segmentation set [Figure QLYQS_9]; the word segmentation set is [Figure QLYQS_10], wherein [Figure QLYQS_11]; [Figure QLYQS_12] is a two-character phrase set, [Figure QLYQS_13] is a three-character phrase set, and [Figure QLYQS_14] is a four-character phrase set, satisfying the following conditions:
[Figure QLYQS_15]
[Figure QLYQS_18]
[Figure QLYQS_21]
wherein [Figure QLYQS_17]; [Figure QLYQS_19] is the initial character, [Figure QLYQS_20], [Figure QLYQS_22], [Figure QLYQS_16]; each word segment consists of an initial character and following characters;
S22, matching the search field entered by the searcher against the word segmentation set [Figure QLYQS_23]:
S221, if the matching succeeds, removing the matched phrase from the word segmentation set [Figure QLYQS_24], and using the remaining part as a new word segmentation set for repeated combination and matching;
S222, if the matching fails, intercepting one or more single characters from the word segmentation set [Figure QLYQS_25] in the forward or reverse direction to form a character string to be matched, and matching the character string against the search field, until phrase matching in the word segmentation set [Figure QLYQS_26] is complete or the interception reaches the last character [Figure QLYQS_27];
S3, outputting the word segmentation results that matched successfully;
S4, displaying the matching results in order of the matching degree of the word segmentation results.
2. The word segmentation method for improving the accuracy of searching hospital names according to claim 1, characterized in that the combined text that cannot be matched successfully is segmented and disambiguated, with the following specific steps:
S5, determining the text that cannot be matched successfully as the Chinese text Y to be segmented, and segmenting it by the forward maximum matching method, the reverse maximum matching method and an HMM to obtain the word segmentation results [Figure QLYQS_28]; the segmentation results of the forward maximum matching method, the reverse maximum matching method and the HMM word segmentation method are denoted respectively as [Figure QLYQS_29];
S6, comparing the three word segmentation results and marking the parts that are not identical across them as the ambiguous parts;
S7, determining which ambiguity result the ambiguous part belongs to and disambiguating:
S71, first result: if the result is [Figure QLYQS_30] or [Figure QLYQS_31] or [Figure QLYQS_32], that is, any two of the three word segmentation results are identical, the word segmentation result [Figure QLYQS_33] is taken as the final segmentation;
S72, second result: if the result is [Figure QLYQS_34], that is, the three word segmentation results all differ from one another, the word segmentation result [Figure QLYQS_35] is taken as the final segmentation;
when the ambiguity result is the second result, a second disambiguation is performed on the basis of the first: the parts of speech of the three word segmentation results are tagged using the HMM, the ambiguous parts that differ in each word segmentation result are screened out, the segmentation that maximizes the evaluation function is obtained, and that segmentation is taken as the final segmentation.
3. The word segmentation method for improving accuracy of searching for hospital names according to claim 2, wherein before word segmentation is performed on the common name set [formula image QLYQS_36], the common name set [formula image QLYQS_37] is preprocessed: Chinese and English numbers, domain names and other items with obvious features are recognized, the text set of the common name set [formula image QLYQS_38] is filtered, word frequencies are counted and candidate words are selected, Chinese and English numbers and domain names are screened out, and the filtering is repeated until no Chinese or English numbers or domain names remain to be selected.
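A rough sketch of this preprocessing loop is given below. The claim does not give the recognition rules, so the two regular expressions, the function names and the use of character bigrams as candidate words are all assumptions made for illustration; in particular, only half-width and full-width digits are treated as numbers here.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical patterns; the claim does not specify the recognition rules.
DOMAIN = re.compile(r"(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")
NUMBER = re.compile(r"[0-9０-９]+")


def strip_marked_tokens(name: str) -> Tuple[str, List[str]]:
    """Remove domain names and digit runs, repeating until none remain."""
    removed: List[str] = []
    changed = True
    while changed:
        changed = False
        for pattern in (DOMAIN, NUMBER):
            match = pattern.search(name)
            if match:
                removed.append(match.group())
                name = name[:match.start()] + name[match.end():]
                changed = True
    return name, removed


def candidate_words(names: Iterable[str], min_count: int = 2) -> Dict[str, int]:
    """Count character bigrams in the cleaned names and keep the frequent ones."""
    counts: Counter = Counter()
    for name in names:
        cleaned, _ = strip_marked_tokens(name)
        counts.update(cleaned[i:i + 2] for i in range(len(cleaned) - 1))
    return {gram: n for gram, n in counts.items() if n >= min_count}
```

The loop in `strip_marked_tokens` mirrors the "filter repeatedly until nothing is left to select" wording; a production filter would likely need broader patterns than these two.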
4. The word segmentation method for improving accuracy of searching for hospital names according to claim 3, wherein when the search field is matched against the word segmentation set [formula image QLYQS_39], characters are inserted into, indexed in and stored in the word segmentation set [formula image QLYQS_40];
wherein the word segmentation set [formula image QLYQS_41] comprises an initial node, a plurality of intermediate nodes and an end node; the initial node is located at the successfully matched phrase with the smallest sequence number in the history record, each intermediate node is located at a phrase successfully matched in the history record, and the end node is located at the successfully matched phrase with the largest sequence number in the history record; each match has paths from the initial node to the end node, and a plurality of intermediate nodes exist on each path;
when a word stored in the word segmentation set [formula image QLYQS_42] is looked up, the lookup starts from the initial node and traverses along one branch until the last character of the word is reached, at which point the query is complete.
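The node structure described in this claim reads much like a prefix tree (trie); that reading is an assumption, and the class and method names below are hypothetical. A minimal sketch of inserting words and looking one up by following a single branch from the root:

```python
from typing import Dict


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_end = False  # True when a stored word ends at this node


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()  # plays the role of the initial node

    def insert(self, word: str) -> None:
        """Store a word one character per node, creating nodes as needed."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def contains(self, word: str) -> bool:
        """Follow one branch from the root; succeed only at an end node."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end
```

Words that share a prefix share the nodes for that prefix, so lookup cost depends on the length of the word rather than on the number of stored words.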
5. The word segmentation method for improving accuracy of searching for hospital names according to claim 4, wherein the matching method of the word segmentation set [formula image QLYQS_43] is as follows:
acquiring the first character of the search field, finding the initial node corresponding to the first character, and jumping to the intermediate node of the next character to wait for the next query;
acquiring the second character of the character string to be queried from the intermediate node, and jumping again to the intermediate node of the next character to wait for the next query;
repeating the operation until the last character of the word is reached, which serves as the end node;
reading the information of the last character's node, returning all characters on the path traversed, and completing the query.
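The character-by-character walk above can be sketched with a nested-dictionary prefix tree. This is an illustrative reading rather than the claimed implementation: `build`, `match_prefixes` and the `END` marker key are hypothetical names, and returning every stored word found along the path is a design choice made for the example.

```python
from typing import Dict, Iterable, List

END = "$end"  # hypothetical marker key; never collides with a single character


def build(words: Iterable[str]) -> Dict:
    """Build a nested-dict prefix tree with one level per character."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def match_prefixes(root: Dict, field: str) -> List[str]:
    """Walk the search field one character at a time and collect every
    stored word whose end node lies on the path traversed."""
    found: List[str] = []
    path: List[str] = []
    node = root
    for ch in field:
        node = node.get(ch)
        if node is None:
            break  # no branch for this character: stop the walk
        path.append(ch)
        if END in node:
            found.append("".join(path))
    return found
```

Taking the last element of the returned list gives the longest stored word that prefixes the search field, which is the behaviour a maximum-matching segmenter would typically want.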
6. The word segmentation method for improving the accuracy of searching for hospital names according to claim 5, wherein when a segmented word still cannot be successfully matched, pinyin glyph matching is performed on all characters in the segmented word: each glyph [formula image QLYQS_44] is obtained, the pinyin of the glyph [formula image QLYQS_45] is obtained, and combination matching is performed with the initials and finals of the pinyin in the search field.
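One possible reading of this fallback is sketched below: each character is mapped to its pinyin, the pinyin is split into an initial and a final, and the search field is compared against the combined forms. The tiny `PINYIN` lexicon, the function names and the two accepted query forms (full pinyin, or initials only) are assumptions for illustration; a real system would use a complete character-to-pinyin table.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mini lexicon mapping characters to toneless pinyin.
PINYIN: Dict[str, str] = {"中": "zhong", "心": "xin", "医": "yi", "院": "yuan"}

# Two-letter initials come first so they are tried before one-letter ones.
# "y" and "w" are treated as initials here for simplicity.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable: str) -> Tuple[str, str]:
    """Split a pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "an"


def pinyin_keys(word: str) -> Optional[Tuple[str, str]]:
    """Return (full pinyin, initials only), or None if a character is unknown."""
    parts: List[Tuple[str, str]] = []
    for ch in word:
        if ch not in PINYIN:
            return None
        parts.append(split_syllable(PINYIN[ch]))
    full = "".join(initial + final for initial, final in parts)
    initials = "".join(initial for initial, _ in parts)
    return full, initials


def pinyin_match(query: str, word: str) -> bool:
    """True if the query equals the word's full pinyin or its initials."""
    keys = pinyin_keys(word)
    return keys is not None and query.lower() in keys
```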
CN202310500980.0A 2023-05-06 2023-05-06 Word segmentation method for improving accuracy of searching hospital names Active CN116226362B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310500980.0A CN116226362B (en) 2023-05-06 2023-05-06 Word segmentation method for improving accuracy of searching hospital names

Publications (2)

Publication Number Publication Date
CN116226362A true CN116226362A (en) 2023-06-06
CN116226362B CN116226362B (en) 2023-07-18

Family

ID=86571606

Country Status (1)

Country Link
CN (1) CN116226362B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant