CN108170682B - Chinese word segmentation method based on professional vocabulary and computing equipment - Google Patents
- Publication number: CN108170682B
- Application number: CN201810050618.7A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/279 Recognition of textual entities > G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
- G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F16/00 Information retrieval; Database structures therefor; File system structures therefor > G06F16/90 Details of database functions independent of the retrieved data types > G06F16/903 Querying > G06F16/90335 Query processing > G06F16/90344 Query processing by using string matching techniques
Abstract
The invention discloses a Chinese word segmentation method based on professional vocabulary, suitable for execution in a computing device. The method comprises the following steps: constructing a dictionary with a predetermined structure by reading in entries one by one, wherein entries with the same first character are arranged in ascending order of their Unicode codes, a plurality of first arrays are established, each storing the entries that share a first character, and at least one second array is established in each first array to store the content of an entry and an identification bit indicating whether the entry belongs to the professional vocabulary; looking up, by binary search, one or more character strings of the sentence to be segmented in the dictionary, to obtain a plurality of to-be-determined words after preliminary segmentation; setting a segmentation weight for each to-be-determined word according to its identification bit; and constructing segmentation paths from the to-be-determined words and their segmentation weights, and selecting the shortest path as the word segmentation result. The invention also discloses a computing device for executing the method.
Description
Technical Field
The invention relates to the technical field of information processing, and in particular to a Chinese word segmentation method based on professional vocabulary and a computing device.
Background
Chinese information processing technology is widely applied in computing fields such as computer networks, database technology and software engineering. Automatic Chinese word segmentation is an important foundation of Chinese information processing, and many Chinese information processing tasks involve it, such as machine translation, automatic summarization, automatic classification, full-text retrieval of Chinese document libraries, and search engines. Because Chinese text is written continuously, with no spaces between words, word segmentation is the first problem to be solved in Chinese text processing, and correct segmentation is a prerequisite for any further processing. In addition, Chinese word segmentation methods are not limited to Chinese applications; they can also be applied to English processing. In handwriting recognition, for example, the spaces between words are often unclear, and Chinese word segmentation methods can help determine the boundaries of English words. Research on Chinese word segmentation technology is therefore of great significance.
Although the basic unit of expression in modern Chinese is the word, and many words consist of two or more characters, the boundary between words and phrases is hard to draw because people understand it differently. For example, for "punish on everywhere," different people may apply different criteria as to what is a word and what is a phrase; likewise, for "sea," "brewery," and the like, even the same person may make different judgments. The dictionaries used by existing Chinese word segmentation technology are fairly general; no dictionary dedicated to professional vocabulary is provided, so professional terms are often segmented inaccurately.
Therefore, a Chinese word segmentation method capable of recognizing professional vocabularies is needed, so that the word segmentation accuracy is further improved.
Disclosure of Invention
To this end, the present invention provides a professional vocabulary based Chinese segmentation method and computing device in an attempt to solve or at least alleviate the above-identified problems.
According to one aspect of the invention, a Chinese word segmentation method based on professional vocabulary is provided, which is suitable for execution in a computing device and comprises the following steps: constructing a dictionary with a predetermined structure by reading in entries one by one, wherein entries with the same first character are arranged in ascending order of their Unicode codes, a plurality of first arrays are established, each storing the entries that share a first character, and at least one second array is established in each first array to store the content of an entry and an identification bit indicating whether the entry belongs to the professional vocabulary; looking up, by binary search, one or more character strings of the sentence to be segmented in the dictionary, to obtain a plurality of to-be-determined words after preliminary segmentation; setting a segmentation weight for each to-be-determined word according to its identification bit; and constructing segmentation paths from the to-be-determined words and their segmentation weights, and selecting the shortest path as the word segmentation result.
Optionally, in the method according to the present invention, the step of setting a segmentation weight for each to-be-determined word according to its identification bit includes: if the identification bit indicates that the to-be-determined word belongs to the professional vocabulary, setting a first segmentation weight for it; and if the identification bit indicates that it does not belong to the professional vocabulary, setting a second segmentation weight for it, wherein the first segmentation weight is less than the second segmentation weight.
Optionally, in the method according to the present invention, the step of constructing segmentation paths from the to-be-determined words and their segmentation weights and selecting the shortest path as the word segmentation result includes: taking each character in the sentence to be segmented as a node, wherein the first character of the sentence is the start node and the last character is the end node; constructing, in sequence, a plurality of segmentation paths between the start node and the end node according to the to-be-determined words; calculating the length of each segmentation path using the segmentation weight of each to-be-determined word; and selecting the segmentation path with the shortest length as the word segmentation result.
Optionally, in the method according to the present invention, the step of constructing a dictionary having a predetermined structure by reading in entries one by one includes: establishing an input stream to read the entries in sequence; determining whether a first array exists for storing entries whose first character is the first character of the entry just read; if no such first array exists, creating one, according to the first character of the entry read, for storing all entries with that first character; establishing a second array in the first array to store the entry content; and determining whether the entry belongs to the professional vocabulary, assigning a first numerical value to its identification bit if it does and a second numerical value if it does not.
Optionally, in the method according to the present invention, before the step of looking up one or more character strings of the sentence to be segmented in the dictionary by binary search to obtain a plurality of to-be-determined words after preliminary segmentation, the method further includes: identifying non-Chinese characters in a source sentence to be processed; and removing the identified non-Chinese characters from the source sentence to obtain the sentence to be segmented.
Optionally, in the method according to the invention, the non-Chinese characters include punctuation marks, numeric characters, English characters, and non-visible control characters.
Optionally, in the method according to the present invention, the step of looking up one or more character strings of the sentence to be segmented in the dictionary by binary search to obtain a plurality of to-be-determined words after preliminary segmentation includes, for each character in the sentence to be segmented: locating, according to the Unicode code of the character, the first array that stores the entries with that character as first character; forming at least one character string with the character as its first character, and looking up the character string among all entries of the first array by binary search; and, when an entry matching the character string is found, taking the character string as a to-be-determined word.
Optionally, in the method according to the present invention, the step of forming at least one character string with the character as its first character and looking up the character string among all entries of the first array by binary search further includes: if the first array contains an entry consisting only of the character, judging the character to be a complete word; and taking the character as a to-be-determined word.
According to another aspect of the present invention, there is provided a computing device comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described above.
According to a further aspect of the invention, there is provided a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods described above.
According to the professional-vocabulary-based Chinese word segmentation scheme of the invention, an identification bit indicating whether an entry is professional vocabulary is added when the dictionary is built. During segmentation, a smaller segmentation weight can then be set for each to-be-determined word identified as professional vocabulary, the length of each segmentation path is calculated from these weights, and the shortest path is selected as the word segmentation result. This scoring mechanism resolves the choice among possible paths and improves the accuracy of the segmentation result: it handles crossing ambiguity better and raises the recognition rate of professional vocabulary in specialized fields, so the technique can be applied in different industries to obtain higher segmentation accuracy.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a block diagram of a computing device 100, according to one embodiment of the invention;
FIG. 2 illustrates a flow diagram of a professional vocabulary based Chinese segmentation method 200, according to one embodiment of the present invention; and
FIG. 3 shows a flowchart of constructing a dictionary having a predetermined structure, according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 shows a block diagram of a computing device 100, according to one embodiment of the invention.
In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some implementations, the application 122 can be arranged to execute instructions on the operating system 120 by one or more processors 104 using program data 124.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
Computing device 100 may be implemented as a server, such as a file server, a database server, an application server, or a web server, or as part of a small-form-factor portable (or mobile) electronic device, such as a cellular telephone, a personal digital assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer, including both desktop and notebook computer configurations.
In an implementation according to the invention, the computing device 100 is configured to perform the professional vocabulary based Chinese word segmentation method according to the invention. Specifically, one or more applications 122 of the computing device 100 include instructions for performing the professional vocabulary based Chinese word segmentation method 200 according to the present invention.
FIG. 2 illustrates a flow diagram of a professional vocabulary based Chinese segmentation method 200, according to one embodiment of the present invention.
The method 200 starts in step S210 by constructing a dictionary with a predetermined structure by reading in entries one by one.
According to one embodiment of the present invention, in the constructed dictionary, entries having the same first character are arranged in ascending order of the Unicode code used natively by Java, i.e., in order from "one" to "tortoise".
Because the Unicode range contains some characters that are meaningless here, such as "i" and the like, loading every Unicode code at once without screening would waste space and increase the number of subsequent query matches. Therefore, in the dictionary with the predetermined structure according to the present invention, a plurality of first arrays are established, each storing the entries that share a first character, and at least one second array is established in each first array; each second array stores the content of one entry together with an identification bit that indicates whether the entry belongs to the professional vocabulary. In other words, all entries with the same first character form a word block (i.e., a first array), and each first array holds a plurality of second arrays, each comprising a string constant that stores the content of the entry and an integer constant that stores the identification bit.
Table 1 shows one form of dictionary structure according to an embodiment of the present invention.
TABLE 1
A |
First stage |
One brake |
All at once |
One-five-one-ten |
Nitric oxide |
…… |
Sky |
Sky |
Weather |
Tiananmen |
Chinese arborvitae flower shaped like the Chinese character "ji" |
…… |
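The two-level layout described above can be sketched in Python as a mapping from first character to a sorted list of (entry, identification bit) pairs. This is only an illustration under stated assumptions: the Chinese sample entries, the names `dictionary` and `first_chars`, and the flag values 9 and 1 are hypothetical choices, not the patent's own data.

```python
# Assumed flag values: 9 marks professional vocabulary, 1 marks general vocabulary.
PROFESSIONAL, GENERAL = 9, 1

# Each key is a first character; each value plays the role of a "first array".
# Each (entry, flag) tuple plays the role of a "second array".
# The entries are hypothetical sample data.
dictionary = {
    "天": [("天气", GENERAL), ("天", GENERAL), ("天安门", GENERAL)],
    "一": [("一氧化氮", PROFESSIONAL), ("一", GENERAL), ("一下", GENERAL)],
}

# Python compares strings by Unicode code point, so a plain sort gives the
# ascending Unicode order that binary search relies on.
for first_array in dictionary.values():
    first_array.sort(key=lambda second_array: second_array[0])

# First characters in ascending code-point order.
first_chars = sorted(dictionary)
```

Only first characters that actually occur are stored, which reflects the stated goal of not loading the whole Unicode range.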
Embodiments of the present invention also provide a process of constructing a dictionary having a predetermined structure by reading entries one by one, as shown in fig. 3.
Taking a file stream as an example, in step S310 an input stream is established to read the entries in sequence, and it is determined whether the end of the input stream has been reached. If so, all entries have been read; if not, the following steps are performed.
Then, in step S320, it is determined, for the entry just read, whether a first array already exists for storing entries with the same first character. For example, if the entry read in is "reason", whose first character is rendered here as "way", it is necessary to determine whether a first array with "way" as the first character exists in the current dictionary.
In step S330, if no such first array exists, a first array is created according to the first character of the entry read, to store all entries with that first character. That is, if there is no first array with "way" as the first character, one is created in the dictionary to store all entries whose first character is "way".
Next, in step S340, a second array is established in the first array to store the content of the entry. Of course, if the dictionary already contains a first array whose first character is "way", the process proceeds directly to step S340, where a second array is created in that first array to store the entry "reason".
Then, in step S350, it is determined whether the current entry belongs to the professional vocabulary. If it does, a first numerical value is assigned to its identification bit; if it does not, a second numerical value is assigned; the identification bit is then written into the second array. Optionally, the first numerical value is 00 and the second is 01, or the first is 9 and the second is 1; the identification bit need only distinguish clearly between professional and non-professional vocabulary, and the embodiments of the present invention are not limited in this respect.
Optionally, professional vocabularies of different professional fields can be distinguished by assigning different values to the identification bit. For example, the identification bit is set to 9 for professional vocabulary of the hazardous chemicals industry and to 8 for professional vocabulary of the radio and television industry. The embodiments of the present invention are not limited thereto.
The process then loops back to step S310 to read the next entry, and steps S320 to S350 are repeated until the end of the input stream is reached and the dictionary is completely built.
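Steps S310 to S350 can be sketched as a single loop over an input stream. The following is a minimal sketch, not the patented implementation: the function name `build_dictionary`, the one-entry-per-line format, and the use of a supplied set `professional_terms` to decide whether an entry is professional vocabulary are all assumptions, since the text does not say how that decision is made.

```python
import io

PROFESSIONAL, GENERAL = 9, 1  # assumed flag values (one option named in the text)

def build_dictionary(stream, professional_terms):
    """Read one entry per line and group the entries by first character."""
    grouped = {}
    for line in stream:                       # S310: read until the stream ends
        entry = line.strip()
        if not entry:
            continue
        # S320/S330: reuse the first array for this first character, or create it.
        first_array = grouped.setdefault(entry[0], {})
        # S350: choose the identification bit (assumed membership test).
        flag = PROFESSIONAL if entry in professional_terms else GENERAL
        # S340: store entry content and flag together (the "second array").
        first_array[entry] = flag
    # Sort each first array by code point so it can be binary-searched later.
    return {ch: sorted(entries.items()) for ch, entries in grouped.items()}

# Hypothetical sample input.
sample = io.StringIO("甲烷\n事故\n甲\n三氯甲烷\n甲烷\n")
built = build_dictionary(sample, {"甲烷", "三氯甲烷"})
```

Sorting once at the end keeps the loop simple; inserting each entry at its sorted position while reading would be an equally valid variant.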
Then, in step S220, binary search is used to look up one or more character strings of the sentence to be segmented in the dictionary, to obtain a plurality of to-be-determined words after preliminary segmentation.
According to one implementation of the invention, for a source sentence to be processed, the non-Chinese characters in it are first identified and then removed, yielding the sentence to be segmented. Optionally, the non-Chinese characters include punctuation marks, numeric characters, English characters, and non-visible control characters such as line feed, carriage return and horizontal tab. This provides basic language information for subsequent algorithmic processing and improves processing efficiency.
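The preprocessing step can be illustrated with a regular expression. This sketch assumes that "Chinese character" means a code point in the CJK Unified Ideographs block (U+4E00 to U+9FFF); the patent does not state the exact range, and the function name and sample sentence are hypothetical.

```python
import re

# Assumption: everything outside U+4E00..U+9FFF counts as a non-Chinese character
# (punctuation, digits, Latin letters, whitespace and control characters).
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")

def strip_non_chinese(source_sentence):
    """Return the sentence to be segmented, with non-Chinese characters removed."""
    return NON_CHINESE.sub("", source_sentence)

# Hypothetical sample: full-width punctuation, Latin letters, digits and a line feed.
cleaned = strip_non_chinese("意外伤害:指 ABC 123\n事件。")
```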
Specifically, step S220 may be performed as follows: for each character in the sentence to be segmented, locate, according to the Unicode code of the character, the first array that stores the entries with that character as first character; form at least one character string with the character as its first character, and look up each such string among the entries of the first array by binary search; and, when an entry matching the character string is found, take the character string as a to-be-determined word.
For example, the source sentence to be processed is:
"Insurance benefit plan for group personal accident injury
Accidental injury: refers to suffering an external, sudden, unintended, non-disease objective event that harms the body."
Removing the non-Chinese characters identified in it gives the sentence to be segmented:
"Insurance benefit plan for group personal accident injury Accidental injury refers to suffering an external sudden unintended non-disease objective event that harms the body"
Then, taking the first character of the sentence to be segmented, "group", as an example, the first array that stores entries with "group" as the first character is located in the dictionary, and binary search is used to check whether character strings beginning with "group" exist as entries in that first array. When the entry "group" is found in the first array, the character string "group" is taken as a to-be-determined word. The same search is performed for every other character up to the last one, yielding a plurality of to-be-determined words after the preliminary segmentation.
According to another embodiment of the present invention, if the first array contains an entry consisting only of the character itself, the character is judged to be a complete word (in general, a single character that can stand as a word by itself is a complete word), and the character is then taken as a to-be-determined word.
After the processing in step S220, the following to-be-determined words may be obtained from the above source sentence:
"team, personal, Accident, trauma, injury, insurance, benefit, plan, Accident, trauma, injury, means, suffered, external, sudden, non, instinct, non, disease, cause, physical, injury, objective, incident"
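The lookup in step S220 can be sketched as follows. This is an illustration under assumptions: the names `find_entry` and `candidates`, the cap `max_len` on candidate length, and the sample dictionary are hypothetical; the patent only specifies that each string starting at a character is searched in that character's first array by binary search.

```python
def find_entry(first_array, text):
    """Binary search a sorted list of (entry, flag) pairs; return the flag or None."""
    lo, hi = 0, len(first_array) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        entry, flag = first_array[mid]
        if entry == text:
            return flag
        if entry < text:          # string comparison is by Unicode code point
            lo = mid + 1
        else:
            hi = mid - 1
    return None

def candidates(sentence, dictionary, max_len=8):
    """Return (start, end, flag) for every dictionary entry found in the sentence."""
    found = []
    for start, ch in enumerate(sentence):
        first_array = dictionary.get(ch)   # first array for this first character
        if first_array is None:
            continue
        for end in range(start + 1, min(len(sentence), start + max_len) + 1):
            flag = find_entry(first_array, sentence[start:end])
            if flag is not None:
                found.append((start, end, flag))
    return found

# Hypothetical sample dictionary: flag 9 = professional, 1 = general.
sample_dictionary = {
    "三": [("三", 1), ("三氯甲烷", 9)],
    "氯": [("氯", 1)],
    "甲": [("甲", 1), ("甲烷", 9)],
    "烷": [("烷", 1)],
}
found = candidates("三氯甲烷", sample_dictionary)
```

A single-character entry such as "三" is returned like any other match, which corresponds to treating a character that is a complete word as a to-be-determined word.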
The identification bit of each to-be-determined word is obtained from the second array that stores it, and then, in step S230, a segmentation weight is set for each to-be-determined word according to its identification bit.
According to one implementation, if the identification bit of a to-be-determined word indicates that it belongs to the professional vocabulary, a first segmentation weight is set for it; if the identification bit indicates that it does not belong to the professional vocabulary, a second segmentation weight is set for it, wherein the first segmentation weight is smaller than the second segmentation weight. Optionally, the first segmentation weight is set to 0.5 and the second segmentation weight to 1.
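A minimal sketch of the weight assignment in step S230 follows. The function name and parameters are hypothetical; the default flag values 9 and 8 are taken from the industry examples given earlier and are only one possible encoding, while 0.5 and 1 are the optional weights named in the text.

```python
def segmentation_weight(flag, professional_flags=frozenset({9, 8}),
                        first_weight=0.5, second_weight=1.0):
    """Map an identification bit to a segmentation weight.

    Professional vocabulary receives the smaller first weight, so paths that
    use professional terms become shorter.
    """
    return first_weight if flag in professional_flags else second_weight

weights = [segmentation_weight(flag) for flag in (9, 8, 1)]
```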
Then, in step S240, segmentation paths are constructed from the to-be-determined words and their segmentation weights, and the shortest path is selected as the word segmentation result.
According to one implementation, word segmentation is performed using a shortest path word segmentation algorithm. According to one embodiment of the present invention, the execution process of constructing the splitting path is specifically described as follows:
1) and taking each character in the sentence to be segmented as a node, wherein the first character of the sentence to be segmented is taken as a starting node, and the last character is taken as a termination node.
2) And constructing a plurality of segmentation paths between the starting node and the terminating node in sequence according to the plurality of to-be-determined participles obtained in the step S230.
3) And calculating the length of each segmentation path by combining the segmentation weight of each to-be-determined segmentation word, wherein the length of each segmentation path is obtained by counting the score of the corresponding edge of each word segmented in the path.
If the word segmentation weight is not considered, the edge corresponding to each word counts 1, but if a word is more likely to form words with other words (i.e., contains non-word morphemes), the edge corresponding to the word counts 1 (i.e., counts 2), for example, "min", "true". On this basis, if the edge corresponding to a certain word is counted as x score and the corresponding word segmentation weight is y, after the consideration of the word segmentation weight is added, the score of the corresponding edge is as follows: x y.
4) And selecting a segmentation path with the shortest length as a word segmentation result.
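Steps 1) to 4) can be sketched as a small dynamic program. This is an illustrative formulation, not necessarily the patent's implementation: it places nodes at the boundaries between characters rather than on the characters themselves, and the function name `shortest_segmentation`, the candidate table layout and the Latin letters standing in for Chinese characters are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical candidate table: word -> (base edge score x, segmentation weight y).
Candidates = Dict[str, Tuple[float, float]]

def shortest_segmentation(sentence: str, candidates: Candidates) -> Tuple[float, List[str]]:
    """Return (length, words) of the shortest segmentation path."""
    n = len(sentence)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[i]: shortest length covering sentence[:i]
    back = [0] * (n + 1)    # back[i]: start index of the last word on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            word = sentence[start:end]
            if word in candidates and best[start] < inf:
                x, y = candidates[word]
                length = best[start] + x * y  # edge score is x*y
                if length < best[end]:
                    best[end], back[end] = length, start
    if best[n] == inf:
        raise ValueError("no segmentation path covers the sentence")
    words: List[str] = []
    i = n
    while i > 0:  # walk the back-pointers from the end of the sentence
        words.append(sentence[back[i]:i])
        i = back[i]
    return best[n], words[::-1]

# Toy data: "B" and "D" score 2 as non-word morphemes; "CD" and "ABCD" are
# professional words with weight 0.5.
toy = {"A": (1, 1.0), "B": (2, 1.0), "C": (1, 1.0), "D": (2, 1.0),
       "CD": (1, 0.5), "ABCD": (1, 0.5)}
result = shortest_segmentation("ABCD", toy)  # (0.5, ['ABCD'])
```

The toy scores mirror the pattern of the description: splitting into single characters costs 1+2+1+2 = 6, using the two-character professional word costs 1+2+0.5 = 3.5, and the four-character professional word alone costs 0.5.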
Suppose the sentence to be segmented is: the accident of chloroform leakage of one tank car in a certain city of a certain province.
Taking each character as a node, a plurality of segmentation paths are constructed between the starting node and the termination node as follows:
① a province/a city/one/tank car/tri/chloro/methyl/alkyl/leakage/accident
② a province/a city/one/tank car/tri/chloro/methane/leakage/accident
③ a province/a city/one/tank car/chloroform/leakage/accident
Here, methane and chloroform (trichloromethane) belong to the professional vocabulary and take the first word segmentation weight (e.g., 0.5), while the other words take the second word segmentation weight (e.g., 1).
The lengths of the three segmentation paths are, respectively:
①1+1+1+1+1+2+1+2+1+1=12;
②1+1+1+1+1+2+1*0.5+1+1=9.5;
③1+1+1+1+1*0.5+1+1=6.5.
In summary, the third segmentation path has the shortest length, so its segmentation is selected as the word segmentation result.
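The three sums can be checked mechanically. In the sketch below each edge is written as a pair (x, y) of base score and word segmentation weight, in the same order as the terms of the sums above; the variable and function names are illustrative only.

```python
# Each edge is (base score x, segmentation weight y); its contribution is x*y.
path_1 = [(1, 1)] * 5 + [(2, 1), (1, 1), (2, 1), (1, 1), (1, 1)]
path_2 = [(1, 1)] * 5 + [(2, 1), (1, 0.5), (1, 1), (1, 1)]
path_3 = [(1, 1)] * 4 + [(1, 0.5), (1, 1), (1, 1)]

def path_length(edges):
    """Length of a segmentation path: the sum of x*y over its edges."""
    return sum(x * y for x, y in edges)

lengths = [path_length(p) for p in (path_1, path_2, path_3)]
shortest = lengths.index(min(lengths)) + 1  # 1-based path number
```

Here `lengths` evaluates to 12, 9.5 and 6.5 and `shortest` to 3, in agreement with the selection of the third path.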
According to the Chinese word segmentation scheme based on professional vocabulary of the present invention, an identification bit indicating whether an entry is a professional word is added when the dictionary is built. During word segmentation, a smaller word segmentation weight is then set for each word to be determined that is identified as a professional word, the length of each segmentation path is calculated from the word segmentation weights, and the shortest path is selected as the word segmentation result. This scoring mechanism resolves the choice among candidate paths: it handles overlapping (cross) ambiguity better and raises the recognition rate of professional vocabulary in specialized fields, so that the technique can achieve higher word segmentation accuracy when applied in different industries.
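The dictionary layout and lookup that this scheme relies on (first arrays grouping entries by first character in ascending Unicode order, second arrays holding entry content plus identification bit, and binary search for candidate strings, as recited in the claims) might be sketched as follows. This is a sketch under stated assumptions: a Python dict keyed by first character stands in for locating the first array by Unicode code, the flag encoding (1 for professional, 0 otherwise) is assumed, `max_len` is a hypothetical cap on candidate length, and Latin letters stand in for Chinese characters.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple

Entry = Tuple[str, int]  # "second array": (entry content, identification bit)

def build_dictionary(entries: List[Entry]) -> Dict[str, List[Entry]]:
    """Group entries by first character ("first arrays"), sorted by code point."""
    first_arrays: Dict[str, List[Entry]] = {}
    for word, flag in entries:
        first_arrays.setdefault(word[0], []).append((word, flag))
    for bucket in first_arrays.values():
        bucket.sort()  # Python compares strings by Unicode code point
    return first_arrays

def lookup(first_arrays: Dict[str, List[Entry]], text: str) -> Optional[int]:
    """Binary-search `text` in the first array of its first character."""
    bucket = first_arrays.get(text[:1], [])
    i = bisect_left(bucket, (text,))  # (text,) sorts just before (text, flag)
    if i < len(bucket) and bucket[i][0] == text:
        return bucket[i][1]
    return None

def find_candidates(first_arrays: Dict[str, List[Entry]], sentence: str,
                    max_len: int = 4) -> List[Entry]:
    """Collect every dictionary word starting at each character of the sentence."""
    found: List[Entry] = []
    for start in range(len(sentence)):
        for end in range(start + 1, min(len(sentence), start + max_len) + 1):
            flag = lookup(first_arrays, sentence[start:end])
            if flag is not None:
                found.append((sentence[start:end], flag))
    return found

dictionary = build_dictionary(
    [("ABCD", 1), ("A", 0), ("B", 0), ("C", 0), ("CD", 1), ("D", 0)]
)
words_to_determine = find_candidates(dictionary, "ABCD")
```

Each returned pair carries the identification bit alongside the word, so the weight for a word to be determined can be set without a second lookup.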
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or groups of devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. Modules or units or groups in embodiments may be combined into one module or unit or group and may furthermore be divided into sub-modules or sub-units or sub-groups. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code, and the processor is configured to perform the method of the invention according to the instructions in the program code stored in the memory.
By way of example, and not limitation, computer-readable media include both computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules or other data. Communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.
Claims (9)
1. A Chinese word segmentation method based on professional vocabulary, the method being adapted to be executed in a computing device and comprising the steps of:
constructing a dictionary having a predetermined structure by reading in entries one by one, wherein entries in the dictionary that share the same first character are arranged in ascending order of Unicode code, a plurality of first arrays are established, each for storing the entries that share a first character, and at least one second array is established in each first array for storing entry content and an identification bit, the identification bit identifying whether the entry belongs to the professional vocabulary;
searching the dictionary for one or more character strings of the sentence to be segmented using a binary search method to obtain a plurality of words to be determined after preliminary segmentation;
setting a word segmentation weight for each word to be determined according to its identification bit, comprising: if the identification bit of the word to be determined indicates that it belongs to the professional vocabulary, setting a first word segmentation weight for it, and if the identification bit indicates that it does not belong to the professional vocabulary, setting a second word segmentation weight for it, wherein the first word segmentation weight is smaller than the second word segmentation weight; and
constructing segmentation paths according to the plurality of words to be determined and their word segmentation weights, and selecting the shortest path as the word segmentation result.
2. The method of claim 1, wherein the step of constructing segmentation paths according to the plurality of words to be determined and their word segmentation weights and selecting the shortest path as the word segmentation result comprises:
taking each character in the sentence to be segmented as a node, wherein the first character of the sentence to be segmented is the starting node and the last character of the sentence to be segmented is the termination node;
sequentially constructing a plurality of segmentation paths between the starting node and the termination node according to the words to be determined;
calculating the length of each segmentation path using the word segmentation weight of each word to be determined; and
selecting the segmentation path with the shortest length as the word segmentation result.
3. The method according to claim 1 or 2, wherein the step of constructing a dictionary having a predetermined structure by reading in entries one by one comprises:
establishing an input stream to read entries in sequence;
determining whether a first array exists for storing entries whose first character is the first character of the entry read;
if the first array does not exist, creating, according to the first character of the entry read, a first array for storing all entries having that first character;
establishing a second array in the first array to store the entry content;
determining whether the entry belongs to the professional vocabulary and, if so, assigning a first value to the identification bit of the entry; and
if the entry does not belong to the professional vocabulary, assigning a second value to the identification bit.
4. The method as claimed in claim 1 or 2, wherein before the step of searching the dictionary for one or more character strings of the sentence to be segmented using the binary search method to obtain the plurality of words to be determined after preliminary segmentation, the method further comprises the steps of:
identifying non-Chinese characters in a source sentence to be processed; and
removing the identified non-Chinese characters from the source sentence to be processed to obtain the sentence to be segmented.
5. The method of claim 4, wherein the non-Chinese characters include punctuation, numeric characters, English characters, non-visible characters that ignore actions.
6. The method as claimed in claim 1 or 2, wherein the step of searching the dictionary for one or more character strings of the sentence to be segmented using a binary search method to obtain a plurality of words to be determined after preliminary segmentation comprises:
for each character in the sentence to be segmented:
locating, according to the Unicode code of the character, the first array that stores entries having the character as their first character;
forming at least one character string having the character as its first character, and searching all entries of the first array for the character string by binary search; and
when an entry corresponding to the character string is found, taking the character string as a word to be determined.
7. The method of claim 6, wherein the step of forming at least one character string having the character as its first character and searching all entries of the first array for the character string by binary search further comprises:
if an entry consisting only of the character exists in the first array, determining that the character is a word on its own; and
taking the character as a word to be determined.
8. A computing device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-7.
9. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810050618.7A CN108170682B (en) | 2018-01-18 | 2018-01-18 | Chinese word segmentation method based on professional vocabulary and computing equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810050618.7A CN108170682B (en) | 2018-01-18 | 2018-01-18 | Chinese word segmentation method based on professional vocabulary and computing equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108170682A (en) | 2018-06-15 |
CN108170682B (en) | 2021-09-07 |
Family
ID=62515230
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810050618.7A Expired - Fee Related CN108170682B (en) | 2018-01-18 | 2018-01-18 | Chinese word segmentation method based on professional vocabulary and computing equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108170682B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110825608B (en) * | 2018-08-08 | 2024-08-16 | 北京京东尚科信息技术有限公司 | Critical semantic testing method and device, storage medium and electronic equipment |
CN109522740B (en) * | 2018-10-16 | 2021-04-20 | 易保互联医疗信息科技(北京)有限公司 | Health data privacy removal processing method and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6879951B1 (en) * | 1999-07-29 | 2005-04-12 | Matsushita Electric Industrial Co., Ltd. | Chinese word segmentation apparatus |
CN103838794A (en) * | 2012-11-27 | 2014-06-04 | 大连灵动科技发展有限公司 | Word segmentation method suitable for specialized search engine |
CN105159949A (en) * | 2015-08-12 | 2015-12-16 | 北京京东尚科信息技术有限公司 | Chinese address word segmentation method and system |
Non-Patent Citations (1)
Title |
---|
"基于N-最短路径方法的中文词语粗分模型";张华平等;《中文信息学报》;20020925;第16卷(第5期);第2.1、2.2节,第四章,图1 * |
Also Published As
Publication number | Publication date |
---|---|
CN108170682A (en) | 2018-06-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9798393B2 (en) | Text correction processing | |
CN107193921B (en) | Method and system for correcting error of Chinese-English mixed query facing search engine | |
US9223779B2 (en) | Text segmentation with multiple granularity levels | |
CN101388012B (en) | Phonetic check system and method with easy confusion tone recognition | |
CN102929870B (en) | A kind of set up the method for participle model, the method for participle and device thereof | |
CN107977347B (en) | Topic duplication removing method and computing equipment | |
US20100180199A1 (en) | Detecting name entities and new words | |
US20040243408A1 (en) | Method and apparatus using source-channel models for word segmentation | |
CN110795628B (en) | Search term processing method and device based on correlation and computing equipment | |
CN111651990B (en) | Entity identification method, computing device and readable storage medium | |
CN107967256B (en) | Word weight prediction model generation method, position recommendation method and computing device | |
CN110083681B (en) | Searching method, device and terminal based on data analysis | |
CN111930929A (en) | Article title generation method and device and computing equipment | |
US8725497B2 (en) | System and method for detecting and correcting mismatched Chinese character | |
US20110258202A1 (en) | Concept extraction using title and emphasized text | |
CN113435186A (en) | Chinese text error correction system, method, device and computer readable storage medium | |
CN111832299A (en) | Chinese word segmentation system | |
CN109086266B (en) | Error detection and correction method for text-shaped near characters | |
JP2014186395A (en) | Document preparation support device, method, and program | |
US8306329B2 (en) | System and method for searching handwritten texts | |
CN108170682B (en) | Chinese word segmentation method based on professional vocabulary and computing equipment | |
CN114861635B (en) | Chinese spelling error correction method, device, equipment and storage medium | |
JP2017004127A (en) | Text segmentation program, text segmentation device, and text segmentation method | |
CN109189907A (en) | A kind of search method and device based on semantic matches | |
AlGahtani et al. | Arabic part-of-speech tagging using transformation-based learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20210907 |