CN116933783A - Scientific and technological vocabulary word segmentation method and system based on maximum likelihood probability - Google Patents


Info

Publication number
CN116933783A
Authority
CN
China
Prior art keywords
word
word segmentation
segmented
input character
maximum likelihood
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310690365.0A
Other languages
Chinese (zh)
Inventor
何军
赵燕
胡俊松
徐旻昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai R&D Public Service Platform Management Center
Original Assignee
Shanghai R&D Public Service Platform Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai R&D Public Service Platform Management Center
Priority to CN202310690365.0A
Publication of CN116933783A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of word segmentation algorithms and provides a scientific and technological vocabulary word segmentation method based on maximum likelihood probability, comprising the following steps. S1: acquire an offline dictionary in the same field as the scientific and technological vocabulary input string to be segmented, the offline dictionary having two columns, word and word frequency. S2: construct, in memory, a prefix dictionary for the input string to be segmented based on the offline dictionary. S3: segment the input string based on the prefix dictionary to construct a directed acyclic graph. S4: obtain all word segmentation paths of the input string from the directed acyclic graph, and take the word segmentation path with the maximum likelihood probability as the word segmentation result of the input string. Because the optimal segmentation path is calculated from the maximum likelihood probability, an optimal word segmentation result is obtained.

Description

Scientific and technological vocabulary word segmentation method and system based on maximum likelihood probability
Technical Field
The application relates to the technical field of word segmentation algorithms, in particular to a scientific and technological vocabulary word segmentation method and system based on maximum likelihood probability.
Background
Word segmentation is the process of dividing text into individual words or phrases. In natural language processing it is an important preprocessing step that strongly affects subsequent text processing and analysis tasks. Its significance varies with the purpose and application. Some common uses are as follows:
Language understanding and semantic analysis: decomposing a passage into words makes the structure and semantics of sentences easier to understand. This supports natural language processing tasks such as part-of-speech tagging, syntactic analysis and semantic role labeling, which extract further semantic information and context.
Information retrieval and search: in information retrieval and search engines, segmenting a query sentence splits it into separate keywords so that related documents or web pages can be matched and retrieved more accurately. This helps to improve the relevance and accuracy of the search results.
Machine translation: word segmentation is critical to machine translation. Decomposing the source-language sentence into words allows them to be matched to words or phrases in the target language, which helps the machine translation system translate accurately.
Text mining and information extraction: word segmentation provides a basis for text mining and information extraction. Once the text is segmented into words, keywords, entity nouns, phrases and similar information can be identified more reliably, which helps to mine and extract specific information.
Text classification and sentiment analysis: in text classification and sentiment analysis, word segmentation converts the text into discrete feature representations for classification, sentiment judgment and other tasks. Segmenting text into words provides richer feature information and helps to improve the accuracy of classification and sentiment analysis.
In summary, word segmentation turns continuous text into discrete words, thereby providing more accurate and richer language expressions and feature representations for subsequent natural language processing tasks.
In the prior art, predefined rules are typically used to segment sentences, for example splitting on spaces, punctuation marks or specific delimiters. This approach is simple and direct, but because a word can be split at several different prefixes and the rules do not search for the best of these splits, the segmentation result is not optimal.
Disclosure of Invention
In view of the above problems, the application aims to provide a scientific and technological vocabulary word segmentation method and system based on maximum likelihood probability, which calculate the optimal segmentation path based on the maximum likelihood probability and thereby obtain the optimal word segmentation result.
The above object of the present application is achieved by the following technical solutions:
A scientific and technological vocabulary word segmentation method based on maximum likelihood probability comprises the following steps:
S1: acquiring an offline dictionary in the same field as the scientific and technological vocabulary input string to be segmented, wherein the offline dictionary has two columns, word and word frequency;
S2: constructing, in memory, a prefix dictionary for the input string to be segmented based on the offline dictionary;
S3: segmenting the input string to be segmented based on the prefix dictionary to construct a directed acyclic graph;
S4: obtaining all word segmentation paths of the input string to be segmented based on the directed acyclic graph, and taking the word segmentation path with the maximum likelihood probability among them as the word segmentation result of the input string.
Further, before step S1, the method further includes establishing the offline dictionary, specifically:
S11: obtaining a large-scale text corpus, wherein the text corpus comprises text data from different fields;
S12: preprocessing the text corpus, wherein the preprocessing comprises removing punctuation marks, special characters and numbers;
S13: segmenting the preprocessed text corpus into words to obtain a word segmentation result;
S14: traversing each word in the word segmentation result and counting the word frequency of each word;
S15: storing each word in the word segmentation result together with its word frequency, in one-to-one correspondence.
Further, in step S2, constructing the prefix dictionary for the input string to be segmented in memory based on the offline dictionary is specifically:
S21: sequentially obtaining each word of the input string to be segmented;
S22: obtaining all prefixes of each word;
S23: looking up in the offline dictionary the word frequency of every prefix of each word, wherein the frequency is taken from the offline dictionary when the prefix is in the offline dictionary and is 0 when the prefix is not in the offline dictionary.
Further, in step S3, segmenting the input string to be segmented based on the prefix dictionary to construct the directed acyclic graph is specifically:
a word in the input string that has no prefixes has only one segmentation mode and forms an independent word; for a word in the input string that has prefixes, all of its segmentation modes are listed;
the internal structure of the directed acyclic graph is as follows:
0: [q_1, q_2, ..., q_n];
1: [q_1, q_2, ..., q_n];
...
m-1: [q_1, q_2, ..., q_n];
wherein 0 to m-1 are the positions of the individual characters in the sentence formed by the input string to be segmented, starting at 0 and increasing by 1 up to the last position m-1, m being the number of characters; q_1 to q_n are the spans of the candidate words that begin at the current character, and n is the number of such candidate words.
Further, in step S4, obtaining all word segmentation paths of the input string to be segmented based on the directed acyclic graph and taking the path with the maximum likelihood probability as the word segmentation result is specifically:
path planning is performed on the directed acyclic graph with a dynamic programming algorithm for path optimization, comprising a reverse search and a forward solution;
the reverse search proceeds from the end point of the directed acyclic graph toward its start point; at each search position, the forward solution calculates the weights of all word segmentation paths from the character at that position to the end point, the maximum likelihood probability from that position to the end point is obtained from these weights, and the word segmentation path corresponding to the maximum likelihood probability is taken as the final word segmentation path from that position to the end point;
the reverse search then continues toward the start point; at the next search position, the forward solution calculates the weights of all word segmentation paths from that position to the end point, reusing the final word segmentation path already determined at the previous search position, the maximum likelihood probability from that position to the end point is obtained from these weights, and the corresponding path is taken as the final word segmentation path from that position to the end point; this is repeated until the start point is reached, which yields the final word segmentation path of the complete input string to be segmented;
the final word segmentation path of the complete input string is taken as the word segmentation result of the input string to be segmented.
Further, the weight of a word segmentation path is calculated as follows:
w_i = ln(f_i / f_all)
wherein w_i is the weight of each arrow (span) in the directed acyclic graph, f_i is the word frequency of the corresponding word, f_all is the total word frequency in the prefix dictionary, and ln denotes the natural logarithm, which is taken to keep the weight values from becoming too small;
route = w_1 + w_2 + ... + w_n = ∑ w_i
the weight route of a word segmentation path is the sum of the weights of all words on the path from its head to its tail.
Further, the maximum likelihood probability is obtained specifically by:
taking the maximum of the weights of all word segmentation paths as the maximum likelihood probability from the character at the current search position to the end point.
A scientific and technological vocabulary word segmentation system based on maximum likelihood probability, for performing the above method, comprises:
an offline dictionary acquisition module, configured to acquire an offline dictionary in the same field as the scientific and technological vocabulary input string to be segmented, wherein the offline dictionary has two columns, word and word frequency;
a prefix dictionary construction module, configured to construct, in memory, a prefix dictionary for the input string to be segmented based on the offline dictionary;
a directed acyclic graph construction module, configured to segment the input string to be segmented based on the prefix dictionary to construct a directed acyclic graph;
a word segmentation result output module, configured to obtain all word segmentation paths of the input string to be segmented based on the directed acyclic graph and to take the word segmentation path with the maximum likelihood probability among them as the word segmentation result of the input string.
A computer device comprising a memory and one or more processors, the memory having stored therein computer code which, when executed by the one or more processors, causes the one or more processors to perform a method as described above.
A computer readable storage medium storing computer code which, when executed, performs a method as described above.
Compared with the prior art, the application has at least one of the following beneficial effects:
(1) The scientific and technological vocabulary word segmentation method based on maximum likelihood probability comprises: S1: acquiring an offline dictionary in the same field as the scientific and technological vocabulary input string to be segmented, the offline dictionary having two columns, word and word frequency; S2: constructing, in memory, a prefix dictionary for the input string based on the offline dictionary; S3: segmenting the input string based on the prefix dictionary to construct a directed acyclic graph; S4: obtaining all word segmentation paths of the input string from the directed acyclic graph and taking the path with the maximum likelihood probability as the word segmentation result. In this technical solution, the prefix dictionary is built from an existing offline dictionary that includes word frequencies, a likelihood probability is calculated for each path, and the optimal segmentation path is selected to obtain the optimal word segmentation result.
(2) Because the word segmentation paths are built from an offline dictionary with word frequencies, the word segmentation result is closer to actual word usage and is therefore more accurate.
Drawings
FIG. 1 is a general flow chart of the scientific and technological vocabulary word segmentation method based on maximum likelihood probability according to the application;
FIG. 2 is the directed acyclic graph constructed for "measurement of crude fiber in plant food" according to the application;
FIG. 3 is the directed acyclic graph constructed for "measurement of vegetable protein" according to the application;
FIG. 4 shows the maximum likelihood probability calculation result for "measurement of vegetable protein" according to the application;
FIG. 5 shows the maximum likelihood probability calculation result for "measurement of crude fiber in plant food" according to the application;
FIG. 6 is an overall block diagram of the scientific and technological vocabulary word segmentation method based on maximum likelihood probability according to the application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the application clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the application. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
First embodiment
As shown in FIG. 1, this embodiment provides a scientific and technological vocabulary word segmentation method based on maximum likelihood probability, comprising the following steps:
S1: Acquire an offline dictionary in the same field as the scientific and technological vocabulary input string to be segmented, the offline dictionary having two columns, word and word frequency.
Specifically, the offline dictionary may be an existing one obtained from another source, or one built in-house. An offline dictionary can be built as follows:
S11: Obtain a large-scale text corpus containing text data from different fields. The corpus may come from news, books, Wikipedia or other sources.
S12: Preprocess the text corpus by removing punctuation marks, special characters and numbers. This can be done with conventional text processing tools or programming languages.
S13: Segment the preprocessed text corpus into words to obtain a word segmentation result, for example with a maximum matching algorithm or a machine-learning-based method such as a conditional random field (CRF) or a recurrent neural network (RNN).
S14: Traverse each word in the word segmentation result and count its word frequency. A hash table, dictionary or other data structure may be used to record the words and their frequencies.
S15: Store each word in the word segmentation result together with its word frequency, in one-to-one correspondence.
In addition, the offline dictionary can be optimized as required, for example by removing low-frequency words (words whose frequency is too low may be noise or unimportant information and are deleted) and by merging words (adjacent words that frequently occur together are combined into phrases or proper nouns, which improves segmentation accuracy). The optimized dictionary is stored in a suitable format such as a text file, binary file or database; common serialization formats such as JSON, CSV or pickle may be used. When word segmentation is needed, the offline dictionary is loaded into memory for fast lookup.
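Steps S11 to S15 and the low-frequency pruning can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function names, the `min_freq` threshold, the regular expression used for preprocessing and the choice of JSON for storage are all assumptions, and the word segmenter of step S13 is passed in as a callable because the source leaves its choice open.

```python
import json
import re
from collections import Counter

# Assumption: anything that is not a CJK ideograph, a Latin letter or whitespace
# counts as punctuation, a special character or a digit and is removed (S12).
_NON_TEXT = re.compile(r"[^\u4e00-\u9fffA-Za-z\s]")


def preprocess(text):
    """S12: replace punctuation, special characters and digits with spaces."""
    return _NON_TEXT.sub(" ", text)


def build_offline_dictionary(corpus, segment, min_freq=1):
    """S13-S15: segment each document, count word frequencies, drop rare words.

    `segment` is any callable mapping a string to a list of words
    (hypothetical stand-in for maximum matching, CRF, RNN, ...).
    """
    counts = Counter()
    for document in corpus:
        counts.update(segment(preprocess(document)))
    # Optional optimization: discard words below the frequency threshold.
    return {word: freq for word, freq in counts.items() if freq >= min_freq}


def save_dictionary(dictionary, path):
    """Persist the word -> frequency mapping as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(dictionary, handle, ensure_ascii=False)


def load_dictionary(path):
    """Load the dictionary back into memory for fast lookup."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For a corpus that is already separated by whitespace, `build_offline_dictionary(corpus, str.split, min_freq=2)` would keep only words that occur at least twice.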
S2: Construct, in memory, a prefix dictionary for the input string to be segmented based on the offline dictionary, specifically:
S21: Sequentially obtain each word of the input string to be segmented.
S22: Obtain all prefixes of each word.
S23: Look up in the offline dictionary the word frequency of every prefix of each word; the frequency is taken from the offline dictionary when the prefix is in the offline dictionary and is 0 when it is not.
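A minimal sketch of steps S21 to S23 is given below. It assumes that prefixes are generated for every word of the offline dictionary, which is one reading of the steps above, and that the total word frequency is accumulated at the same time for later use; the function name `build_prefix_dictionary` is hypothetical.

```python
def build_prefix_dictionary(offline_dictionary):
    """Return (prefix_dictionary, total_frequency).

    Every word keeps its offline frequency; every proper prefix that is not
    itself a dictionary word is recorded with frequency 0.
    """
    prefix_dictionary = {}
    total = 0
    for word, freq in offline_dictionary.items():
        prefix_dictionary[word] = freq
        total += freq
        # Proper prefixes: word[:1], word[:2], ..., word[:len(word) - 1].
        for end in range(1, len(word)):
            # setdefault leaves the frequency of real words untouched.
            prefix_dictionary.setdefault(word[:end], 0)
    return prefix_dictionary, total
```

With the placeholder dictionary `{"ab": 5, "abcd": 3}`, the prefixes "a" and "abc" are added with frequency 0, while "ab" keeps its frequency of 5.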
S3: Segment the input string to be segmented based on the prefix dictionary and construct a directed acyclic graph, specifically:
A word in the input string that has no prefixes has only one segmentation mode and forms an independent word; for a word in the input string that has prefixes, all of its segmentation modes are listed.
The internal structure of the directed acyclic graph is as follows:
0: [q_1, q_2, ..., q_n];
1: [q_1, q_2, ..., q_n];
...
m-1: [q_1, q_2, ..., q_n];
where 0 to m-1 are the positions of the individual characters in the sentence formed by the input string to be segmented, starting at 0 and increasing by 1 up to the last position m-1, m being the number of characters; q_1 to q_n are the spans of the candidate words that begin at the current character, and n is the number of such candidate words.
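The construction of this structure can be sketched as below, assuming a prefix dictionary that maps each word or prefix to its frequency (0 for prefixes that are not words). The function name `build_dag` is hypothetical, and the fallback for characters absent from the dictionary is an assumption that the source does not address.

```python
def build_dag(sentence, prefix_dictionary):
    """Map each start position to the inclusive end positions of candidate words."""
    dag = {}
    for start in range(len(sentence)):
        ends = []
        end = start
        fragment = sentence[start]
        # Keep extending while the fragment is a known word or prefix.
        while end < len(sentence) and fragment in prefix_dictionary:
            if prefix_dictionary[fragment] > 0:  # a real word, not only a prefix
                ends.append(end)
            end += 1
            fragment = sentence[start:end + 1]
        if not ends:
            # Assumption: an unknown character forms a one-character word.
            ends.append(start)
        dag[start] = ends
    return dag
```

Because a zero-frequency prefix keeps the inner loop running without adding an edge, a long word is still found when one of its intermediate prefixes is not a word.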
S4: Obtain all word segmentation paths of the input string to be segmented based on the directed acyclic graph, and take the word segmentation path with the maximum likelihood probability among them as the word segmentation result of the input string, specifically:
Path planning is performed on the directed acyclic graph with a dynamic programming algorithm for path optimization, comprising a reverse search and a forward solution.
The reverse search proceeds from the end point of the directed acyclic graph toward its start point. At each search position, the forward solution calculates the weights of all word segmentation paths from the character at that position to the end point, the maximum likelihood probability from that position to the end point is obtained from these weights, and the word segmentation path corresponding to the maximum likelihood probability is taken as the final word segmentation path from that position to the end point.
The reverse search then continues toward the start point. At the next search position, the forward solution calculates the weights of all word segmentation paths from that position to the end point, reusing the final word segmentation path already determined at the previous search position; the maximum likelihood probability from that position to the end point is obtained from these weights, and the corresponding path is taken as the final word segmentation path from that position to the end point. This is repeated until the start point is reached, which yields the final word segmentation path of the complete input string to be segmented.
The final word segmentation path of the complete input string is taken as the word segmentation result of the input string to be segmented.
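The reverse search with forward solution can be sketched as a dynamic program over the directed acyclic graph. The sketch assumes that each span is weighted by the natural logarithm of its word frequency divided by the total frequency, and that a missing or zero frequency is treated as 1 so that the logarithm is defined; both the function names and the zero-frequency handling are assumptions.

```python
import math


def best_route(sentence, dag, prefix_dictionary, total):
    """For each start position, store (best weight to the end, chosen end position)."""
    n = len(sentence)
    log_total = math.log(total)
    route = {n: (0.0, n)}  # empty path beyond the last character
    # Reverse search: from the last position back to position 0.
    for start in range(n - 1, -1, -1):
        candidates = []
        for end in dag[start]:
            word = sentence[start:end + 1]
            # Assumption: a missing or zero frequency counts as 1.
            weight = math.log(prefix_dictionary.get(word) or 1) - log_total
            # Reuse the best path already determined from end + 1 to the end point.
            candidates.append((weight + route[end + 1][0], end))
        route[start] = max(candidates)
    return route


def cut(sentence, route):
    """Follow the chosen end positions from the start point to read off the words."""
    words, start = [], 0
    while start < len(sentence):
        end = route[start][1]
        words.append(sentence[start:end + 1])
        start = end + 1
    return words
```

Each position is solved once and later positions are reused, so the cost grows with the number of edges in the graph rather than with the number of paths.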
In step S4, the weight of a word segmentation path is calculated as follows:
w_i = ln(f_i / f_all)
where w_i is the weight of each arrow (span) in the directed acyclic graph, f_i is the word frequency of the corresponding word, f_all is the total word frequency in the prefix dictionary, and ln denotes the natural logarithm, which is taken to keep the weight values from becoming too small;
route = w_1 + w_2 + ... + w_n = ∑ w_i
The weight route of a word segmentation path is the sum of the weights of all words on the path from its head to its tail.
In step S4, the maximum likelihood probability is obtained specifically by:
taking the maximum of the weights of all word segmentation paths as the maximum likelihood probability from the character at the current search position to the end point.
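The link between the summed weights and a likelihood can be written out as follows. The derivation assumes that each weight has the form w_i = ln(f_i / f_all), as the symbol definitions suggest, and that the words on a path are treated as independent; the independence reading is an interpretation and is not stated in the source.

```latex
\begin{aligned}
\mathrm{route}(p) &= \sum_{i=1}^{n} w_i
  = \sum_{i=1}^{n} \ln\frac{f_i}{f_{all}}
  = \ln \prod_{i=1}^{n} \frac{f_i}{f_{all}}, \\
p^{*} &= \arg\max_{p}\, \mathrm{route}(p)
  = \arg\max_{p} \prod_{i=1}^{n} \frac{f_i}{f_{all}}.
\end{aligned}
```

Since the logarithm is monotonic, the path with the largest summed weight is also the path with the largest product of relative frequencies, and working with sums avoids multiplying many very small numbers.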
Second embodiment
This embodiment is a specific example of the scientific and technological vocabulary word segmentation method based on maximum likelihood probability described in the first embodiment, as follows:
Take "measurement of crude fiber in plant food" and "measurement of vegetable protein" as two scientific and technological vocabulary input strings to be segmented, and acquire an offline dictionary in the same field as both. The offline dictionary has two columns: the first is the word and the second is the word frequency.
For reasons of space, only the entries of the offline dictionary that relate to the two input strings "measurement of crude fiber in plant food" and "measurement of vegetable protein" are excerpted below:
...
plant 560
Object 8620
In 243191
Plant 7735
Crude 2598
Fiber 403
Food-like product 3
Dimension 1685
Of 318825
Coarse fiber 3
Fiber 1879
Measurement 1768
Measurement 2083
Fixed 15882
Class 14536
Food 6610
Food 6350
Product 2278
Species 23
Egg 3862
White 12266
Protein 1087
Vegetable protein 3
...
A prefix dictionary is constructed in memory based on the offline dictionary. For example, the three-character word "crude fiber" in the offline dictionary has two prefixes, its first character ("crude") and its first two characters, and the word "food" has a single prefix. In the resulting online (in-memory) prefix dictionary, the two-character prefix of "crude fiber" is not in the offline dictionary, so its word frequency is 0. Likewise, the four-character word "vegetable protein" has three prefixes: its first character, its first two characters ("plant") and its first three characters. The three-character prefix is not in the offline dictionary, so its word frequency in the online prefix dictionary is 0. Recording these zero-frequency prefixes facilitates the construction of the directed acyclic graph.
The relevant entries of the online prefix dictionary are excerpted below:
...
plant 560
Object 8620
In 243191
Plant 7735
Crude 2598
Fiber 403
Food 0
Food-like product 3
Dimension 1685
Of 318825
Coarse fiber 3
Coarse fiber 0
Fiber 1879
Measurement 1768
Measurement 2083
Fixed 15882
Class 14536
Food 6610
Food 6350
Product 2278
Species 23
Egg 3862
White 12266
Protein 1087
Vegetable protein 3
Plant egg 0
...
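The prefix dictionary construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `build_prefix_dict` is hypothetical, and the placeholder letters `A`, `B`, `C`, `D` stand in for the four characters of "vegetable protein", because the characters themselves are not reproduced in this text. The frequencies are taken from the excerpt; which placeholder each gloss belongs to is also an assumption.

```python
def build_prefix_dict(offline):
    """Build a prefix dictionary and the total word frequency.

    `offline` maps each word to its frequency, i.e. the two columns
    of the offline dictionary.
    """
    prefix_dict = {}
    total = 0
    for word, freq in offline.items():
        prefix_dict[word] = freq
        total += freq
        # Enter every proper prefix; a prefix that is not itself a word
        # keeps frequency 0.
        for length in range(1, len(word)):
            prefix_dict.setdefault(word[:length], 0)
    return prefix_dict, total


# Placeholder letters (assumption): ABCD is "vegetable protein",
# AB is the two-character word "plant", CD is "protein".
offline = {"A": 560, "B": 8620, "AB": 7735, "C": 3862,
           "D": 12266, "CD": 1087, "ABCD": 3}
prefix_dict, total = build_prefix_dict(offline)

print(prefix_dict["ABC"])  # 0: the "plant egg" prefix is not a word
print(prefix_dict["AB"])   # 7735: a prefix that is also a word keeps its frequency
```

Note that `total` here sums only the excerpted rows; in the method as described, the total word frequency is taken over the whole dictionary.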
Based on the online prefix dictionary, the input character string "measurement of crude fiber in plant food" is cut. The characters "in" and "of" do not begin any longer word, so each can be cut in only one way, as an independent word. "Plant" and "measurement" can each be cut in two ways: as the first character alone, or as the two-character word. "Food" and "crude fiber" each yield three candidate words: the two single characters and the word "food"; and "crude", "fiber", and "crude fiber".
FIG. 2 shows the directed acyclic graph constructed for "measurement of crude fiber in plant food". The internal data structure of the directed acyclic graph is as follows:
0:[0,1]
1:[1,2]
2:[2,4]
3:[3,4]
4:[4]
5:[5]
6:[6,8]
7:[7,8]
8:[8]
9:[9]
10:[10,11]
11:[11]
The digits represent character positions in the sentence. The number before the colon is the primary key; it starts at 0 and increments by 1 up to the last position of the sentence. For example, 0 denotes the first character ("plant") and 11 denotes the last character ("fixed"). The bracketed content after the colon lists the end positions of the words that start at the position before the colon. 0:[0,1] represents the single character at position 0 and the two-character word "plant"; 4:[4] represents the single character "product"; 6:[6,8] represents "crude" and "crude fiber".
As another example, the directed acyclic graph constructed for "measurement of vegetable protein" is shown in FIG. 3. Its internal data structure is as follows:
0:[0,1,3]
1:[1]
2:[2,3]
3:[3]
4:[4]
5:[5,6]
6:[6]
where 0:[0,1,3] represents the single character at position 0, the two-character word "plant", and the four-character word "vegetable protein".
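A short Python sketch may help show how such a structure can be derived from the prefix dictionary. It is an illustration only: the function name `build_dag` is hypothetical, the seven placeholder letters `A` to `G` stand in for the seven characters of "measurement of vegetable protein" (the original characters are not reproduced in this text), and the assignment of the excerpted frequencies to placeholders is an assumption.

```python
def build_dag(sentence, prefix_dict):
    """Map each start position to the end positions of words starting there."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = []
        end = start
        fragment = sentence[start]
        # Keep extending while the fragment is still a known prefix.
        while end < n and fragment in prefix_dict:
            if prefix_dict[fragment] > 0:  # a real word, not only a prefix
                ends.append(end)
            end += 1
            fragment = sentence[start:end + 1]
        if not ends:  # unknown character: it stands alone
            ends.append(start)
        dag[start] = ends
    return dag


# Placeholder prefix dictionary (assumption): ABCD = "vegetable protein",
# AB = "plant", CD = "protein", E = "of", FG = "measurement".
prefix_dict = {
    "A": 560, "B": 8620, "AB": 7735, "ABC": 0, "ABCD": 3,
    "C": 3862, "D": 12266, "CD": 1087,
    "E": 318825, "F": 2083, "FG": 1768, "G": 15882,
}
dag = build_dag("ABCDEFG", prefix_dict)
for start, ends in dag.items():
    print(start, ends)
```

The zero-frequency entry `ABC` is what lets the scan continue from `AB` to `ABCD`; without it the loop would stop before reaching the four-character word.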
Once the directed acyclic graph is obtained, a sentence has multiple paths from beginning to end, and each path corresponds to one word segmentation mode, for example:
"measurement of crude fiber in plant food":
Word segmentation mode 1: plant/class/food/in/crude fiber/of/measurement
Word segmentation mode 2: plant/food-like product/in/crude/fiber/of/measurement
Word segmentation mode 3: plant/object/class/food/in/crude fiber/of/measurement
...
"measurement of vegetable protein":
Word segmentation mode 1: plant/object/protein/of/measurement
Word segmentation mode 2: plant/protein/of/measurement
Word segmentation mode 3: vegetable protein/of/measurement
...
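To make the correspondence between paths and word segmentation modes concrete, the following Python sketch enumerates every path through the directed acyclic graph of FIG. 3. The function name `all_paths` is hypothetical, and the placeholder letters `A` to `G` stand in for the seven characters of "measurement of vegetable protein" (an assumption, since the characters are not reproduced in this text). Exhaustive enumeration is shown only for illustration; it grows quickly with sentence length.

```python
def all_paths(dag, start=0):
    """Return every path as a list of (start, end) spans covering the sentence."""
    if start == len(dag):
        return [[]]  # one empty path: the end point has been reached
    paths = []
    for end in dag[start]:
        for rest in all_paths(dag, end + 1):
            paths.append([(start, end)] + rest)
    return paths


# Directed acyclic graph of FIG. 3, copied from the listing in the text.
dag = {0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3], 4: [4], 5: [5, 6], 6: [6]}
sentence = "ABCDEFG"  # placeholder characters (assumption)

segmentations = [
    "/".join(sentence[s:e + 1] for s, e in path) for path in all_paths(dag)
]
for seg in segmentations:
    print(seg)
```

Under these assumptions the graph yields ten segmentations, among them `A/B/CD/E/FG`, `AB/CD/E/FG`, and `ABCD/E/FG`, which correspond to the three modes listed for this string.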
With so many candidate paths, probability decides: the path with the greatest likelihood probability is the best word segmentation result. The calculation is performed by dynamic programming.
The weight of each arrow in the directed acyclic graph is the word frequency of the individual word divided by the total word frequency in the dictionary. The word frequency of each word comes from the second column of the online prefix dictionary, and the total word frequency is the sum of the second column of the online prefix dictionary. To avoid weight values that are too small, the natural logarithm of the weight is taken.
$$w_i = \ln\left(\frac{f_i}{f_{all}}\right)$$
where $w_i$ is the weight of each arrow (span) in the directed acyclic graph, $f_i$ is the word frequency of the corresponding word, $f_{all}$ is the total word frequency in the prefix dictionary, and $\ln$ is the natural logarithm, taken to keep the weight values from becoming too small;
$$route = w_1 + w_2 + \dots + w_n = \sum_{i=1}^{n} w_i$$
The weight $route$ of a word segmentation path is the sum of the weights of the words on the path from head to tail.
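As a worked illustration of the two formulas, the Python sketch below computes the weight of a few candidate paths for "measurement of vegetable protein". Several things here are assumptions rather than statements from the source: the placeholder letters `A` to `G` stand in for the seven characters of the string, the helper names `arrow_weight` and `route_weight` are hypothetical, and `TOTAL` is an assumed total word frequency. The excerpt does not give the total, and this value was chosen only because it is consistent with the log-probabilities quoted in the text. Assigning 1768 to the two-character word "measurement" is an assumption made for the same reason.

```python
import math

TOTAL = 60_101_967  # assumed total word frequency; not stated in the source

# Excerpted frequencies mapped to placeholder letters (assumption).
FREQ = {
    "A": 560, "B": 8620, "AB": 7735, "ABCD": 3,
    "C": 3862, "D": 12266, "CD": 1087,
    "E": 318825, "F": 2083, "FG": 1768, "G": 15882,
}


def arrow_weight(word):
    """w_i = ln(f_i / f_all) for one arrow of the graph."""
    return math.log(FREQ[word] / TOTAL)


def route_weight(words):
    """route = w_1 + w_2 + ... + w_n for one word segmentation path."""
    return sum(arrow_weight(w) for w in words)


for path in (["ABCD", "E", "FG"], ["AB", "CD", "E", "FG"], ["A", "B", "CD", "E", "FG"]):
    print("/".join(path), round(route_weight(path), 2))
```

Because every weight is the logarithm of a value below 1, each weight is negative, and a path with fewer, more frequent words tends to have the larger (less negative) sum.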
The maximum likelihood probability is obtained specifically as follows:
the maximum among the weights of all word segmentation paths is taken as the maximum likelihood probability from the word at the current search position to the end point.
Taking "measurement of vegetable proteins" as an example, the results based on the weights and maximum likelihood probabilities of the directed acyclic graph of FIG. 3 are shown in FIG. 4.
When index=6, the best word is "fixed" (the last character on its own).
When index=5, the best word is "measurement", because the natural logarithm of the probability of "measurement" as a single word, -10.43, is greater than that of the same two characters cut separately, -18.51.
When index=4, the best segmentation is "of/measurement", because "measurement" is already the optimal segmentation for the subsequent node.
When index=3, the best segmentation is "white/of/measurement".
When index=2, the best segmentation is "protein/of/measurement", because the natural logarithm of its probability, -26.59, is greater than that of "egg/white/of/measurement", -33.82.
...
When index=0, the best segmentation is "vegetable protein/of/measurement", because the natural logarithm of its probability, -32.49, is greater than that of "plant/protein/of/measurement", -35.55, and also greater than that of "plant/object/protein/of/measurement", -47.03.
Conclusion: the best segmentation is "vegetable protein/of/measurement", with a natural logarithm of probability of -32.49.
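The backward search traced above can be sketched in Python as a small dynamic program. This is an illustration under assumptions, not the patent's code: the function name `best_segmentation` is hypothetical, the placeholder letters `A` to `G` stand in for the seven characters of "measurement of vegetable protein", and `TOTAL` is an assumed total word frequency chosen only because it is consistent with the values quoted in the text (the source does not state it). The mapping of excerpted frequencies to placeholders is likewise an assumption.

```python
import math


def best_segmentation(sentence, dag, freq, total):
    """Search from the end point back to the start, keeping the best tail."""
    n = len(sentence)
    log_total = math.log(total)
    # route[i] = (best log-probability from position i to the end, end of first word)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(freq[sentence[start:end + 1]]) - log_total
             + route[end + 1][0], end)
            for end in dag[start]
        )
    # Read the chosen words off from the start point.
    words, pos = [], 0
    while pos < n:
        end = route[pos][1]
        words.append(sentence[pos:end + 1])
        pos = end + 1
    return words, route[0][0]


TOTAL = 60_101_967  # assumed total word frequency; not stated in the source
FREQ = {  # excerpted frequencies mapped to placeholder letters (assumption)
    "A": 560, "B": 8620, "AB": 7735, "ABCD": 3,
    "C": 3862, "D": 12266, "CD": 1087,
    "E": 318825, "F": 2083, "FG": 1768, "G": 15882,
}
DAG = {0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3], 4: [4], 5: [5, 6], 6: [6]}

words, score = best_segmentation("ABCDEFG", DAG, FREQ, TOTAL)
print("/".join(words), round(score, 2))
```

Each position is evaluated once and reuses the best result already stored for the positions after it, so the work grows with the number of arrows in the graph rather than with the number of paths.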
Further, taking "measurement of crude fiber in plant food" as an example, the result based on the weights and maximum likelihood probabilities of the directed acyclic graph of FIG. 2 is shown in FIG. 5. The best segmentation is "plant/food-like product/in/crude fiber/of/measurement", with a natural logarithm of probability of -63.77.
Third embodiment
As shown in fig. 6, the present embodiment provides a maximum likelihood based scientific vocabulary word segmentation system for performing the maximum likelihood based scientific vocabulary word segmentation method as in the first embodiment, comprising:
an offline dictionary acquisition module 1, used for acquiring an offline dictionary in the same field as the scientific and technological vocabulary input character string to be segmented, wherein the offline dictionary comprises two columns, a word and a word frequency;
a prefix dictionary construction module 2, used for constructing a prefix dictionary of the scientific and technological vocabulary input character string to be segmented in a memory based on the offline dictionary;
a directed acyclic graph construction module 3, used for cutting the scientific and technological vocabulary input character string to be segmented based on the prefix dictionary to construct a directed acyclic graph;
and a word segmentation result output module 4, used for acquiring all word segmentation paths of the scientific and technological vocabulary input character string to be segmented based on the directed acyclic graph, and calculating the word segmentation path with the maximum likelihood probability among the word segmentation paths as the word segmentation result of the input character string.
A computer readable storage medium stores computer code which, when executed, performs the method described above. Those of ordinary skill in the art will appreciate that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing related hardware; the program may be stored in a computer readable storage medium, and the storage medium may include Read-Only Memory (ROM), Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The above description is only a preferred embodiment of the present application, and the protection scope of the present application is not limited to the above examples, and all technical solutions belonging to the concept of the present application belong to the protection scope of the present application. It should be noted that modifications and adaptations to the present application may occur to one skilled in the art without departing from the principles of the present application and are intended to be within the scope of the present application.
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
It should be noted that the above embodiments can be freely combined as needed. The foregoing is merely a preferred embodiment of the present application and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present application, which are intended to be comprehended within the scope of the present application.

Claims (10)

1. A scientific and technological vocabulary word segmentation method based on maximum likelihood probability is characterized by comprising the following steps:
s1: acquiring an offline dictionary in the same field as a scientific and technological vocabulary input character string of a word to be segmented, wherein the offline dictionary comprises two columns including a word and a word frequency;
s2: constructing a prefix dictionary of the technological vocabulary input character strings to be segmented in a memory based on the offline dictionary;
s3: based on the prefix dictionary, segmenting the technological vocabulary input character strings to be segmented to construct a directed acyclic graph;
s4: and acquiring all word segmentation paths of the technical vocabulary input character strings to be segmented based on the directed acyclic graph, and calculating the word segmentation path with the maximum likelihood probability in the word segmentation paths as a word segmentation result of the technical vocabulary input character strings to be segmented.
2. The maximum likelihood based scientific vocabulary word segmentation method according to claim 1, further comprising, prior to step S1, establishing the offline dictionary, in particular:
s11: acquiring a large-scale text corpus, wherein the text corpus comprises text data in different fields;
s12: performing text preprocessing on the text corpus, wherein the text preprocessing comprises the removal of punctuation marks, special characters and numbers;
s13: word segmentation is carried out on the text corpus after the text pretreatment, and word segmentation results are obtained;
s14: traversing each word in the word segmentation result, and counting the word frequency of each word;
s15: and storing each word in the word segmentation result and the corresponding word frequency in a one-to-one correspondence manner.
3. The maximum likelihood based technical vocabulary word segmentation method according to claim 1, wherein in step S2, the prefix dictionary of the technical vocabulary input character string to be segmented is built in a memory based on the offline dictionary, specifically:
s21: sequentially obtaining each word of the technological vocabulary input character string to be segmented;
s22: acquiring all prefixes of each word;
s23: traversing the word frequency of all the prefixes of each word in the offline dictionary, wherein the word frequency takes the word frequency in the offline dictionary when the prefixes are in the offline dictionary, and takes 0 when the prefixes are not in the offline dictionary.
4. The maximum likelihood-based technological vocabulary word segmentation method according to claim 1, wherein in step S3, based on the prefix dictionary, the technological vocabulary input character string to be segmented is segmented, and the directed acyclic graph is constructed, specifically:
for independent words without prefixes in the technical vocabulary input character string of the words to be segmented, only one segmentation mode exists, the independent words are formed, and for words with prefixes in the technical vocabulary input character string of the words to be segmented, all segmentation modes are listed;
the internal structure of the directed acyclic graph is as follows:
0:[q_1, q_2, ..., q_n];
1:[q_1, q_2, ..., q_n];
...
m-1:[q_1, q_2, ..., q_n];
wherein 0 to m-1 represent the positions, within the sentence, of the single words of the scientific and technological vocabulary input character string to be segmented, starting from 0 and incrementing by 1 up to the last position m-1, m being the number of words; q_1 to q_n are the spans of the word segmentation results for words beginning with the current word, and n is the number of word segmentations for words beginning with the current word.
5. The maximum likelihood-based technical vocabulary word segmentation method according to claim 1, wherein in step S4, all word segmentation paths of the technical vocabulary input character string to be segmented are obtained based on the directed acyclic graph, and the word segmentation path with the maximum likelihood probability in the word segmentation paths is calculated as the word segmentation result of the technical vocabulary input character string to be segmented, specifically:
carrying out path planning on the directed acyclic graph by adopting a dynamic path optimization algorithm, wherein the method comprises reverse searching optimization and forward solving;
searching from the end point to the starting point of the directed acyclic graph by adopting the reverse optimizing method, calculating weights of all word segmentation paths from the word at the current searching position to the end point by adopting the forward solving method aiming at the word at the current searching position in the directed acyclic graph in the searching process, and acquiring the maximum likelihood probability from the word at the current searching position to the end point according to the weights, wherein the word segmentation path corresponding to the maximum likelihood probability is used as the final word segmentation path from the word at the current searching position to the end point;
the reverse optimizing method is adopted to search continuously to a starting point on the basis of the final word segmentation path from the word at the current searching position to the end point, the forward solving method is adopted to calculate the weights of all the word segmentation paths based on the final word segmentation path determined by the last searching position from the word at the next searching position to the end point, the maximum likelihood probability from the word at the next searching position to the end point is obtained according to the weights, the word segmentation path corresponding to the maximum likelihood probability is used as the final word segmentation path from the word at the next searching position to the end point until the starting point is searched, and the final word segmentation path of the complete technological vocabulary input character string to be segmented is obtained;
and taking the final word segmentation path of the whole technological vocabulary input character string to be segmented as the word segmentation result of the technological vocabulary input character string to be segmented.
6. The maximum likelihood based scientific vocabulary word segmentation method according to claim 5, wherein the weight of the word segmentation path is calculated by the following steps:
$$w_i = \ln\left(\frac{f_i}{f_{all}}\right)$$
wherein $w_i$ is the weight of each arrow (span) in the directed acyclic graph, $f_i$ is the word frequency of the corresponding word, $f_{all}$ is the total word frequency in the prefix dictionary, and $\ln$ is the natural logarithm, taken to keep the weight values from becoming too small;
$$route = w_1 + w_2 + \dots + w_n = \sum_{i=1}^{n} w_i$$
the weight $route$ of the word segmentation path is the sum of the weights of the words on the path from head to tail.
7. The maximum likelihood based scientific vocabulary word segmentation method according to claim 5, wherein the maximum likelihood is obtained specifically as follows:
the maximum among the weights of all word segmentation paths is taken as the maximum likelihood probability from the word at the current search position to the end point.
8. A maximum likelihood based scientific vocabulary word segmentation system for performing the maximum likelihood based scientific vocabulary word segmentation method of claims 1-7 comprising:
an offline dictionary acquisition module, used for acquiring an offline dictionary in the same field as the scientific and technological vocabulary input character string to be segmented, wherein the offline dictionary comprises two columns, a word and a word frequency;
the prefix dictionary construction module is used for constructing a prefix dictionary of the technological vocabulary input character strings to be segmented in a memory based on the offline dictionary;
the directed acyclic graph construction module is used for cutting the technological vocabulary input character strings to be segmented on the basis of the prefix dictionary to construct a directed acyclic graph;
and the word segmentation result output module is used for acquiring all word segmentation paths of the technical vocabulary input character strings to be segmented based on the directed acyclic graph, and calculating the word segmentation path with the maximum likelihood probability in the word segmentation paths as the word segmentation result of the technical vocabulary input character strings to be segmented.
9. A computer device comprising a memory and one or more processors, the memory having stored therein computer code that, when executed by the one or more processors, causes the one or more processors to perform the method of any of claims 1-7.
10. A computer readable storage medium storing computer code which, when executed, performs the method of any one of claims 1 to 7.
CN202310690365.0A 2023-06-12 2023-06-12 Scientific and technological vocabulary word segmentation method and system based on maximum likelihood probability Pending CN116933783A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310690365.0A CN116933783A (en) 2023-06-12 2023-06-12 Scientific and technological vocabulary word segmentation method and system based on maximum likelihood probability

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310690365.0A CN116933783A (en) 2023-06-12 2023-06-12 Scientific and technological vocabulary word segmentation method and system based on maximum likelihood probability

Publications (1)

Publication Number Publication Date
CN116933783A true CN116933783A (en) 2023-10-24

Family

ID=88385342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310690365.0A Pending CN116933783A (en) 2023-06-12 2023-06-12 Scientific and technological vocabulary word segmentation method and system based on maximum likelihood probability

Country Status (1)

Country Link
CN (1) CN116933783A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination