CN116933783A - Scientific and technological vocabulary word segmentation method and system based on maximum likelihood probability

Info

Publication number: CN116933783A
Application number: CN202310690365.0A
Authority: CN
Inventors: 何军; 赵燕; 胡俊松; 徐旻昕
Original assignee: Shanghai R&D Public Service Platform Management Center
Current assignee: Shanghai R&D Public Service Platform Management Center
Priority date: 2023-06-12
Filing date: 2023-06-12
Publication date: 2023-10-24

Abstract

The application relates to the technical field of word segmentation algorithms and provides a scientific and technological vocabulary word segmentation method based on maximum likelihood probability, comprising the following steps: S1: acquiring an offline dictionary in the same field as the scientific and technological vocabulary input character string to be segmented, wherein the offline dictionary comprises two columns, a word and a word frequency; S2: constructing, in a memory, a prefix dictionary for the input character string based on the offline dictionary; S3: segmenting the input character string based on the prefix dictionary to construct a directed acyclic graph; S4: acquiring all word segmentation paths of the input character string based on the directed acyclic graph, and taking the word segmentation path with the maximum likelihood probability as the word segmentation result of the input character string. The optimal segmentation path is thus calculated based on the maximum likelihood probability, and the optimal word segmentation result is obtained.

Description

Scientific and technological vocabulary word segmentation method and system based on maximum likelihood probability

Technical Field

The application relates to the technical field of word segmentation algorithms, in particular to a scientific and technological vocabulary word segmentation method and system based on maximum likelihood probability.

Background

Word segmentation is the process of splitting continuous text into individual words or phrases. In natural language processing, word segmentation is an important preprocessing step and is significant for subsequent text processing and analysis tasks. Its role varies with the purpose and application. Common uses of word segmentation include the following:

Language understanding and semantic analysis: by decomposing a passage of text into words, the structure and semantics of sentences can be better understood. This facilitates natural language processing tasks such as part-of-speech tagging, syntactic analysis and semantic role labeling, so that more semantic information and context can be extracted.

Information retrieval and search: in information retrieval and search engines, word segmentation of query sentences may split a query into separate keywords to more accurately match and retrieve related documents or web pages. This helps to improve the relevance and accuracy of the search results.

Machine translation: word segmentation is critical to the task of machine translation. Decomposing the source language sentence into words may better correspond to words or phrases in the target language, thereby helping the machine translation system to perform accurate translations.

Text mining and information extraction: word segmentation provides a basis for text mining and information extraction tasks. By segmenting the text into words, information such as keywords, named entities and phrases can be better identified, which helps in mining and extracting specific information.

Text classification and sentiment analysis: in text classification and sentiment analysis, word segmentation converts the text into discrete feature representations for classification, sentiment judgment and other tasks. Segmenting text into words provides richer feature information and helps improve the accuracy of classification and sentiment analysis.

In summary, word segmentation splits continuous text into discrete words, thereby providing more accurate and richer language expressions and feature representations for subsequent natural language processing tasks.

In the prior art, predefined rules are typically employed to segment sentences. For example, the segmentation may be based on spaces, punctuation, or specific delimiters. This approach is simple and direct, but because each word may be split at several different prefixes, no optimal segmentation mode is sought, and the segmentation result is not optimal.

Disclosure of Invention

In view of the above problems, the application aims to provide a scientific and technological vocabulary word segmentation method and system based on maximum likelihood probability, which calculate an optimal segmentation path based on the maximum likelihood probability and obtain an optimal word segmentation result.

The above object of the present application is achieved by the following technical solutions:

A scientific and technological vocabulary word segmentation method based on maximum likelihood probability comprises the following steps:

S1: acquiring an offline dictionary in the same field as the scientific and technological vocabulary input character string to be segmented, wherein the offline dictionary comprises two columns, a word and a word frequency;

S2: constructing, in a memory, a prefix dictionary for the input character string to be segmented based on the offline dictionary;

S3: segmenting the input character string to be segmented based on the prefix dictionary to construct a directed acyclic graph;

S4: acquiring all word segmentation paths of the input character string to be segmented based on the directed acyclic graph, and taking the word segmentation path with the maximum likelihood probability among the word segmentation paths as the word segmentation result of the input character string.

Further, before step S1, the method further includes establishing the offline dictionary, specifically:

S11: acquiring a large-scale text corpus, wherein the text corpus comprises text data in different fields;

S12: performing text preprocessing on the text corpus, wherein the text preprocessing comprises removing punctuation marks, special characters and numbers;

S13: performing word segmentation on the preprocessed text corpus to obtain word segmentation results;

S14: traversing each word in the word segmentation results, and counting the word frequency of each word;

S15: storing each word in the word segmentation results together with its corresponding word frequency in one-to-one correspondence.
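Steps S11 to S15 can be sketched as follows. This is a minimal illustration, not the method of the application: the names `preprocess` and `count_word_frequencies` are hypothetical, the two-sentence corpus is invented, and whitespace splitting stands in for the unspecified segmenter of S13.

```python
import re
from collections import Counter


def preprocess(text: str) -> str:
    """S12: replace punctuation, special characters and digits with spaces."""
    # [^\w\s] matches punctuation and other symbols, \d matches digits, and the
    # underscore is listed separately because \w treats it as a word character.
    return re.sub(r"[^\w\s]|\d|_", " ", text)


def count_word_frequencies(corpus, segment=str.split):
    """S13 to S15: segment each cleaned document and map each word to its frequency."""
    # Assumption: `segment` is any callable returning a list of words; whitespace
    # splitting is only a stand-in for a real word segmenter.
    frequencies = Counter()
    for document in corpus:
        frequencies.update(segment(preprocess(document)))
    return dict(frequencies)


# Invented two-document corpus, used only to exercise the functions.
corpus = [
    "crude fiber assay, 2023!",
    "plant protein assay (method 2)",
]
offline_dictionary = count_word_frequencies(corpus)
print(offline_dictionary)
```

The resulting mapping has the two-column shape (word, word frequency) that step S1 expects.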

Further, in step S2, the prefix dictionary for the input character string to be segmented is constructed in a memory based on the offline dictionary, specifically:

S21: sequentially obtaining each word of the input character string to be segmented;

S22: acquiring all prefixes of each word;

S23: looking up the word frequency of every prefix of each word in the offline dictionary, wherein a prefix that is in the offline dictionary takes its word frequency from the offline dictionary, and a prefix that is not in the offline dictionary takes the word frequency 0.

Further, in step S3, based on the prefix dictionary, the technological vocabulary input character string to be segmented is segmented, and the directed acyclic graph is constructed, specifically:

for an independent word that has no prefix in the input character string to be segmented, there is only one segmentation mode, namely the independent word itself; for a word that has prefixes in the input character string to be segmented, all segmentation modes are listed;

the internal structure of the directed acyclic graph is as follows:

0: [q_1, q_2, ..., q_n];

1: [q_1, q_2, ..., q_n];

...

m-1: [q_1, q_2, ..., q_n];

wherein the keys 0 to m-1 denote the positions of the individual characters in the input character string to be segmented, starting at 0 and increasing by 1 up to the last position m-1, m being the number of characters; q_1 to q_n are the spans (end positions) of the candidate words that begin at the current character, and n is the number of candidate words that begin at the current character.

Further, in step S4, all word segmentation paths of the technical vocabulary input character string to be segmented are obtained based on the directed acyclic graph, and the word segmentation path with the maximum likelihood probability in the word segmentation paths is calculated as the word segmentation result of the technical vocabulary input character string to be segmented, specifically:

carrying out path planning on the directed acyclic graph by adopting a dynamic path optimization algorithm, wherein the method comprises reverse searching optimization and forward solving;

searching from the end point to the starting point of the directed acyclic graph by adopting the reverse optimizing method, calculating weights of all word segmentation paths from the word at the current searching position to the end point by adopting the forward solving method aiming at the word at the current searching position in the directed acyclic graph in the searching process, and acquiring the maximum likelihood probability from the word at the current searching position to the end point according to the weights, wherein the word segmentation path corresponding to the maximum likelihood probability is used as the final word segmentation path from the word at the current searching position to the end point;

the reverse optimizing method is adopted to search continuously to a starting point on the basis of the final word segmentation path from the word at the current searching position to the end point, the forward solving method is adopted to calculate the weights of all the word segmentation paths based on the final word segmentation path determined by the last searching position from the word at the next searching position to the end point, the maximum likelihood probability from the word at the next searching position to the end point is obtained according to the weights, the word segmentation path corresponding to the maximum likelihood probability is used as the final word segmentation path from the word at the next searching position to the end point until the starting point is searched, and the final word segmentation path of the complete technological vocabulary input character string to be segmented is obtained;

and taking the final word segmentation path of the whole technological vocabulary input character string to be segmented as the word segmentation result of the technological vocabulary input character string to be segmented.
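The reverse search with forward solving described above behaves like a right-to-left dynamic program over the directed acyclic graph. The sketch below is one possible reading, not the implementation of the application: `best_route` is a hypothetical name, the letters A to G are placeholders for seven characters, the frequencies are illustrative, the total frequency of 70,000,000 is invented, and treating a zero or missing frequency as 1 is an added assumption to avoid log(0).

```python
import math


def best_route(sentence, dag, prefix_dictionary, total_frequency):
    """Return (words, log weight) of the path with the largest summed log weight."""
    log_total = math.log(total_frequency)
    # route[i] = (best log weight from position i to the end, end of the first word)
    route = {len(sentence): (0.0, 0)}
    # Reverse search: walk from the end point back to the starting point.
    for start in range(len(sentence) - 1, -1, -1):
        # Forward solving: weigh every arrow leaving `start`, reusing the best
        # result already fixed for the position just after each arrow.
        route[start] = max(
            (
                # Assumption: a frequency of 0 (or a missing word) counts as 1.
                math.log(prefix_dictionary.get(sentence[start:end + 1]) or 1)
                - log_total
                + route[end + 1][0],
                end,
            )
            for end in dag[start]
        )
    # Read the chosen words off from the starting point.
    words, position = [], 0
    while position < len(sentence):
        end = route[position][1]
        words.append(sentence[position:end + 1])
        position = end + 1
    return words, route[0][0]


# Placeholder sentence: each letter stands for one character of the input string.
sentence = "ABCDEFG"
dag = {0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3], 4: [4], 5: [5, 6], 6: [6]}
prefix_dictionary = {
    "A": 560, "B": 8620, "AB": 7735, "ABC": 0, "ABCD": 3,
    "C": 3862, "D": 12266, "CD": 1087, "E": 318825,
    "F": 1768, "G": 15882, "FG": 2083,
}
words, log_weight = best_route(sentence, dag, prefix_dictionary, 70_000_000)
print(words, round(log_weight, 2))
```

Because each position reuses the result already fixed for the positions after it, every arrow of the graph is weighed once, rather than once per path that passes through it.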

Further, the weight of the word segmentation path is calculated as follows:

$w_i = \ln\left(f_i / f_{all}\right)$

wherein $w_i$ is the weight of each arrow representing a span in the directed acyclic graph, $f_i$ is the word frequency of each word, $f_{all}$ is the total word frequency in the prefix dictionary, and the natural logarithm ln is taken to avoid the weight value being too small;

$route = w_1 + w_2 + \dots + w_n = \sum_i w_i$

the weight route of the word segmentation path is the sum of the weights of each word on the path from its head to its tail.
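The link between the summed weights and a likelihood can be made explicit. Taking each arrow weight to be the natural logarithm of the word frequency divided by the total word frequency, and assuming (this is not stated in the text) that the words on a path are treated as independent, the route weight is the logarithm of the product of the word probabilities:

```latex
\begin{aligned}
\mathrm{route}
  &= \sum_{i=1}^{n} w_i
   = \sum_{i=1}^{n} \ln\frac{f_i}{f_{all}}
   = \ln\prod_{i=1}^{n}\frac{f_i}{f_{all}}
\end{aligned}
```

Since the natural logarithm is monotonically increasing, the path with the largest route weight is also the path with the largest product of word probabilities, and working with sums of logarithms avoids multiplying many very small numbers.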

Further, the maximum likelihood probability is obtained, specifically:

taking the maximum of the weights of all the word segmentation paths as the maximum likelihood probability from the word at the current search position to the end point.

A scientific and technological vocabulary word segmentation system based on maximum likelihood probability, for performing the above-described scientific and technological vocabulary word segmentation method based on maximum likelihood probability, comprises:

an offline dictionary acquisition module, configured to acquire an offline dictionary in the same field as the scientific and technological vocabulary input character string to be segmented, wherein the offline dictionary comprises two columns, a word and a word frequency;

a prefix dictionary construction module, configured to construct, in a memory, a prefix dictionary for the input character string to be segmented based on the offline dictionary;

a directed acyclic graph construction module, configured to segment the input character string to be segmented based on the prefix dictionary to construct a directed acyclic graph;

and a word segmentation result output module, configured to acquire all word segmentation paths of the input character string to be segmented based on the directed acyclic graph, and to take the word segmentation path with the maximum likelihood probability among the word segmentation paths as the word segmentation result of the input character string.

A computer device comprising a memory and one or more processors, the memory having stored therein computer code which, when executed by the one or more processors, causes the one or more processors to perform a method as described above.

A computer readable storage medium storing computer code which, when executed, performs a method as described above.

Compared with the prior art, the application has at least one of the following beneficial effects:

(1) According to the technical scheme of steps S1 to S4, a prefix dictionary is built from an existing offline dictionary that includes word frequencies, a directed acyclic graph is constructed by segmenting the input character string based on the prefix dictionary, the likelihood probability of each word segmentation path is calculated, and the path with the maximum likelihood probability is taken as the optimal segmentation path, yielding the optimal word segmentation result.

(2) Because the word segmentation paths are established from an offline dictionary based on word frequency, the word segmentation result is closer to actual word segmentation habits and is more accurate.

Drawings

FIG. 1 is a general flow chart of the scientific and technological vocabulary word segmentation method based on maximum likelihood probability according to the present application;

FIG. 2 is the directed acyclic graph constructed for "measurement of crude fiber in plant food" according to the present application;

FIG. 3 is the directed acyclic graph constructed for "measurement of plant protein" according to the present application;

FIG. 4 shows the maximum likelihood probability calculation result for "measurement of plant protein" according to the present application;

FIG. 5 shows the maximum likelihood probability calculation result for "measurement of crude fiber in plant food" according to the present application;

FIG. 6 is an overall block diagram of the scientific and technological vocabulary word segmentation system based on maximum likelihood probability according to the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

First embodiment

As shown in fig. 1, the present embodiment provides a scientific vocabulary word segmentation method based on maximum likelihood probability, which includes the following steps:

S1: acquiring an offline dictionary in the same field as the scientific and technological vocabulary input character string to be segmented, wherein the offline dictionary comprises two columns, a word and a word frequency.

Specifically, the offline dictionary may be an existing offline dictionary obtained from another source, or may be self-built. A self-built offline dictionary can be established as follows:

S11: acquiring a large-scale text corpus comprising text data in different fields. The text corpus may come from news, books, Wikipedia, or other sources.

S12: performing text preprocessing on the text corpus, including removing punctuation marks, special characters and numbers. This may be implemented using conventional text processing tools or programming languages.

S13: performing word segmentation on the preprocessed text corpus to obtain word segmentation results, for example using a maximum matching algorithm or a machine-learning-based approach such as a conditional random field (CRF) or a recurrent neural network (RNN).

S14: traversing each word in the word segmentation results and counting the word frequency of each word. A hash table, dictionary, or other data structure may be used to record the words and their frequency information.

In addition, the offline dictionary can be optimized as required, for example: removing low-frequency words, that is, deleting words whose frequency is so low that they may be noise or unimportant information; and merging words, that is, combining adjacent, frequently co-occurring words into phrases or proper nouns to improve word segmentation accuracy. The optimized dictionary is stored in a suitable format, such as a text file, binary file or database. Common data serialization formats such as JSON, CSV, or Pickle may be used. When word segmentation is needed, the offline dictionary is loaded into memory for fast lookup.
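The optimization and storage steps can be illustrated with a short sketch. It is only an example under stated assumptions: the function names are hypothetical, the threshold `min_frequency=2` is arbitrary, the three-word dictionary is invented, and JSON is chosen merely as one of the serialization formats mentioned above.

```python
import json
import os
import tempfile


def prune_low_frequency(dictionary, min_frequency):
    """Drop words whose frequency is below the (hypothetical) threshold."""
    return {word: freq for word, freq in dictionary.items() if freq >= min_frequency}


def save_dictionary(dictionary, path):
    """Store the word-to-frequency mapping as a JSON text file."""
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII words readable in the file.
        json.dump(dictionary, handle, ensure_ascii=False)


def load_dictionary(path):
    """Load the offline dictionary back into memory for fast lookup."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Invented dictionary used only to exercise the round trip.
raw = {"assay": 2083, "protein": 1087, "noise": 1}
pruned = prune_low_frequency(raw, min_frequency=2)
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "offline_dictionary.json")
    save_dictionary(pruned, path)
    reloaded = load_dictionary(path)
print(reloaded)
```

A pruning threshold needs care in practice, since a domain dictionary may contain genuine technical terms with very low frequencies.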

S2: constructing, in a memory, a prefix dictionary for the input character string to be segmented based on the offline dictionary, specifically:

S21: sequentially obtaining each word of the input character string to be segmented.

S22: acquiring all prefixes of each word.

S23: looking up the word frequency of every prefix of each word in the offline dictionary, wherein a prefix that is in the offline dictionary takes its word frequency from the offline dictionary, and a prefix that is not in the offline dictionary takes the word frequency 0.

S3: based on the prefix dictionary, the technological vocabulary input character string to be segmented is segmented, and a directed acyclic graph is constructed, specifically:

the internal structure of the directed acyclic graph is as follows:

0: [q_1, q_2, ..., q_n];

1: [q_1, q_2, ..., q_n];

...

m-1: [q_1, q_2, ..., q_n];

S4: based on the directed acyclic graph, acquiring all word segmentation paths of the technical vocabulary input character strings to be segmented, and calculating the word segmentation path with the maximum likelihood probability in the word segmentation paths as a word segmentation result of the technical vocabulary input character strings to be segmented, wherein the word segmentation result comprises the following specific steps:

and carrying out path planning on the directed acyclic graph by adopting a dynamic path optimization algorithm, wherein the method comprises reverse searching optimization and forward solving.

And searching from the end point to the starting point of the directed acyclic graph by adopting the reverse optimizing method, calculating weights of all word segmentation paths from the word at the current searching position to the end point by adopting the forward solving method aiming at the word at the current searching position in the directed acyclic graph in the searching process, and acquiring the maximum likelihood probability from the word at the current searching position to the end point according to the weights, wherein the word segmentation path corresponding to the maximum likelihood probability is used as the final word segmentation path from the word at the current searching position to the end point.

And adopting the reverse optimizing method to continue searching towards a starting point based on the final word segmentation path from the word at the current searching position to the end point, adopting the forward solving method to calculate the weights of all the word segmentation paths based on the final word segmentation path determined by the last searching position from the word at the next searching position to the end point, and acquiring the maximum likelihood probability from the word at the next searching position to the end point according to the weights, wherein the word segmentation path corresponding to the maximum likelihood probability is used as the final word segmentation path from the word at the next searching position to the end point until the starting point is searched, so as to acquire the final word segmentation path of the complete technological vocabulary input character string to be segmented.

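In step S4, the weight of the word segmentation path is calculated as follows:

$w_i = \ln\left(f_i / f_{all}\right)$

$route = w_1 + w_2 + \dots + w_n = \sum_i w_i$

wherein $w_i$ is the weight of each arrow representing a span in the directed acyclic graph, $f_i$ is the word frequency of each word, and $f_{all}$ is the total word frequency in the prefix dictionary.

In step S4, the maximum likelihood probability is obtained by taking the maximum of the weights of all the word segmentation paths from the word at the current search position to the end point.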

Second embodiment

This embodiment is a specific example of the scientific and technological vocabulary word segmentation method based on maximum likelihood probability of the first embodiment. The example is as follows:

Taking "measurement of crude fiber in plant food" and "measurement of plant protein" as the two scientific and technological vocabulary input character strings to be segmented, an offline dictionary in the same field as the two input character strings is acquired. The offline dictionary has two columns: the first column is the word and the second column is the word frequency.

For reasons of space, only the entries of the offline dictionary that relate to the two input character strings "measurement of crude fiber in plant food" and "measurement of plant protein" are excerpted below:

...

plant 560

Object 8620

In 243191

Plant 7735

Crude 2598

Fiber 403

Food-like product 3

Dimension 1685

Of 318825

Crude fiber 3

Fiber 1879

Measurement 1768

Measurement 2083

Fixed 15882

Class 14536

Food 6610

Food 6350

Product 2278

Species 23

Egg 3862

White 12266

Protein 1087

Plant protein 3

...

A prefix dictionary is constructed in memory based on the offline dictionary. For example, the prefixes of the three-character word "crude fiber" in the offline dictionary are its first character ("crude") and its first two characters; the prefix of the two-character word "food" is its first character. The online prefix dictionary formed from the offline dictionary is shown below. The two-character prefix of "crude fiber" is not in the offline dictionary, so its word frequency in the online prefix dictionary is 0. Likewise, the prefixes of the word "plant protein" are its first character, the word "plant", and its first three characters ("plant egg"); "plant egg" is not in the offline dictionary, so its word frequency in the online prefix dictionary is 0. Recording these zero-frequency prefixes facilitates the construction of the directed acyclic graph.

The relevant content of the online prefix dictionary is excerpted as follows:

...

plant 560

Object 8620

In 243191

Plant 7735

Crude 2598

Fiber 403

Food 0

Food-like product 3

Dimension 1685

Of 318825

Crude fiber 3

Crude fiber (two-character prefix) 0

Fiber 1879

Measurement 1768

Measurement 2083

Fixed 15882

Class 14536

Food 6610

Food 6350

Product 2278

Species 23

Egg 3862

White 12266

Protein 1087

Plant protein 3

Plant egg 0

...
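The construction of the online prefix dictionary can be sketched as below. This is an illustration rather than the implementation of the application: `build_prefix_dictionary` is a hypothetical name, the letters G, H and I are placeholders standing for the three characters of one word, and the frequencies are illustrative.

```python
def build_prefix_dictionary(offline_dictionary):
    """Return (prefix_dictionary, total_frequency) for a word-to-frequency mapping."""
    prefix_dictionary = {}
    total_frequency = 0
    for word, frequency in offline_dictionary.items():
        # A dictionary word always keeps its own frequency, even if it was
        # registered earlier as a zero-frequency prefix of a longer word.
        prefix_dictionary[word] = frequency
        total_frequency += frequency
        # Every proper prefix gets frequency 0 unless it is already present.
        for length in range(1, len(word)):
            prefix_dictionary.setdefault(word[:length], 0)
    return prefix_dictionary, total_frequency


# Placeholder entries: "GHI" plays the role of a three-character word whose
# two-character prefix "GH" is not itself a dictionary word.
offline_dictionary = {"G": 2598, "GHI": 3, "HI": 1879, "H": 403, "I": 1685}
prefix_dictionary, total_frequency = build_prefix_dictionary(offline_dictionary)
print(prefix_dictionary)
print(total_frequency)
```

The zero-frequency entry for "GH" tells a scanner that a longer word may still follow, so it keeps extending the candidate instead of stopping at the first unknown fragment.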

The input character string "measurement of crude fiber in plant food" is segmented based on the online prefix dictionary. "In" and "of" have no prefix, so each has only one segmentation mode: they are independent words. "Plant" and "measurement" each have two segmentation modes: the first character alone, or the full two-character word. "Food-like product" and "crude fiber" each have three segmentation modes, corresponding to their one-, two- and three-character prefixes.

FIG. 2 shows the directed acyclic graph constructed for "measurement of crude fiber in plant food". The internal data structure of the directed acyclic graph is as follows:

0:[0,1]

1:[1,2]

2:[2,4]

3:[3,4]

4:[4]

5:[5]

6:[6,8]

7:[7,8]

8:[8]

9:[9]

10:[10,11]

11:[11]

The digits represent the positions of the characters in the sentence. The number before the colon is the primary key, starting at 0 and incrementing by 1 up to the last position of the sentence; for example, 0 denotes the first character of "plant" and 11 denotes the last character of "measurement". The bracketed content after the colon gives the spans of the words that begin at the position before the colon. 0:[0,1] represents the first character of "plant" alone or the word "plant"; 4:[4] represents the single character "product"; 6:[6,8] represents "crude" or "crude fiber".
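A scan of this kind can be sketched as follows. The function name `build_dag` is hypothetical, the letters A to L are placeholders for the twelve characters of the input string, and the word list is an assumption chosen so that the sketch yields the same structure as the listing above; all frequencies are set to 1 because only "greater than 0" matters here.

```python
def build_dag(sentence, prefix_dictionary):
    """Map each start position to the end positions of the words beginning there."""
    dag = {}
    for start in range(len(sentence)):
        ends = []
        end = start
        fragment = sentence[start]
        # Keep extending while the fragment is a known word or a known prefix.
        while end < len(sentence) and fragment in prefix_dictionary:
            if prefix_dictionary[fragment] > 0:  # a real word, not only a prefix
                ends.append(end)
            end += 1
            fragment = sentence[start:end + 1]
        # Assumption: an unknown character still forms a one-character word.
        dag[start] = ends or [start]
    return dag


# Placeholder words; "CD" and "GH" are zero-frequency prefixes of "CDE" and "GHI".
words = list("ABCDEFGHIJKL") + ["AB", "BC", "CDE", "DE", "GHI", "HI", "KL"]
prefix_dictionary = {word: 1 for word in words}
prefix_dictionary.update({"CD": 0, "GH": 0})

dag = build_dag("ABCDEFGHIJKL", prefix_dictionary)
for position, ends in dag.items():
    print(f"{position}:{ends}")
```

Position 2 shows the role of the zero-frequency prefix: "CD" is not a word, but because it is in the prefix dictionary the scan continues and finds "CDE", giving 2:[2, 4].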

As another example, the directed acyclic graph constructed for "measurement of plant protein" is shown in FIG. 3. The internal data structure of the directed acyclic graph is as follows:

0:[0,1,3]

1:[1]

2:[2,3]

3:[3]

4:[4]

5:[5,6]

6:[6]

wherein 0:[0,1,3] represents the first character of "plant" alone, the word "plant", or the word "plant protein".

After the directed acyclic graph is obtained, a sentence has multiple paths from beginning to end, and the multiple paths indicate multiple word segmentation modes, such as:

"Measurement of crude fiber in plant food":

Word segmentation mode 1: plant/class/food/in/crude fiber/of/measurement

Word segmentation mode 2: plant/food-like product/in/crude/fiber/of/measurement

Word segmentation mode 3: plant (first character)/object/class/food/in/crude fiber/of/measurement

...

"Measurement of plant protein":

Word segmentation mode 1: plant (first character)/object/protein/of/measurement

Word segmentation mode 2: plant/protein/of/measurement

Word segmentation mode 3: plant protein/of/measurement

...
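How the paths multiply can be seen by enumerating them. The sketch below walks the seven-position graph given above for "measurement of plant protein"; the function `enumerate_paths` is a hypothetical helper and the letters A to G are placeholders for the seven characters.

```python
def enumerate_paths(dag, start=0):
    """Yield every path through the graph as a list of (start, end) spans."""
    if start == len(dag):
        # Past the last position: one complete (empty) continuation.
        yield []
        return
    for end in dag[start]:
        for rest in enumerate_paths(dag, end + 1):
            yield [(start, end)] + rest


dag = {0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3], 4: [4], 5: [5, 6], 6: [6]}
sentence = "ABCDEFG"  # placeholder letters, one per character position

segmentations = [
    "/".join(sentence[start:end + 1] for start, end in path)
    for path in enumerate_paths(dag)
]
print(len(segmentations))
for segmentation in segmentations:
    print(segmentation)
```

Even this short string has ten distinct segmentations, and the count grows quickly with sentence length, which is why exhaustive enumeration is normally replaced by a search that reuses partial results.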

Among the many possible paths, the one with the greatest likelihood probability must be identified, since it gives the best word segmentation result. The calculation is performed by dynamic programming.

The weight of each arrow in the directed acyclic graph is the word frequency of the individual word divided by the total word frequency in the dictionary. The individual word frequency comes from the second column of the online prefix dictionary, and the total word frequency is the sum of the second column of the online prefix dictionary. To avoid the weight value being too small, the natural logarithm of the weight is taken:

$w_i = \ln\left(f_i / f_{all}\right)$

The weight of a word segmentation path is the sum of the weights of the words on the path:

$route = w_1 + w_2 + \dots + w_n = \sum_i w_i$

The maximum likelihood probability is obtained as the maximum of the weights of all the word segmentation paths.

Taking "measurement of plant protein" as an example, the weights and maximum likelihood probability results based on the directed acyclic graph of FIG. 3 are shown in FIG. 4.

When index=6, the best segmentation is the single character "fixed".

When index=5, the best segmentation is "measurement", because the natural logarithm of the probability of "measurement" as one word, -10.43, is greater than that of the two-character split "measure/fixed", -18.51.

When index=4, the best segmentation is "of/measurement", because "measurement" is already the optimal segmentation for the subsequent nodes.

When index=3, the best segmentation is "white/of/measurement".

When index=2, the best segmentation is "protein/of/measurement", because the natural logarithm of its probability, -26.59, is greater than that of "egg/white/of/measurement", -33.82.

...

When index=0, the best segmentation is "plant protein/of/measurement", because the natural logarithm of its probability, -32.49, is greater than that of "plant/protein/of/measurement", -35.55, and also greater than that of "plant (first character)/object/protein/of/measurement", -47.03.

Conclusion: the best segmentation is "plant protein/of/measurement", with a natural logarithm of probability of -32.49.

Further, taking "measurement of crude fiber in plant food" as an example, the weights and maximum likelihood probability results based on the directed acyclic graph of FIG. 2 are shown in FIG. 5. The best segmentation is "plant/food-like product/in/crude fiber/of/measurement", with a natural logarithm of probability of -63.77.

Third embodiment

As shown in fig. 6, the present embodiment provides a maximum likelihood based scientific vocabulary word segmentation system for performing the maximum likelihood based scientific vocabulary word segmentation method as in the first embodiment, comprising:

an offline dictionary acquisition module 1, configured to acquire an offline dictionary in the same field as the scientific and technological vocabulary input character string to be segmented, wherein the offline dictionary comprises two columns, a word and a word frequency;

a prefix dictionary construction module 2, configured to construct, in a memory, a prefix dictionary for the input character string to be segmented based on the offline dictionary;

a directed acyclic graph construction module 3, configured to segment the input character string to be segmented based on the prefix dictionary to construct a directed acyclic graph;

and a word segmentation result output module 4, configured to acquire all word segmentation paths of the input character string to be segmented based on the directed acyclic graph, and to take the word segmentation path with the maximum likelihood probability among the word segmentation paths as the word segmentation result of the input character string.
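The division into four modules could map onto a single class along the following lines. This is a hypothetical arrangement, not the structure disclosed in FIG. 6: the class name `MaxLikelihoodSegmenter`, its method names, and the toy dictionary are invented for illustration, and counting a zero or missing frequency as 1 is an added assumption to avoid log(0).

```python
import math


class MaxLikelihoodSegmenter:
    """Hypothetical wrapper whose parts mirror modules 1 to 4."""

    def __init__(self, offline_dictionary):
        # Module 1: the offline dictionary maps each word to its word frequency.
        self.offline_dictionary = dict(offline_dictionary)
        # Module 2: each proper prefix of a dictionary word is stored with
        # frequency 0 unless it is itself a dictionary word.
        self.prefix_dictionary = {}
        self.total_frequency = 0
        for word, frequency in self.offline_dictionary.items():
            self.prefix_dictionary[word] = frequency
            self.total_frequency += frequency
            for length in range(1, len(word)):
                self.prefix_dictionary.setdefault(word[:length], 0)

    def build_dag(self, sentence):
        # Module 3: start position -> end positions of words with frequency > 0.
        # For brevity every end position is tried; a prefix-guided scan could stop early.
        dag = {}
        for start in range(len(sentence)):
            ends = [
                end
                for end in range(start, len(sentence))
                if self.prefix_dictionary.get(sentence[start:end + 1], 0) > 0
            ]
            dag[start] = ends or [start]  # unknown character: one-character word
        return dag

    def segment(self, sentence):
        # Module 4: right-to-left search for the path with the largest log weight.
        dag = self.build_dag(sentence)
        log_total = math.log(self.total_frequency)
        best = {len(sentence): (0.0, 0)}
        for start in range(len(sentence) - 1, -1, -1):
            best[start] = max(
                (
                    math.log(self.prefix_dictionary.get(sentence[start:end + 1]) or 1)
                    - log_total
                    + best[end + 1][0],
                    end,
                )
                for end in dag[start]
            )
        words, position = [], 0
        while position < len(sentence):
            end = best[position][1]
            words.append(sentence[position:end + 1])
            position = end + 1
        return words


# Toy dictionary with invented frequencies.
segmenter = MaxLikelihoodSegmenter({"a": 50, "b": 50, "c": 50, "ab": 40, "abc": 30})
print(segmenter.segment("abc"))
print(segmenter.segment("cab"))
```

Building the prefix dictionary once in the constructor matches the idea of loading the offline dictionary into memory and then reusing it for every input string.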

A computer readable storage medium stores computer code which, when executed, performs the method described above. Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), magnetic disk, optical disk, and the like.

The above description is only a preferred embodiment of the present application, and the protection scope of the present application is not limited to the above examples; all technical solutions within the concept of the present application fall within its protection scope. It should be noted that modifications and adaptations made by those skilled in the art without departing from the principles of the present application are also intended to fall within the scope of the present application.

The technical features of the above-described embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.

It should be noted that the above embodiments can be freely combined as needed.

Claims

1. A scientific and technological vocabulary word segmentation method based on maximum likelihood probability is characterized by comprising the following steps:

2. The scientific and technological vocabulary word segmentation method based on maximum likelihood probability according to claim 1, further comprising, prior to step S1, establishing the offline dictionary, specifically:

3. The scientific and technological vocabulary word segmentation method based on maximum likelihood probability according to claim 1, wherein in step S2, the prefix dictionary of the scientific and technological vocabulary input character string to be segmented is constructed in a memory based on the offline dictionary, specifically:

S22: acquiring, for each word, all prefixes of that word;

4. The scientific and technological vocabulary word segmentation method based on maximum likelihood probability according to claim 1, wherein in step S3, the scientific and technological vocabulary input character string to be segmented is segmented based on the prefix dictionary to construct the directed acyclic graph, specifically:

the internal structure of the directed acyclic graph is as follows:

$0: [q_1, q_2, \ldots, q_n]$;

$1: [q_1, q_2, \ldots, q_n]$;

...

$m-1: [q_1, q_2, \ldots, q_n]$;

5. The scientific and technological vocabulary word segmentation method based on maximum likelihood probability according to claim 1, wherein in step S4, all word segmentation paths of the scientific and technological vocabulary input character string to be segmented are acquired based on the directed acyclic graph, and the word segmentation path with the maximum likelihood probability among the word segmentation paths is calculated as the word segmentation result of the scientific and technological vocabulary input character string to be segmented, specifically:

6. The scientific and technological vocabulary word segmentation method based on maximum likelihood probability according to claim 5, wherein the weight of a word segmentation path is calculated as follows:

$\mathrm{route} = w_1 + w_2 + \cdots + w_n = \sum_{i=1}^{n} w_i$

7. The scientific and technological vocabulary word segmentation method based on maximum likelihood probability according to claim 5, wherein the maximum likelihood probability is obtained specifically as follows:

8. A scientific and technological vocabulary word segmentation system based on maximum likelihood probability, for performing the scientific and technological vocabulary word segmentation method based on maximum likelihood probability of any of claims 1-7, comprising:

9. A computer device comprising a memory and one or more processors, the memory having stored therein computer code that, when executed by the one or more processors, causes the one or more processors to perform the method of any of claims 1-7.

10. A computer readable storage medium storing computer code which, when executed, performs the method of any one of claims 1 to 7.