CN113971397A - Word segmentation method and device, electronic equipment and storage medium - Google Patents

Word segmentation method and device, electronic equipment and storage medium

Info

Publication number
CN113971397A
Authority
CN
China
Prior art keywords
word
candidate
words
probability
participle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010722115.7A
Other languages
Chinese (zh)
Inventor
孙莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202010722115.7A priority Critical patent/CN113971397A/en
Publication of CN113971397A publication Critical patent/CN113971397A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word segmentation method, a word segmentation device, electronic equipment and a storage medium. The method comprises the following steps: selecting a first target candidate word from a first group of candidate words, obtained by performing forward word segmentation on a text to be processed, according to a first parameter of the candidate words in the first group of candidate words, so as to obtain a first word segmentation result of the text to be processed; selecting a second target candidate word from a second group of candidate words, obtained by performing backward word segmentation on the text to be processed, according to a second parameter of the candidate words in the second group of candidate words, so as to obtain a second word segmentation result of the text to be processed; determining at least one pair of divergent words in the first and second word segmentation results; and determining, according to a set algorithm, one word from each pair of divergent words as the segmentation result of that pair, so as to obtain a third word segmentation result of the text to be processed.

Description

Word segmentation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a word segmentation method and apparatus, an electronic device, and a storage medium.
Background
At present, when a string-matching algorithm is used for word segmentation, the text is segmented according to a pre-configured dictionary and features. The processing effect on complex sentences is poor, which reduces the word segmentation accuracy for complex sentences.
Disclosure of Invention
In view of this, embodiments of the present invention provide a word segmentation method, apparatus, electronic device and storage medium, so as to at least solve the problem in the related art that the accuracy of segmenting complex sentences is reduced.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a word segmentation method, which comprises the following steps:
selecting a first target candidate word from a first group of candidate words according to a first parameter of the candidate words in the first group of candidate words of the text to be processed, and obtaining a first word segmentation result of the text to be processed, wherein the first group of candidate words is a group of candidate words obtained by segmenting the text to be processed according to the direction from front to back, and the first parameter is the accumulated word frequency probability of all candidate word combinations in the left adjacent words of one candidate word; the first target candidate word is a left adjacent word corresponding to the maximum first parameter of one candidate word;
selecting a second target candidate word from a second group of candidate words in the front-to-back direction according to a second parameter of the candidate words in the second group of candidate words of the text to be processed, to obtain a second word segmentation result of the text to be processed, wherein the second group of candidate words is a group of candidate words obtained by segmenting the text to be processed in the back-to-front direction, and the second parameter is the accumulated part-of-speech probability of all candidate word combinations in the right adjacent words of one candidate word; the second target candidate word is the right adjacent word corresponding to the largest second parameter of one candidate word;
determining at least one pair of divergent words in the first word segmentation result and the second word segmentation result, wherein a pair of divergent words comprises one word from the first word segmentation result and one word from the second word segmentation result, and a pair of divergent words corresponds to one word segmentation difference between the first word segmentation result and the second word segmentation result;
and determining, according to a set algorithm, one word from the pair of divergent words as the final segmentation result of the pair of divergent words, so as to obtain a third word segmentation result of the text to be processed.
In the foregoing scheme, the selecting, according to a first parameter of a candidate word in a first group of candidate words of the text to be processed, a first target candidate word from the first group of candidate words in a backward-forward direction to obtain a first word segmentation result of the text to be processed includes:
for each candidate word in the first group of candidate words, determining a word frequency probability corresponding to each candidate word in a set word frequency dictionary, and determining a first parameter corresponding to each candidate word according to the word frequency probability corresponding to each candidate word in the set word frequency dictionary;
and selecting a first target candidate word from the first group of candidate words according to the backward-forward direction based on the maximum first parameter of each candidate word to obtain a first word segmentation result of the text to be processed.
In the foregoing solution, the determining, according to a set algorithm, one word from the pair of divergent words as the final segmentation result of the pair of divergent words to obtain a third word segmentation result of the text to be processed includes:
determining a first word segmentation quantity of the first word segmentation result and a second word segmentation quantity of the second word segmentation result;
detecting whether the first word segmentation quantity is the same as the second word segmentation quantity, to obtain a detection result;
and selecting, by adopting a set algorithm corresponding to the detection result, one word from the pair of divergent words as the final segmentation result of the pair of divergent words, so as to obtain a third word segmentation result of the text to be processed.
In the foregoing solution, the selecting, by adopting a set algorithm corresponding to the detection result, one word from the pair of divergent words as the final segmentation result of the pair of divergent words to obtain a third word segmentation result of the text to be processed includes:
under the condition that the detection result represents that the first word segmentation quantity is the same as the second word segmentation quantity, determining a first probability and a second probability according to a set dictionary; the set dictionary stores the probability of connection between words; the first probability is the probability of connection between each divergent word from the first word segmentation result, among the at least one pair of divergent words, and its adjacent word; the second probability is the probability of connection between each divergent word from the second word segmentation result, among the at least one pair of divergent words, and its adjacent word;
selecting one word from each pair of divergent words in the at least one pair of divergent words as the corresponding segmentation result according to a comparison result of the first probability and the second probability;
and determining a third word segmentation result of the text to be processed according to the segmentation result corresponding to each pair of divergent words in the at least one pair of divergent words in the text to be processed.
In the foregoing aspect, the determining the first probability and the second probability according to the set dictionary includes:
according to the left adjacent word of each pair of divergent words in the at least one pair of divergent words, carrying out deletion processing on the set dictionary; the entries retained in the set dictionary after the deletion processing are those whose initial word is the same as the left adjacent word of each divergent word in the at least one pair of divergent words;
and determining a first probability and a second probability according to the set dictionary after the deletion processing.
In the foregoing solution, for each candidate word in the first group of candidate words, determining a word frequency probability corresponding to each candidate word in a set word frequency dictionary, where the method further includes:
under the condition that the candidate words in the first group of candidate words do not have corresponding word frequency probabilities in the word frequency dictionary, setting the word frequencies of the candidate words without the corresponding word frequency probabilities as set values; the set value is a value greater than 0.
In the foregoing solution, the determining of the second parameter of the candidate words in the second group of candidate words of the text to be processed includes:
for each candidate word in a second group of candidate words, determining a part-of-speech probability corresponding to each candidate word in a set part-of-speech dictionary, and determining a second parameter of each candidate word in the second group of candidate words according to the part-of-speech probability corresponding to each candidate word and the corresponding part-of-speech transition probability; the part-of-speech transition probability is a connection probability of parts-of-speech.
The embodiment of the present invention further provides a word segmentation apparatus, including:
the first determining unit is used for selecting a first target candidate word from a first group of candidate words according to a first parameter of the candidate words in the first group of candidate words of the text to be processed in a backward-forward direction to obtain a first word segmentation result of the text to be processed, wherein the first group of candidate words is a group of candidate words obtained by segmenting the text to be processed in the forward-backward direction, and the first parameter is accumulated word frequency probability of all candidate word combinations in left-adjacent words of one candidate word; the first target candidate word is a left adjacent word corresponding to the maximum first parameter of one candidate word;
a second determining unit, configured to select a second target candidate word from a second group of candidate words of the to-be-processed text according to a second parameter of the candidate words in the second group of candidate words in the to-be-processed text from a front-to-back direction, so as to obtain a second word segmentation result of the to-be-processed text, where the second group of candidate words is a group of candidate words obtained by segmenting the to-be-processed text according to the front-to-back direction, and the second parameter is an accumulated part-of-speech probability of all candidate word combinations in a right neighboring word of one candidate word; the second target candidate word is a right adjacent word corresponding to the largest second parameter of the candidate word;
a third determining unit, configured to determine at least one pair of divergent words in the first word segmentation result and the second word segmentation result, where a pair of divergent words includes one word from the first word segmentation result and one word from the second word segmentation result, and a pair of divergent words corresponds to one word segmentation difference between the first word segmentation result and the second word segmentation result;
and a fourth determining unit, configured to determine, according to a set algorithm, one word from the pair of divergent words as the final segmentation result of the pair of divergent words, so as to obtain a third word segmentation result of the text to be processed.
An embodiment of the present invention further provides an electronic device, including: a processor and a memory for storing a computer program capable of running on the processor,
wherein the processor is configured to perform the steps of any of the above methods when running the computer program.
An embodiment of the present invention further provides a storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any one of the above methods.
In the embodiment of the present invention, a first target candidate word is selected from a first group of candidate words according to a first parameter of the candidate words in the first group of candidate words of the text to be processed, and a first word segmentation result of the text to be processed is obtained. The first group of candidate words is a group of candidate words obtained by segmenting the text to be processed in the front-to-back direction, the first parameter is the accumulated word frequency probability of all candidate word combinations in the left adjacent words of one candidate word, and the first target candidate word is the left adjacent word corresponding to the maximum first parameter of one candidate word. A second target candidate word is selected from a second group of candidate words in the front-to-back direction according to a second parameter of the candidate words in the second group of candidate words of the text to be processed, and a second word segmentation result of the text to be processed is obtained. The second group of candidate words is a group of candidate words obtained by segmenting the text to be processed in the back-to-front direction, the second parameter is the accumulated part-of-speech probability of all candidate word combinations in the right adjacent words of one candidate word, and the second target candidate word is the right adjacent word corresponding to the maximum second parameter of one candidate word. At least one pair of divergent words in the first word segmentation result and the second word segmentation result is determined, where a pair of divergent words comprises one word from the first word segmentation result and one word from the second word segmentation result, and a pair of divergent words corresponds to one word segmentation difference between the first word segmentation result and the second word segmentation result. One word is determined from each pair of divergent words as the final segmentation result of that pair according to a set algorithm, so as to obtain a third word segmentation result of the text to be processed. Because two different word segmentation results of the text to be processed are determined through word frequency and part of speech, the places where the segmentations differ can be better located, the divergent words are determined and analyzed, and the segmentation results corresponding to the divergent words can be better processed, thereby improving the word segmentation processing effect on complex sentences and enhancing the accuracy of word segmentation.
Drawings
Fig. 1 is a schematic flow chart illustrating an implementation of the word segmentation method according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart illustrating an implementation of the word segmentation method according to another embodiment of the present invention.
Fig. 3 is a schematic flow chart illustrating an implementation of the word segmentation method according to another embodiment of the present invention.
Fig. 4 is a schematic flow chart illustrating an implementation of the word segmentation method according to another embodiment of the present invention.
Fig. 5 is a schematic flow chart illustrating an implementation of the word segmentation method according to another embodiment of the present invention.
FIG. 6 is a flowchart illustrating a process of performing word segmentation on a text to be processed according to an embodiment of the present invention
Fig. 7 is a schematic structural diagram of a word segmentation apparatus according to an embodiment of the present invention.
Fig. 8 is a schematic diagram of a hardware component structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
The technical means described in the embodiments of the present invention may be arbitrarily combined without conflict.
In addition, in the embodiments of the present invention, "first", "second", and the like are used for distinguishing similar objects, and are not necessarily used for describing a specific order or a sequential order.
Fig. 1 shows an implementation flow of the word segmentation method provided by the embodiment of the present invention. As shown in fig. 1, the method includes:
s101: selecting a first target candidate word from a first group of candidate words according to a first parameter of the candidate words in the first group of candidate words of the text to be processed, and obtaining a first word segmentation result of the text to be processed, wherein the first group of candidate words is a group of candidate words obtained by segmenting the text to be processed according to the direction from front to back, and the first parameter is the accumulated word frequency probability of all candidate word combinations in the left adjacent words of one candidate word; the first target candidate word is a left neighboring word corresponding to the largest first parameter of one candidate word.
The text to be processed is a corpus to be segmented. Word segmentation is performed on the text to be processed in the front-to-back direction (also called forward word segmentation) to obtain a first group of candidate words, and a first target candidate word is selected from the first group of candidate words according to the first parameter of each candidate word in the first group of candidate words, where the first parameter refers to the accumulated word frequency probability of all candidate word combinations in the left adjacent words of one candidate word, and the first target candidate word refers to the left adjacent word corresponding to the maximum first parameter of one candidate word. For example, if the text to be processed is "people of China", word segmentation is performed from the beginning of the sentence to the end of the sentence to obtain a first group of candidate words, which includes candidates such as "middle", "China" and "people". The first parameter of each candidate word in the first group of candidate words is then determined; the first parameter represents the accumulated word frequency probability of all candidate combinations in the left adjacent words of the candidate word. For the candidate word "people", for example, the candidate combinations in its left adjacent words include "middle/people" and "China/people", and the accumulated word frequency probabilities of these combinations are recorded. The left adjacent word corresponding to the largest first parameter of each candidate word is then selected from the first group of candidate words in the back-to-front direction; for example, if the left adjacent word corresponding to the largest first parameter of the candidate word "people" is "China", then "China" is the first target candidate word, and the first word segmentation result of the text to be processed is obtained from the first target candidate word determined for each candidate word. In practical applications, the text to be processed may be regarded as a word string S, and the candidate words in the first group of candidate words determined in the front-to-back direction may be denoted w1, w2, …, wn, where n is the number of segmented words obtained for the text to be processed in the front-to-back direction.
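As an illustration of the forward candidate-word generation described above, the following Python sketch enumerates the first group of candidate words by scanning the text from front to back and collecting every dictionary entry that starts at each position. It is a minimal sketch and not the implementation of this application; the toy dictionary, the function name and the maximum word length are assumptions.

```python
def forward_candidates(text, word_dict, max_word_len=4):
    """Enumerate (start, word) candidates by scanning front-to-back."""
    candidates = []
    for start in range(len(text)):
        for end in range(start + 1, min(start + max_word_len, len(text)) + 1):
            piece = text[start:end]
            # keep unknown single characters so every position stays covered
            if piece in word_dict or end - start == 1:
                candidates.append((start, piece))
    return candidates

# assumed toy dictionary for a text like the "people of China" example
demo_dict = {"中", "中华", "华", "华人", "人", "人民", "民"}
print(forward_candidates("中华人民", demo_dict))
```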
In an embodiment, as shown in fig. 2, the selecting, according to a first parameter of a candidate word in a first group of candidate words of the text to be processed, a first target candidate word from the first group of candidate words in a backward-forward direction to obtain a first segmentation result of the text to be processed includes:
s201: and determining the corresponding word frequency probability of each candidate word in the set word frequency dictionary for each candidate word in the first group of candidate words, and determining the corresponding first parameter of each candidate word according to the corresponding word frequency probability of each candidate word in the set word frequency dictionary.
Here, the word frequency dictionary records the word frequency probability of each word. By searching the word frequency dictionary, the word frequency probability corresponding to each candidate word in the first group of candidate words can be determined, and the first parameter corresponding to each candidate word in the first group of candidate words is determined according to the word frequency probability corresponding to each candidate word. Illustratively, taking the text to be processed as "people of China" as an example, the word frequency dictionary may give the word frequency probability of "middle" as P1, that of "China" as P2, that of "Hua" as P3, that of "Chinese" as P4, that of "human" as P5, that of "people" as P6, and that of the remaining candidate word as P7. When the first parameter of a candidate word is calculated, it is calculated from the word frequency probability of the candidate word and the word frequency probabilities corresponding to all candidate word combinations in its left adjacent words; for example, the accumulated word frequency probability of "people" and its left adjacent word "China" is obtained by an operation on P6 and P2. In practical applications, the word frequency dictionary is built by counting the word frequency of each word over different test sets, such as a daily newspaper image-text database, and by collecting data such as network keywords and hot words and combining them with the test sets. Because the amount of collected data is huge, some words have a low word frequency and thus a very small word frequency probability, so numerical underflow can occur during the computation and the corresponding value cannot be determined. To avoid this, the obtained word frequency probability can be processed logarithmically, with the specific expression P = log((count + 1) / (max_count + 1)), where P represents the word frequency probability corresponding to each word, count represents the word frequency corresponding to each word, and max_count represents the total amount of all collected data. When the word frequency probability is represented in the log domain, the word frequency probability of a candidate word and the word frequency probability of its left adjacent word are added to obtain the first parameter, which reduces the amount of computation and improves the operation speed. In practical applications, the word frequency dictionary can be updated regularly, and the statistical data of the set word frequency dictionary can be updated with collected network data, network hot words and other data. The update can be performed at a set interval, for example monthly, or the set word frequency dictionary can be updated automatically after the user sends an update instruction.
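The log-domain word frequency probability described above can be sketched as follows; only the formula P = log((count + 1) / (max_count + 1)) is taken from the text, while the counts, words and corpus size are made-up illustrative values.

```python
import math

def log_word_prob(count, max_count):
    # P = log((count + 1) / (max_count + 1)); the +1 also acts as the
    # Laplace-style floor mentioned later for unseen words
    return math.log((count + 1) / (max_count + 1))

word_counts = {"中华": 1200, "人民": 5300, "中": 9000}   # assumed counts
max_count = 100_000                                      # assumed data total
log_probs = {w: log_word_prob(c, max_count) for w, c in word_counts.items()}

# in the log domain the accumulated word frequency probability of a candidate
# combination is a sum instead of a product
combo_score = log_probs["中华"] + log_probs["人民"]
print(combo_score)
```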
S202: and selecting a first target candidate word from the first group of candidate words according to the backward-forward direction based on the maximum first parameter of each candidate word to obtain a first word segmentation result of the text to be processed.
After determining the first parameter of each candidate word in the first group of candidate words, selecting a first target candidate word from the first group of candidate words according to the backward-forward direction, and obtaining a first word segmentation result of the text to be processed. For example, taking the text to be processed as "people of china" as an example, when selecting words from the backward to forward direction, starting with "people", assuming that the accumulated word frequency probability of the candidate word combination of "people" is determined to be the maximum according to the maximum first parameter of "people", taking "people" as a first target candidate word, then determining the left neighboring word of "hua" according to the first parameter corresponding to "hua", and connecting the determined first target candidate words, so as to obtain the first word segmentation result of the text to be processed. In practical applications, when the first target candidate word is selected, the word length of the first target candidate word may also be determined, for example, when "people" is determined as the first target candidate word, the word length corresponding to "people" may be determined to be 2, when the next candidate word is determined from the backward direction, the word length corresponding to "people" is skipped, and the first target candidate word is selected for the candidate word "hua".
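One way S201 and S202 could be realized is the dynamic-programming sketch below: for each position, the best accumulated log word-frequency probability over the left-adjacent combinations is kept, and the target candidate words are then selected back-to-front. This is an interpretation under assumptions; the unknown-word floor, the maximum word length and the function name are not from the application.

```python
def segment_by_word_freq(text, log_probs, max_word_len=4, unknown_logp=-20.0):
    """Pick the segmentation with the maximum accumulated log word-frequency
    probability, then read it off back-to-front."""
    n = len(text)
    best = [float("-inf")] * (n + 1)   # best[i]: best score of text[:i]
    back = [0] * (n + 1)               # back[i]: start index of the word ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in log_probs:
                logp = log_probs[word]
            elif end - start == 1:
                logp = unknown_logp       # assumed floor for unseen single characters
            else:
                continue
            score = best[start] + logp    # accumulate in the log domain
            if score > best[end]:
                best[end], back[end] = score, start
    # back-to-front selection of the target candidate words
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))
```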
In the above embodiment, for each candidate word in the first group of candidate words, the word frequency probability corresponding to each candidate word in the set word frequency dictionary is determined, the first parameter corresponding to each candidate word is determined according to the word frequency probability corresponding to each candidate word in the set word frequency dictionary, and, based on the maximum first parameter of each candidate word, the first target candidate word is selected from the first group of candidate words in the back-to-front direction to obtain the first word segmentation result of the text to be processed. The word segmentation result corresponding to the text to be processed can thus be determined based on word frequency, so that the word segmentation result is more accurate and rigorous, which is beneficial to improving the recognition effect on segmented words and improving the effectiveness of word segmentation.
In an embodiment, for each candidate word in the first group of candidate words, determining a word frequency probability corresponding to each candidate word in a set word frequency dictionary, the method further includes:
under the condition that the candidate words in the first group of candidate words do not have corresponding word frequency probabilities in the word frequency dictionary, setting the word frequencies of the candidate words without the corresponding word frequency probabilities as set values; the set value is a value greater than 0.
Here, when the word frequency probability of each candidate word in the first group of candidate words is searched through the word frequency dictionary, a situation that the candidate word in the first group of candidate words does not have the corresponding word frequency probability in the word frequency dictionary may occur, in this case, the word frequency of the candidate word without the corresponding word frequency probability is set to be a set value, the set value is a numerical value larger than 0, after the word frequency of the candidate word without the corresponding word frequency probability in the word frequency dictionary is set to be the set value, the corresponding word frequency probability is calculated according to the total number of words collected in the word frequency dictionary and the corresponding word frequency. In practical applications, a laplacian smoothing technique may be used to ensure that the lowest word frequency of each candidate word in the first set of candidate words in the word frequency dictionary is 1.
In the above embodiment, when the candidate word in the first group of candidate words does not have the corresponding word frequency probability in the word frequency dictionary, the word frequency of the candidate word without the corresponding word frequency probability is set as the set value, and the set value is a numerical value greater than 0, so that the corresponding word segmentation result can be obtained even when the word frequency corresponding to the word does not exist in the word frequency dictionary, thereby determining the word segmentation segment, processing the word segmentation segment, and improving the accuracy of word segmentation.
S102: selecting a second target candidate word from a second group of candidate words in the front-to-back direction according to a second parameter of the candidate words in the second group of candidate words of the text to be processed, to obtain a second word segmentation result of the text to be processed, wherein the second group of candidate words is a group of candidate words obtained by segmenting the text to be processed in the back-to-front direction, and the second parameter is the accumulated part-of-speech probability of all candidate word combinations in the right adjacent words of one candidate word; the second target candidate word is the right adjacent word corresponding to the largest second parameter of one candidate word.
The method comprises the steps of performing word segmentation processing (also called backward word segmentation) on a text to be processed from back to front, determining different candidate words, obtaining a second group of candidate words, selecting a second target candidate word from the second group of candidate words according to the direction from front to back, and obtaining a second word segmentation result of the text to be processed, wherein a second parameter is the accumulated part-of-speech probability of all candidate word combinations in the right neighbor words of one candidate word, and the second target candidate word is the right neighbor word corresponding to the largest second parameter of one candidate word. Exemplarily, taking the text to be processed as "people of China" as an example, performing word segmentation on the text to be processed from back to front to obtain a second group of candidate words consisting of "people", "China" and "China", and determining a second parameter corresponding to each candidate word, wherein the second parameter can be obtained by accumulating the part-of-speech probabilities of the candidate words and the right neighboring words. After the second parameters corresponding to all candidate words are determined, from front to back, starting from the first candidate word at the beginning of the sentence, determining the right neighbor word corresponding to the maximum second parameter of the first candidate word as a second target candidate word, and connecting the second target candidate words to obtain a second word segmentation result of the text to be processed.
In an embodiment, the determining of the second parameter of the candidate words in the second group of candidate words of the text to be processed includes:
for each candidate word in a second group of candidate words, determining a part-of-speech probability corresponding to each candidate word in a set part-of-speech dictionary, and determining a second parameter of each candidate word in the second group of candidate words according to the part-of-speech probability corresponding to each candidate word and the corresponding part-of-speech transition probability; the part-of-speech transition probability is a connection probability of parts-of-speech.
Here, the part-of-speech probability of each candidate word in the second group of candidate words is determined through a set part-of-speech dictionary. The part-of-speech probability refers to the probability that a candidate word corresponds to each different part of speech; one candidate word may have multiple parts of speech, and the probability of each part of speech is different. For example, the word "action" may be a noun or a verb, and the part-of-speech probability corresponding to "action" refers to the probability that "action" is a noun and the probability that "action" is a verb. The second parameter of each candidate word in the second group of candidate words is determined according to the part-of-speech probability of each candidate word and the corresponding part-of-speech transition probability. In practical applications, when a candidate word has different parts of speech, the probability of connection between the candidate word and words of different parts of speech is different; the part of speech and the part-of-speech transition probability of the candidate word are combined to determine the accumulated part-of-speech probability corresponding to all candidate word combinations in the right adjacent words of each candidate word. For example, taking the text to be processed as "action organization" as an example, by querying the preset part-of-speech dictionary it may be determined that the probability that "action" is a verb is 0.3, the probability that "action" is a noun is 0.7, the probability that "organization" is a noun is 1, the transition probability from verb to noun is 0.6, and the transition probability from noun to noun is 0.1. The second parameter for the combination in which "action" is a noun followed by the right adjacent word "organization" is 0.7 × 0.1 × 1 = 0.07, while the second parameter for the combination in which "action" is a verb is 0.3 × 0.6 × 1 = 0.18. By comparing the second parameters, it may be determined that the combination in which "action" is a verb has the maximum value, so "organization" is the right adjacent word corresponding to the maximum second parameter of "action" and is determined as the second target candidate word. In practical applications, the position of a word in a sentence also has a certain influence on its part of speech. For example, in a complete sentence the beginning of the sentence is often a noun, and a verb is less common at the beginning of a sentence, so the probability that a noun is located at the beginning of a sentence is greater than the probability that a verb is located there. When determining the part of speech of each word in the word segmentation result in the second word-order direction, the position of the word in the sentence is therefore also considered. In practical applications, when the preset part-of-speech dictionary records the part-of-speech probability of a word, it includes the part-of-speech probability when the word is located at the beginning of the sentence and the part-of-speech probability when the word is located at the end of the sentence; the part-of-speech probability when the word is located at the beginning of the sentence can be denoted as BEG → n, which indicates the probability that a noun appears at the beginning of a sentence.
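The numeric example above can be reproduced directly. The part-of-speech probabilities (0.3, 0.7 and 1) and the transition probabilities (0.6 and 0.1) are taken from the text; the dictionary layout used to hold them is only an assumed illustration.

```python
# part-of-speech probabilities and transition probabilities from the example
pos_prob = {("action", "verb"): 0.3, ("action", "noun"): 0.7, ("organization", "noun"): 1.0}
trans_prob = {("verb", "noun"): 0.6, ("noun", "noun"): 0.1}

second_param_verb = pos_prob[("action", "verb")] * trans_prob[("verb", "noun")] * pos_prob[("organization", "noun")]
second_param_noun = pos_prob[("action", "noun")] * trans_prob[("noun", "noun")] * pos_prob[("organization", "noun")]

print(second_param_verb, second_param_noun)   # roughly 0.18 vs 0.07: the verb reading wins
```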
In practical applications, the part-of-speech index of each word is recorded in a set part-of-speech dictionary. The part-of-speech dictionary can be obtained through training: data is collected to generate a training set containing a large number of corpus sentences; the training set is analyzed and counted on the basis of a language technology platform; the part of speech corresponding to each word in the training set is calculated; each part of speech of each word is counted to obtain the part-of-speech index corresponding to each word; and the part-of-speech index of each word obtained by counting is stored in the part-of-speech dictionary.
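A minimal sketch of building such a part-of-speech dictionary from a tagged training set is shown below; the tagged input format and the function name are assumptions, and the tags themselves would come from a language technology platform as described above.

```python
from collections import Counter, defaultdict

def build_pos_dict(tagged_sentences):
    """tagged_sentences: iterable of [(word, tag), ...] lists."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    # normalise the per-word tag counts into part-of-speech probabilities
    return {
        word: {tag: c / sum(tag_counts.values()) for tag, c in tag_counts.items()}
        for word, tag_counts in counts.items()
    }
```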
In the above embodiment, for each candidate word in the second group of candidate words, a part-of-speech probability corresponding to each candidate word in the set part-of-speech dictionary is determined, and according to the part-of-speech probability corresponding to each candidate word and a corresponding part-of-speech transition probability, a second parameter of each candidate word in the second group of candidate words is determined, where the part-of-speech transition probability is a part-of-speech connection probability, and a word segmentation result can be determined according to the part-of-speech probability of the candidate word and the probability that the candidate word is connected with words of different parts-of-speech, and different probabilities corresponding to different positions of the candidate word in a sentence are also considered, so that the processing effect of the word segmentation result can be improved.
S103: determining at least one pair of divergent words in the first word segmentation result and the second word segmentation result, wherein a pair of divergent words comprises one word from the first word segmentation result and one word from the second word segmentation result, and a pair of divergent words corresponds to one word segmentation difference between the first word segmentation result and the second word segmentation result.
After the first word segmentation result and the second word segmentation result corresponding to the text to be processed are determined, the two results are compared and at least one pair of divergent words in the first word segmentation result and the second word segmentation result is determined, where each pair of divergent words corresponds to one difference between the first word segmentation result and the second word segmentation result. If the text to be processed is a simple short sentence, one pair of divergent words may exist between the first word segmentation result and the second word segmentation result corresponding to the text to be processed. If the text to be processed is a complex long sentence or a long text, multiple pairs of divergent words may exist between the first word segmentation result and the second word segmentation result. A pair of divergent words marks a place where the two results segment the text differently; for example, if the first word segmentation result is "having/opinion/divergence" and the second word segmentation result is "intention/seeing/divergence", then "having" and "intention" form one pair of divergent words and "opinion" and "seeing" form another. In practical applications, the first word segmentation result is obtained according to the word frequency probability of the text to be processed, and the second word segmentation result is obtained according to the part-of-speech probability of the text to be processed, so the two results are obtained from different probabilities; the places where the two word segmentation results differ in the text to be processed can therefore be found well, which is beneficial to improving the word segmentation processing effect.
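The comparison of the two segmentation results can be sketched as follows: walk both results over the same text and collect the spans where they split the characters differently, pairing the words position by position when both sides contain the same number of words (as in the "having"/"intention" and "opinion"/"seeing" example). The Chinese demo sentence is an assumption about the original example behind the translated words.

```python
def find_divergences(first_result, second_result):
    """Return the divergent pairs/spans between two segmentations of one text."""
    divergences = []
    i = j = 0
    while i < len(first_result) and j < len(second_result):
        if first_result[i] == second_result[j]:
            i += 1
            j += 1
            continue
        span_a, span_b = [first_result[i]], [second_result[j]]
        i += 1
        j += 1
        # extend the shorter side until both spans cover the same characters
        while sum(map(len, span_a)) != sum(map(len, span_b)):
            if sum(map(len, span_a)) < sum(map(len, span_b)):
                span_a.append(first_result[i])
                i += 1
            else:
                span_b.append(second_result[j])
                j += 1
        if len(span_a) == len(span_b):
            divergences.extend(list(zip(span_a, span_b)))   # pair word by word
        else:
            divergences.append((span_a, span_b))
    return divergences

print(find_divergences(["有", "意见", "分歧"], ["有意", "见", "分歧"]))
# -> [('有', '有意'), ('意见', '见')]
```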
S104: determining, according to a set algorithm, one word from the pair of divergent words as the final segmentation result of the pair of divergent words, so as to obtain a third word segmentation result of the text to be processed.
After at least one pair of divergent words is determined, the final segmentation result of each pair of divergent words is determined according to a set algorithm between the word corresponding to each pair of divergent words in the first word segmentation result and the word corresponding to the same pair in the second word segmentation result, and the third word segmentation result of the text to be processed is obtained. In practical applications, the divergent words can be processed through a bigram algorithm, so that the segmentation result corresponding to the divergent words is determined from the first word segmentation result or the second word segmentation result. For example, if the first word segmentation result is "having/opinion/divergence" and the second word segmentation result is "intention/seeing/divergence", there are two pairs of divergent words: the first pair is "having" and "intention", and the second pair is "opinion" and "seeing". The divergent words are processed by the bigram algorithm, after which the final segmentation result is selected from each pair of divergent words, and the final word segmentation result of the text to be processed is obtained according to the selected results. The bigram algorithm can determine the segmentation result for the pair "having" and "intention"; if the bigram algorithm determines that the segmentation results corresponding to the divergences are "having" and "opinion", then "having/opinion" is used as the segmentation result at the positions where the divergences occur, and the third word segmentation result of the text to be processed is determined according to the segmentation results at those positions.
In the above embodiment, a first target candidate word is selected from a first group of candidate words according to a first parameter of the candidate words in the first group of candidate words of the text to be processed, and a first word segmentation result of the text to be processed is obtained. The first group of candidate words is a group of candidate words obtained by segmenting the text to be processed in the front-to-back direction, the first parameter is the accumulated word frequency probability of all candidate word combinations in the left adjacent words of one candidate word, and the first target candidate word is the left adjacent word corresponding to the maximum first parameter of one candidate word. A second target candidate word is selected from a second group of candidate words in the front-to-back direction according to a second parameter of the candidate words in the second group of candidate words of the text to be processed, and a second word segmentation result of the text to be processed is obtained. The second group of candidate words is a group of candidate words obtained by segmenting the text to be processed in the back-to-front direction, the second parameter is the accumulated part-of-speech probability of all candidate word combinations in the right adjacent words of one candidate word, and the second target candidate word is the right adjacent word corresponding to the maximum second parameter of one candidate word. At least one pair of divergent words in the first word segmentation result and the second word segmentation result is determined, where a pair of divergent words comprises one word from the first word segmentation result and one word from the second word segmentation result, and a pair of divergent words corresponds to one word segmentation difference between the two results. One word is determined from each pair of divergent words as the final segmentation result of that pair according to a set algorithm, so as to obtain a third word segmentation result of the text to be processed. Because two different word segmentation results of the text to be processed are determined through word frequency and part of speech, the divergent words of the text to be processed can be better determined without increasing the computational complexity of the algorithm, the recognition of divergent words is enhanced, the divergent words are analyzed and better processed, and the segmentation results corresponding to the divergent words are confirmed, thereby improving the word segmentation processing effect on complex sentences and enhancing the accuracy of word segmentation.
In an embodiment, as shown in fig. 3, the determining, according to a set algorithm, one word from the pair of divergent words as the final segmentation result of the pair of divergent words to obtain a third word segmentation result of the text to be processed includes:
s301: determining a first word segmentation quantity of the first word segmentation result and a second word segmentation quantity of the second word segmentation result.
Here, when determining to process the divergent words, it is necessary to determine a first word segmentation quantity of the first word segmentation result and a second word segmentation quantity of the second word segmentation result, where the word segmentation quantity refers to a quantity of words divided in the word segmentation result. For example, the word segmentation result of "having/opinion/diverging" corresponds to a word segmentation number of 3.
S302: and detecting whether the first participle quantity is the same as the second participle quantity to obtain a detection result.
After the first word segmentation quantity and the second word segmentation quantity are determined, whether the first word segmentation quantity and the second word segmentation quantity are the same is detected to obtain a detection result. There are two kinds of detection result: the first word segmentation quantity is the same as the second word segmentation quantity, or the first word segmentation quantity is different from the second word segmentation quantity. For example, if the number of words corresponding to the first word segmentation result "having/opinion/divergence" is 3 and the number of words corresponding to the second word segmentation result "intention/seeing/divergence" is 3, the detection result obtained when the two are compared is that the first word segmentation quantity is the same as the second word segmentation quantity. If the number of words corresponding to the first word segmentation result is 5 and the number of words corresponding to the second word segmentation result is 4, the detection result obtained is that the first word segmentation quantity is different from the second word segmentation quantity.
S303: selecting, by adopting a set algorithm corresponding to the detection result, one word from the pair of divergent words as the final segmentation result of the pair of divergent words, so as to obtain a third word segmentation result of the text to be processed.
After the detection result is obtained, a set algorithm corresponding to the detection result is applied to the divergent words, one word is selected from each pair of divergent words as the final segmentation result of that pair, and the third word segmentation result of the text to be processed is obtained. A divergence may be caused by a different number of segmented words or by different word senses, and divergences arising from different causes are handled differently, so an appropriate processing method needs to be selected for the divergent words according to the detection result, as sketched below. When the detection result shows that the first word segmentation quantity is different from the second word segmentation quantity, the word segmentation result corresponding to the more suitable word segmentation quantity is selected as the third word segmentation result of the text to be processed according to the user's requirement; in practical applications, the result corresponding to the larger number of segmented words is usually selected as the segmentation result for a pair of divergent words, thereby completing the processing of the divergence. When the detection result shows that the first word segmentation quantity is the same as the second word segmentation quantity, an algorithm capable of handling word sense is selected for the divergent segment.
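The choice of algorithm according to the detection result can be sketched per divergent span as follows. This is a simplified reading of S302/S303, not the exact logic of the application; the function names and the per-span granularity are assumptions, and the word-sense comparison is deferred to a caller-supplied scoring function.

```python
def resolve_divergence(span_a, span_b, sense_score):
    """span_a / span_b: the alternative word lists for one divergent span."""
    if len(span_a) != len(span_b):
        # counts differ: the text notes the alternative with more words is usually kept
        return span_a if len(span_a) > len(span_b) else span_b
    # counts equal: fall back to a word-sense (connection-probability) comparison
    return span_a if sense_score(span_a) >= sense_score(span_b) else span_b
```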
In the above embodiment, the first word segmentation quantity of the first word segmentation result and the second word segmentation quantity of the second word segmentation result are determined, whether the first word segmentation quantity is the same as the second word segmentation quantity is detected to obtain a detection result, and a set algorithm corresponding to the detection result is adopted to select one word from each pair of divergent words as the final segmentation result of that pair, so as to obtain a third word segmentation result of the text to be processed.
In an embodiment, as shown in fig. 4, the selecting, by adopting a set algorithm corresponding to the detection result, one word from the pair of divergent words as the final segmentation result of the pair of divergent words to obtain a third word segmentation result of the text to be processed includes:
S401: under the condition that the detection result represents that the first word segmentation quantity is the same as the second word segmentation quantity, determining a first probability and a second probability according to a set dictionary; the set dictionary stores the probability of connection between words; the first probability is the probability of connection between each divergent word from the first word segmentation result, among the at least one pair of divergent words, and its adjacent word; the second probability is the probability of connection between each divergent word from the second word segmentation result, among the at least one pair of divergent words, and its adjacent word.
Here, when the detection result indicates that the first word segmentation quantity is the same as the second word segmentation quantity, the first probability and the second probability are determined according to a set dictionary, where the set dictionary records the connection probabilities between different words, the first probability is the connection probability between the divergent word from the first word segmentation result in a pair of divergent words and its adjacent word, and the second probability is the connection probability between the divergent word from the second word segmentation result in the same pair and its adjacent word. When the detection result indicates that the two quantities are the same, the divergence is caused by different word senses, and the word senses of the divergent words in the first word segmentation result and the second word segmentation result need to be analyzed so that the divergence can be handled accordingly. In practical applications, whether the word sense of a divergent word corresponds to a common usage can be judged through the connection probability between the divergent word and its adjacent word, and the connection probability corresponding to the divergent word can be obtained from the set dictionary. For example, the first word segmentation result "having/opinion/divergence" and the second word segmentation result "intention/seeing/divergence" correspond to two pairs of divergent words, of which the first pair is "having" and "intention"; for this pair, the first probability that "having" is connected with "opinion" can be obtained from the set dictionary, and the second probability that "intention" is connected with "seeing" can also be obtained from the set dictionary. In practical applications, the set dictionary can count the connections between words and the corresponding connection probabilities on the basis of the training set of the word frequency dictionary, and store the connection probabilities in the set dictionary. To speed up the computation, the connection probabilities can be converted into log-domain probability values.
S402: selecting one word from each pair of divergent words in the at least one pair of divergent words as the corresponding segmentation result according to the comparison result of the first probability and the second probability.
Here, after the first probability and the second probability corresponding to the at least one pair of divergent words are determined, the first probability corresponding to one divergent word in a pair is compared with the second probability corresponding to the other divergent word, and the segmentation result corresponding to the pair of divergent words is determined according to the comparison result. In practical applications, the word corresponding to the larger connection probability in a pair of divergent words is used as the segmentation result corresponding to that pair. A larger connection probability indicates that the usage of the divergent word is more prevalent, so that word is more likely to be taken as the processing result of the pair. For example, when the first probability that "having" is connected with "opinion" is greater than the second probability that "intention" is connected with "seeing", the segmentation result corresponding to the divergence is "having/opinion".
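The comparison of the first and second probabilities can be sketched as below, assuming a set dictionary that maps word pairs to log-domain connection probabilities; the dictionary values and the fallback floor are illustrative assumptions, not values from the application.

```python
bigram_logp = {("有", "意见"): -3.2, ("有意", "见"): -7.5}   # assumed log-domain values
UNKNOWN_LOGP = -20.0                                          # assumed fallback floor

def connection_score(left_word, right_word):
    return bigram_logp.get((left_word, right_word), UNKNOWN_LOGP)

first_prob = connection_score("有", "意见")    # reading from the first result
second_prob = connection_score("有意", "见")   # reading from the second result
chosen = ["有", "意见"] if first_prob >= second_prob else ["有意", "见"]
print(chosen)   # the alternative with the larger connection probability is kept
```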
S403: and determining a third participle result of the text to be processed according to the participle result corresponding to each pair of divergent words in the at least one pair of divergent words in the text to be processed.
Here, the third participle result of the text to be processed is determined according to the participle result corresponding to the at least one pair of divergent words in the text to be processed. Since the participle results of the other segments are the same apart from the divergent positions, once the participle result corresponding to the at least one pair of divergent words is determined, the third participle result can be obtained by combining it with the participle result of the non-divergent segments of the text to be processed. For example, for a text to be processed whose first participle result is "having/opinion/divergence" and whose second participle result is "intention/seeing/divergence", the set algorithm determines that the participle result corresponding to the first pair of divergent words, "having" and "intention", is "having", and that the participle result corresponding to the second pair of divergent words, "opinion" and "seeing", is "opinion"; combining the result "having/opinion" for the divergent positions with the result "divergence" for the non-divergent segment gives the third participle result "having/opinion/divergence".
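A minimal sketch of this assembling step for the equal-count case, using the same example; the `chosen` mapping stands in for the output of the set algorithm and is not an identifier from the patent.

```python
# Segments on which the two results agree are kept; each divergent position is
# replaced by the segment chosen by the set algorithm.
forward_result = ["有", "意见", "分歧"]
backward_result = ["有意", "见", "分歧"]
chosen = {0: "有", 1: "意见"}  # divergent position -> segment picked by the set algorithm

third_result = [
    chosen[i] if forward_result[i] != backward_result[i] else forward_result[i]
    for i in range(len(forward_result))
]
print(third_result)  # ['有', '意见', '分歧']
```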
In the above embodiment, when the detection result indicates that the number of the first participles is the same as the number of the second participles, a first probability and a second probability are determined according to the set dictionary, wherein the set dictionary stores the connection probabilities between participles, the first probability is the connection probability between each divergent word of the at least one pair in the first participle result and its adjacent word, and the second probability is the connection probability between each divergent word of the at least one pair in the second participle result and its adjacent word; one participle is then selected from each pair of divergent words as the corresponding participle result according to the comparison of the first probability and the second probability. In this way, the participle result corresponding to the divergent words can still be determined when the numbers of first and second participles are the same, and because it is determined from the connection probability between a divergent word and its adjacent word, the result better fits the word senses used in everyday language, which improves the accuracy of word segmentation and enhances the effect of the word segmentation processing.
In an embodiment, as shown in fig. 5, the determining the first probability and the second probability according to the set dictionary includes:
S501: according to the left adjacent word of each pair of divergent words in the at least one pair of divergent words, carrying out deletion processing on the set dictionary; and the words in the set dictionary after the deletion processing are the same as the segment initial words of the left adjacent words of each divergent word in the at least one pair of divergent words.
Here, since the set dictionary records the connection probabilities of a large number of words, directly searching the whole set dictionary for the connection probability between each divergent word of the at least one pair and its adjacent word takes a lot of time and slows down the word segmentation processing. The set dictionary is therefore pruned according to the left adjacent word of each divergent word in the at least one pair of divergent words, so that the initial words of the entries retained in the pruned set dictionary are the same as the left adjacent words of the divergent words. Because the left adjacent word of a divergent word in the first participle result differs from the left adjacent word of the corresponding divergent word in the second participle result, the pruned set dictionary contains both the entries whose initial word is the left adjacent word from the first participle result and the entries whose initial word is the left adjacent word from the second participle result. For example, if the left adjacent word of the divergent word "opinion" in the first participle result is "having", "having" is used as an initial word, and if the left adjacent word of the divergent word "seeing" in the second participle result is "intention", "intention" is used as an initial word; when the set dictionary is pruned, the connection probabilities of entries beginning with "having" and of entries beginning with "intention" are retained, and the connection probabilities of the other entries are deleted. In practical applications, the set dictionary can be pruned by means of a prefix tree, a hash table, or other structures, as sketched below.
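The sketch below illustrates the pruning under the assumption that the set dictionary maps (left word, right word) pairs to log-domain connection probabilities; the entries and helper names are invented for the example.

```python
# Keep only entries whose left word starts with the head of a left-adjacent
# word of some divergent pair -- "有" (having) or "有意" (intention) here.
set_dictionary = {
    ("有", "意见"): -0.47,
    ("有意", "见"): -3.91,
    ("分歧", "很大"): -1.20,   # unrelated entry that the pruning removes
}
left_neighbour_heads = ("有", "有意")

pruned = {
    pair: prob
    for pair, prob in set_dictionary.items()
    if pair[0].startswith(left_neighbour_heads)
}
print(pruned)  # the entry starting with "分歧" has been deleted
```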
S502: and determining a first probability and a second probability according to the set dictionary after the deletion processing.
Here, according to the set dictionary after the deletion processing, the first probability corresponding to each divergent word of the at least one pair of divergent words in the first participle result is determined, and the second probability corresponding to each divergent word of the at least one pair of divergent words in the second participle result is determined.
In the above embodiment, the set dictionary is pruned according to the left adjacent word of each pair of divergent words in the at least one pair of divergent words, so that the initial words of the entries retained in the pruned set dictionary are the same as the left adjacent words of the divergent words. This speeds up the processing of the data while still ensuring that the connection probabilities between the divergent words and their adjacent words can be obtained, thereby improving the efficiency of the word segmentation processing and preventing the word segmentation effect from being reduced by the over-fitting condition.
An application embodiment is further provided in the embodiments of the present invention, and as shown in fig. 6, fig. 6 is a schematic diagram illustrating a word segmentation process performed on a text to be processed.
S601: and acquiring the resource file. Here, the resource file comprises the dictionaries required for the word segmentation processing, namely a word frequency dictionary, a part-of-speech dictionary, and a dictionary recording the connection probabilities between words. The resource file is acquired in an offline state. When the resource file is generated, a sample set is first acquired; the sample set can consist of labelled test sets such as the 2014 People's Daily, MSR and PKU corpora together with data collected from the network, the network data including network keywords, hot words and the like. The word frequency probability corresponding to each word in the sample set is calculated, and the word frequency dictionary is formed by each word in the sample set and its corresponding word frequency probability. The part-of-speech dictionary can be generated with the open-source LTP tool: the part of speech of each word in the sample set is determined according to the 863 part-of-speech standard, the probability of each part of speech of each word is counted to obtain the part-of-speech probability, the transition probability between two parts of speech is counted, and the part-of-speech probabilities and the part-of-speech transition probabilities form the part-of-speech dictionary. When counting the transition probability between two parts of speech, the influence of the beginning and the end of a sentence on the part of speech is taken into account, the part of speech at the beginning of a sentence being represented by BEG and the part of speech at the end of a sentence by END. The dictionary recording the connection probabilities between words obtains the transition probability by counting the number of transitions between words in the sample set and stores it. When the word frequency probability, the part-of-speech probability, the part-of-speech transition probability and the transition probability are stored in the corresponding dictionaries, each probability is converted into a log-domain value by P = log((count + 1) / (max_count + 1)). A sketch of building the word frequency dictionary in this way is given below.
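The sketch below illustrates only the word-frequency part of this resource-file construction, using a toy segmented corpus in place of the labelled corpora; the normalisation follows the P = log((count + 1) / (max_count + 1)) formula above.

```python
import math
from collections import Counter

# Toy stand-in for the labelled, already-segmented sample set.
segmented_corpus = [["有", "意见", "分歧"], ["意见", "一致"], ["有", "分歧"]]

counts = Counter(word for sentence in segmented_corpus for word in sentence)
max_count = max(counts.values())

# Word frequency dictionary with log-domain probabilities.
word_frequency_dictionary = {
    word: math.log((count + 1) / (max_count + 1))
    for word, count in counts.items()
}
print(word_frequency_dictionary["有"])  # log(3/3) == 0.0
```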
S602: and loading the resource file. Here, the dictionary required for word segmentation is loaded into the memory by using a prefix tree method.
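A minimal prefix-tree sketch of this loading step, holding only word-frequency entries; a full implementation would also load the part-of-speech and connection-probability dictionaries, and the node layout shown here is an assumption rather than the patent's data structure.

```python
# Build a trie whose terminal nodes store the word's log-domain probability.
def build_trie(word_probabilities):
    root = {}
    for word, prob in word_probabilities.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#"] = prob  # "#" marks the end of a word and stores its value
    return root

def trie_lookup(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("#")

trie = build_trie({"有": -1.1, "有意": -2.3, "意见": -0.9, "分歧": -0.7})
print(trie_lookup(trie, "意见"))  # -0.9
```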
S603: According to the word frequency dictionary, selecting, in the backward-forward direction, a first target candidate word from a first group of candidate words obtained by forward word segmentation of the text to be processed, so as to obtain a first word segmentation result corresponding to the text to be processed. All candidate words of the text to be processed are taken from front to back to form the first group of candidate words, and the word frequency probability of each candidate word is determined in the word frequency dictionary. According to the word frequency probability of each candidate word, the accumulated word frequency probability of all candidate word combinations among the left adjacent words of each candidate word is calculated to obtain the first parameter of each candidate word. Then, going from back to front, the left adjacent word corresponding to the largest first parameter of each candidate word is selected from the first group of candidate words as the target candidate word, which yields the first word segmentation result. A sketch of this forward pass is given below.
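The sketch below is one way to realise this step: candidates are enumerated front to back against a toy word frequency dictionary, accumulated log probabilities are kept per position, and the best left neighbour is then traced from back to front. The out-of-vocabulary floor and the maximum word length are assumptions made for the example, not values from the patent.

```python
import math

word_frequency = {"有": -1.1, "有意": -2.3, "意见": -0.9, "见": -1.8, "分歧": -0.7}
oov_log_prob = -20.0  # floor for candidates missing from the dictionary
max_word_len = 4

def forward_segment(text):
    n = len(text)
    # best[i] = (best accumulated log probability ending at i, start of the last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            score = best[start][0] + word_frequency.get(word, oov_log_prob)
            if score > best[end][0]:
                best[end] = (score, start)
    # Back-to-front selection of the left neighbour with the largest parameter.
    result, end = [], n
    while end > 0:
        start = best[end][1]
        result.append(text[start:end])
        end = start
    return result[::-1]

print(forward_segment("有意见分歧"))  # ['有', '意见', '分歧']
```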
S604: And according to the part-of-speech dictionary, selecting, in the front-to-back direction and according to the second parameter, a second target candidate word from a second group of candidate words obtained by backward word segmentation of the text to be processed, so as to obtain a second word segmentation result corresponding to the text to be processed. All candidate words of the text to be processed are taken from back to front to form the second group of candidate words, and the part-of-speech probability of each candidate word is determined in the part-of-speech dictionary. According to the part-of-speech probability and the part-of-speech transition probability of each candidate word, the accumulated part-of-speech probability of all candidate word combinations among the right adjacent words of each candidate word is calculated to obtain the second parameter of each candidate word. Then, going from front to back, the right adjacent word corresponding to the largest second parameter of each candidate word is selected from the second group of candidate words as the target candidate word, which yields the second word segmentation result. A sketch of scoring one backward candidate is given below.
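The sketch below scores a single backward candidate against toy part-of-speech and transition tables; the tags, values and the scoring helper are illustrative assumptions, and a full implementation would accumulate these scores over all candidate combinations as described above.

```python
import math

# Illustrative tables: per-word part-of-speech log probabilities and
# log transition probabilities from a POS to the POS of the right neighbour.
pos_probability = {"见": {"v": -1.2, "n": -2.6}, "分歧": {"n": -0.4}}
pos_transition = {("v", "n"): -0.8, ("n", "n"): -1.5, ("BEG", "v"): -0.9}

def candidate_score(word, right_neighbour_pos):
    """Best POS probability of the word plus the transition towards its right neighbour."""
    best = -math.inf
    for pos, p in pos_probability.get(word, {"x": -20.0}).items():
        best = max(best, p + pos_transition.get((pos, right_neighbour_pos), -20.0))
    return best

# "见" followed by the noun "分歧": the verb reading scores higher here.
print(candidate_score("见", "n"))  # ≈ -2.0
```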
S605: at least one pair of divergent words is determined. Here, the participles at the positions where the first word segmentation result and the second word segmentation result differ are determined as divergent words, wherein a pair of divergent words includes one participle from the first word segmentation result and one participle from the second word segmentation result, and each pair of divergent words corresponds to one word segmentation difference between the first word segmentation result and the second word segmentation result. A sketch of locating such pairs is given below.
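A minimal sketch of locating the divergent pairs by walking both results by character offset; it assumes, as in the case discussed next, that the two results contain the same number of participles, and the helper name is invented for the example.

```python
def find_divergent_pairs(first_result, second_result):
    """Pair up segments whose boundaries or surface forms differ."""
    pairs, off_a, off_b = [], 0, 0
    for a, b in zip(first_result, second_result):
        if off_a != off_b or a != b:
            pairs.append((a, b))
        off_a += len(a)
        off_b += len(b)
    return pairs

print(find_divergent_pairs(["有", "意见", "分歧"], ["有意", "见", "分歧"]))
# [('有', '有意'), ('意见', '见')]
```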
S606: and determining a participle from each pair of divergent words as the final word segmentation result of that pair based on a set algorithm. Here, the processing chosen for the divergent words depends on the cause of the divergence. For a divergence caused by different numbers of participles, the participle taken from the word segmentation result with the larger number of participles can be used as the corresponding word segmentation result. If the divergence is caused by semantics, it can be handled with a bigram algorithm: specifically, the connection probability between the divergent word in the first word segmentation result and its adjacent word and the connection probability between the divergent word in the second word segmentation result and its adjacent word are obtained, and the divergent word with the higher connection probability is taken as the word segmentation result of the pair. A sketch of this dispatch is given below.
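A minimal sketch of the dispatch just described; the counts and connection probabilities would come from the earlier steps and are passed in directly here, and the function name is illustrative only.

```python
def resolve(pair, first_count, second_count, p_first, p_second):
    first_word, second_word = pair
    if first_count != second_count:
        # Counts differ: keep the participle from the result with more participles.
        return first_word if first_count > second_count else second_word
    # Counts match: bigram comparison of neighbour connection probabilities.
    return first_word if p_first >= p_second else second_word

print(resolve(("有", "有意"), 3, 3, -0.47, -3.91))  # '有'
```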
S607: and obtaining a third word segmentation result. Here, apart from the divergent words, the first word segmentation result and the second word segmentation result are the same; therefore, after the word segmentation result corresponding to the divergent words is determined, the third word segmentation result corresponding to the text to be processed can be obtained from that result together with the word segmentation results of the remaining participles in either the first or the second word segmentation result. The third word segmentation result is the final result obtained after the divergent segments have been processed.
In order to implement the word segmentation method according to the embodiment of the present invention, an embodiment of the present invention further provides a word segmentation apparatus, as shown in fig. 7, the word segmentation apparatus includes:
a first determining unit 701, configured to select, in a backward-forward direction, a first target candidate word from a first group of candidate words of a to-be-processed text according to a first parameter of the candidate words in the first group of candidate words, where the first group of candidate words is a group of candidate words obtained by segmenting the to-be-processed text according to a front-to-back direction, and the first parameter is an accumulated word frequency probability of all candidate word combinations in a left neighboring word of one candidate word, and obtain a first word segmentation result of the to-be-processed text; the first target candidate word is a left adjacent word corresponding to the maximum first parameter of one candidate word;
a second determining unit 702, configured to select a second target candidate word from a second group of candidate words of the to-be-processed text according to a second parameter of the candidate words in the second group of candidate words in a front-to-back direction, so as to obtain a second word segmentation result of the to-be-processed text, where the second group of candidate words is a group of candidate words obtained by performing word segmentation on the to-be-processed text according to the back-to-front direction, and the second parameter is an accumulated part-of-speech probability of all candidate word combinations in a right neighboring word of one candidate word; the second target candidate word is a right adjacent word corresponding to the largest second parameter of the candidate word;
a third determining unit 703, configured to determine at least one pair of divergent words in the first segmentation result and the second segmentation result, where a pair of divergent words includes one of the first segmentation result and one of the second segmentation result, and a pair of divergent words corresponds to a word segmentation difference between the first segmentation result and the second segmentation result;
a fourth determining unit 704, configured to determine a participle from the pair of participles as a participle result of the pair of participles according to a set algorithm, so as to obtain a third participle result of the to-be-processed text.
In an embodiment, the selecting, by the first determining unit 701, a first target candidate word from a first group of candidate words of a to-be-processed text according to a backward-forward direction according to a first parameter of a candidate word in the first group of candidate words to obtain a first word segmentation result of the to-be-processed text, where the selecting includes:
for each candidate word in the first group of candidate words, determining a word frequency probability corresponding to each candidate word in a set word frequency dictionary, and determining a first parameter corresponding to each candidate word according to the word frequency probability corresponding to each candidate word in the set word frequency dictionary;
and selecting a first target candidate word from the first group of candidate words according to the backward-forward direction based on the maximum first parameter of each candidate word to obtain a first word segmentation result of the text to be processed.
In an embodiment, the determining, by the fourth determining unit 704, a participle from the pair of divergent words according to a set algorithm as a final participle result of the pair of divergent words to obtain a third participle result of the to-be-processed text includes:
determining a first word segmentation quantity of the first word segmentation result and a second word segmentation quantity of the second word segmentation result;
detecting whether the first participle quantity is the same as the second participle quantity to obtain a detection result;
and selecting a participle from the pair of participles as a final participle result of the pair of participles by adopting a set algorithm corresponding to the detection result so as to obtain a third participle result of the text to be processed.
In an embodiment, the determining, by the fourth determining unit 704, a participle from the pair of divergent words using a set algorithm corresponding to the detection result as a final participle result of the pair of divergent words to obtain a third participle result of the to-be-processed text includes:
under the condition that the detection result represents that the number of the first participles is the same as that of the second participles, determining a first probability and a second probability according to a set dictionary; the set dictionary stores the probability of connection between the participles; the first probability is the probability of connection between each participle in the at least one participle in the first participle result and an adjacent word; the second probability is the probability of connection between each participle and adjacent words in at least one participle in the second participle result;
selecting a participle from each pair of divergent words in the at least one pair of divergent words as a corresponding participle result according to a comparison result of the first probability and the second probability;
and determining a third word segmentation result of the text to be processed according to the participle result corresponding to each pair of divergent words in the at least one pair of divergent words in the text to be processed.
In an embodiment, the determining, by the fourth determining unit 704, the first probability and the second probability according to the set dictionary includes:
according to the left adjacent word of each pair of divergent words in the at least one pair of divergent words, carrying out deletion processing on the set dictionary; the words in the set dictionary after the deletion processing are the same as the segment initial words of the left adjacent words of each divergent word in the at least one pair of divergent words;
and determining a first probability and a second probability according to the set dictionary after the deletion processing.
In an embodiment, the first determining unit 701 determines, for each candidate word in the first group of candidate words, a word frequency probability corresponding to each candidate word in a set word frequency dictionary, and the method further includes:
under the condition that the candidate words in the first group of candidate words do not have corresponding word frequency probabilities in the word frequency dictionary, setting the word frequencies of the candidate words without the corresponding word frequency probabilities as set values; the set value is a value greater than 0.
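For illustration, a minimal sketch of this fall-back: a candidate absent from the word frequency dictionary is given a small set value greater than 0 before the log-domain conversion, so its probability stays finite; the counts and the chosen set value are invented for the example.

```python
import math

word_counts = {"有": 120, "意见": 95, "分歧": 60}
set_value = 1  # any value greater than 0 works as the floor
max_count = max(word_counts.values())

def word_log_probability(word):
    count = word_counts.get(word, set_value)
    return math.log((count + 1) / (max_count + 1))

print(word_log_probability("云计算"))  # out-of-vocabulary word, still a finite score
```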
In an embodiment, the determining, by the second determining unit 702, the second parameter of the candidate words in the second group of candidate words of the text to be processed includes:
for each candidate word in a second group of candidate words, determining a part-of-speech probability corresponding to each candidate word in a set part-of-speech dictionary, and determining a second parameter of each candidate word in the second group of candidate words according to the part-of-speech probability corresponding to each candidate word and the corresponding part-of-speech transition probability; the part-of-speech transition probability is a connection probability of parts-of-speech.
In practical applications, the first determining unit 701, the second determining unit 702, the third determining unit 703 and the fourth determining unit 704 may be implemented by a processor in the word segmentation apparatus. Of course, the processor needs to run the program stored in the memory to realize the functions of the above-described program modules.
It should be noted that, when performing word segmentation, the word segmentation apparatus provided in the embodiment of fig. 7 is only illustrated by dividing the program modules, and in practical applications, the processing may be distributed to different program modules according to needs, that is, the internal structure of the apparatus is divided into different program modules to complete all or part of the processing described above. In addition, the word segmentation device and the word segmentation method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
Based on the hardware implementation of the program module, and in order to implement the method according to the embodiment of the present invention, an embodiment of the present invention further provides an electronic device, fig. 8 is a schematic diagram of a hardware composition structure of the electronic device according to the embodiment of the present invention, and as shown in fig. 8, the electronic device includes:
a communication interface 1 capable of information interaction with other devices such as network devices and the like;
and the processor 2 is connected with the communication interface 1 to realize information interaction with other equipment, and is used for executing the word segmentation method provided by one or more technical schemes when running a computer program. And the computer program is stored on the memory 3.
In practice, of course, the various components in the electronic device are coupled together by the bus system 4. It will be appreciated that the bus system 4 is used to enable connection communication between these components. The bus system 4 comprises, in addition to a data bus, a power bus, a control bus and a status signal bus. For the sake of clarity, however, the various buses are labeled as bus system 4 in fig. 8.
The memory 3 in the embodiment of the present invention is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on the electronic device. It will be appreciated that the memory 3 may be either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 3 described in the embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed by the above embodiment of the present invention can be applied to the processor 2, or implemented by the processor 2. The processor 2 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 2. The processor 2 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 2 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed by the embodiment of the invention can be directly implemented by a hardware decoding processor, or can be implemented by combining hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 3, and the processor 2 reads the program in the memory 3 and in combination with its hardware performs the steps of the aforementioned method.
When the processor 2 executes the program, the corresponding processes in the methods according to the embodiments of the present invention are realized, and for brevity, are not described herein again.
In an exemplary embodiment, the present invention further provides a storage medium, i.e. a computer storage medium, in particular a computer readable storage medium, for example comprising a memory 3 storing a computer program, which is executable by a processor 2 to perform the steps of the aforementioned method. The computer readable storage medium may be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, terminal and method may be implemented in other manners. The above-described device embodiments are only illustrative, for example, the division of the unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A method of word segmentation, comprising:
selecting a first target candidate word from a first group of candidate words according to a first parameter of the candidate words in the first group of candidate words of the text to be processed, and obtaining a first word segmentation result of the text to be processed, wherein the first group of candidate words is a group of candidate words obtained by segmenting the text to be processed according to the direction from front to back, and the first parameter is the accumulated word frequency probability of all candidate word combinations in the left adjacent words of one candidate word; the first target candidate word is a left adjacent word corresponding to the maximum first parameter of one candidate word;
selecting a second target candidate word from a second group of candidate words according to a second parameter of the candidate words in the second group of candidate words of the text to be processed in the front-to-back direction to obtain a second word segmentation result of the text to be processed, wherein the second group of candidate words is a group of candidate words obtained by segmenting the text to be processed in the back-to-front direction, and the second parameter is the accumulated part-of-speech probability of all candidate word combinations in right adjacent words of one candidate word; the second target candidate word is a right adjacent word corresponding to the largest second parameter of the candidate word;
determining at least one pair of divergent words in the first word segmentation result and the second word segmentation result, wherein the pair of divergent words comprises one of the first word segmentation result and one of the second word segmentation result, and the pair of divergent words corresponds to a participle difference between the first word segmentation result and the second word segmentation result;
and determining a participle from the pair of participles as a participle result of the pair of participles according to a set algorithm so as to obtain a third participle result of the text to be processed.
2. The word segmentation method according to claim 1, wherein the selecting a first target candidate word from the first group of candidate words according to a first parameter of a candidate word in the first group of candidate words of the text to be processed and a backward-forward direction to obtain a first word segmentation result of the text to be processed comprises:
for each candidate word in the first group of candidate words, determining a word frequency probability corresponding to each candidate word in a set word frequency dictionary, and determining a first parameter corresponding to each candidate word according to the word frequency probability corresponding to each candidate word in the set word frequency dictionary;
and selecting a first target candidate word from the first group of candidate words according to the backward-forward direction based on the maximum first parameter of each candidate word to obtain a first word segmentation result of the text to be processed.
3. The word segmentation method according to claim 1, wherein the determining a word from the pair of divergent words as a final word segmentation result of the pair of divergent words according to a set algorithm to obtain a third word segmentation result of the text to be processed includes:
determining a first word segmentation quantity of the first word segmentation result and a second word segmentation quantity of the second word segmentation result;
detecting whether the first participle quantity is the same as the second participle quantity to obtain a detection result;
and selecting a participle from the pair of participles as a final participle result of the pair of participles by adopting a set algorithm corresponding to the detection result so as to obtain a third participle result of the text to be processed.
4. The word segmentation method according to claim 3, wherein the determining a word segmentation from the pair of divergent words as a final word segmentation result of the pair of divergent words by using a set algorithm corresponding to the detection result to obtain a third word segmentation result of the text to be processed includes:
under the condition that the detection result represents that the number of the first participles is the same as that of the second participles, determining a first probability and a second probability according to a set dictionary; the set dictionary stores the probability of connection between the participles; the first probability is the probability of connection between each participle in the at least one participle in the first participle result and an adjacent word; the second probability is the probability of connection between each participle and adjacent words in at least one participle in the second participle result;
selecting a participle from each pair of divergent words in the at least one pair of divergent words as a corresponding participle result according to a comparison result of the first probability and the second probability;
and determining a third participle result of the text to be processed according to the participle result corresponding to each pair of divergent words in the at least one pair of divergent words in the text to be processed.
5. The word segmentation method according to claim 4, wherein the determining a first probability and a second probability according to the set dictionary comprises:
according to the left adjacent word of each pair of divergent words in the at least one pair of divergent words, carrying out deletion processing on the set dictionary; the words in the set dictionary after the deletion processing are the same as the segment initial words of the left adjacent words of each divergent word in the at least one pair of divergent words;
and determining a first probability and a second probability according to the set dictionary after the deletion processing.
6. The word segmentation method according to claim 2, wherein for each candidate word in the first set of candidate words, a word frequency probability of each candidate word in a set word frequency dictionary is determined, and the method further comprises:
under the condition that the candidate words in the first group of candidate words do not have corresponding word frequency probabilities in the word frequency dictionary, setting the word frequencies of the candidate words without the corresponding word frequency probabilities as set values; the set value is a value greater than 0.
7. The word segmentation method according to claim 1, wherein the determining of the second parameter of the candidate words in the second group of candidate words of the text to be processed comprises:
for each candidate word in a second group of candidate words, determining a part-of-speech probability corresponding to each candidate word in a set part-of-speech dictionary, and determining a second parameter of each candidate word in the second group of candidate words according to the part-of-speech probability corresponding to each candidate word and the corresponding part-of-speech transition probability; the part-of-speech transition probability is a connection probability of parts-of-speech.
8. A word segmentation device, comprising:
the first determining unit is used for selecting a first target candidate word from a first group of candidate words according to a first parameter of the candidate words in the first group of candidate words of the text to be processed in a backward-forward direction to obtain a first word segmentation result of the text to be processed, wherein the first group of candidate words is a group of candidate words obtained by segmenting the text to be processed in the forward-backward direction, and the first parameter is accumulated word frequency probability of all candidate word combinations in left-adjacent words of one candidate word; the first target candidate word is a left adjacent word corresponding to the maximum first parameter of one candidate word;
a second determining unit, configured to select a second target candidate word from a second group of candidate words of the to-be-processed text according to a second parameter of the candidate words in the second group of candidate words in the to-be-processed text from a front-to-back direction, so as to obtain a second word segmentation result of the to-be-processed text, where the second group of candidate words is a group of candidate words obtained by segmenting the to-be-processed text according to the front-to-back direction, and the second parameter is an accumulated part-of-speech probability of all candidate word combinations in a right neighboring word of one candidate word; the second target candidate word is a right adjacent word corresponding to the largest second parameter of the candidate word;
a third determining unit, configured to determine at least one pair of divergent words in the first segmentation result and the second segmentation result, where a pair of divergent words includes one of the first segmentation result and one of the second segmentation result, and a pair of divergent words corresponds to a word segmentation difference between the first segmentation result and the second segmentation result;
and the fourth determining unit is used for determining a participle from the pair of participles as a final participle result of the pair of participles according to a set algorithm so as to obtain a third participle result of the text to be processed.
9. An electronic device, comprising: a processor and a memory for storing a computer program capable of running on the processor,
wherein the processor is adapted to perform the steps of the method of any one of claims 1 to 7 when running the computer program.
10. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, performing the steps of the method of any one of claims 1 to 7.
CN202010722115.7A 2020-07-24 2020-07-24 Word segmentation method and device, electronic equipment and storage medium Pending CN113971397A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010722115.7A CN113971397A (en) 2020-07-24 2020-07-24 Word segmentation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010722115.7A CN113971397A (en) 2020-07-24 2020-07-24 Word segmentation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113971397A true CN113971397A (en) 2022-01-25

Family

ID=79585767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010722115.7A Pending CN113971397A (en) 2020-07-24 2020-07-24 Word segmentation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113971397A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004034378A1 (en) * 2002-10-08 2004-04-22 Matsushita Electric Industrial Co., Ltd. Language model creation/accumulation device, speech recognition device, language model creation method, and speech recognition method
CN108804642A (en) * 2018-06-05 2018-11-13 中国平安人寿保险股份有限公司 Search method, device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SIDOROV GRIGORI 等: "Syntactic n-grams as machine learning features for natural language processing", 《EXPERT SYSTEMS WITH APPLICATIONS》, vol. 41, no. 3, 15 February 2014 (2014-02-15), pages 853 - 860, XP055227956, DOI: 10.1016/j.eswa.2013.08.015 *
马力: "基于聚类分析的网络用户兴趣挖掘方法研究", 《中国博士学位论文全文数据库信息科技辑》, no. 12, 15 December 2014 (2014-12-15), pages 139 - 12 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination