CN114065739A

CN114065739A - Text word segmentation method and device, electronic equipment and computer readable medium

Info

Publication number: CN114065739A
Application number: CN202111341192.9A
Authority: CN
Inventors: 郑参吾
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2021-11-12
Filing date: 2021-11-12
Publication date: 2022-02-18

Abstract

Embodiments of the disclosure provide a text word segmentation method and device, electronic equipment, and a computer readable medium. One embodiment of the method comprises: creating a first initial thesaurus instance; and, for each target text corresponding to target task information, executing a thesaurus instance information replacement step: acquiring a thesaurus instance information sequence corresponding to a category information sequence, wherein each piece of category information in the category information sequence is category information corresponding to the target text; and sequentially replacing the thesaurus instance information of the first initial thesaurus instance, in the order of the thesaurus instance information in the sequence, to perform word segmentation on the target text. This implementation can perform word segmentation on text quickly and efficiently.

Description

Text word segmentation method and device, electronic equipment and computer readable medium

Technical Field

The embodiment of the disclosure relates to the technical field of computers, in particular to a text word segmentation method, a text word segmentation device, electronic equipment and a computer readable medium.

Background

In the field of natural language processing, text word segmentation splits a sentence or an article into individual words. Because every industry has many industry-specific terms, several word banks often need to be maintained in order to segment words accurately. To segment a target text using a plurality of word banks, the following method is generally adopted: while the target text is scanned, the word bank corresponding to each stage of the target text is loaded to perform word segmentation on that stage.

However, when the above method is used to perform word segmentation on the target text, the following technical problems often exist:

First, a word bank usually has to be loaded into memory before it can be used for word segmentation. Because Chinese word banks are generally large and segmenting a target text usually requires loading several of them, loading takes a long time, which lowers the efficiency of segmenting the target text.

Second, loading multiple word banks also occupies a large amount of memory resources.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Some embodiments of the present disclosure propose text segmentation methods, apparatuses, electronic devices and computer readable media to solve one or more of the technical problems mentioned in the background section above.

In a first aspect, some embodiments of the present disclosure provide a text word segmentation method, including: creating a first initial thesaurus instance; and, for each target text corresponding to target task information, executing a thesaurus instance information replacement step: acquiring a thesaurus instance information sequence corresponding to a category information sequence, wherein each piece of category information in the category information sequence is category information corresponding to the target text; and sequentially replacing the thesaurus instance information of the first initial thesaurus instance, in the order of the thesaurus instance information in the thesaurus instance information sequence, to perform word segmentation on the target text.

Optionally, before the step of performing text segmentation on each target text corresponding to the target task information, the method further includes: determining whether a target text to be subjected to word segmentation exists in a target text set acquired in advance; and in response to the fact that the target texts to be subjected to word segmentation exist in the target text set, screening at least one target text to be subjected to word segmentation from the target text set.

Optionally, the method further includes: performing data grouping processing on the at least one target text to obtain at least one target text group; and establishing task information corresponding to each target text group in the at least one target text group to obtain a task information set.

Optionally, the screening out at least one target text to be word-segmented from the target text set includes: acquiring a historical text set of segmented words in a target time period; determining the same texts between the target text set and the historical text set to obtain at least one same text; and removing the at least one same text from the target text set to obtain a removed target text set as the at least one target text.
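
As a rough illustration of the optional screening and grouping steps above, the following sketch removes texts that already appear in a historical set and then splits the remainder into groups, each paired with task information. The function names, the fixed group size, and the `task-N` identifier format are assumptions made only for illustration; the disclosure does not specify how groups are formed or how task information is encoded.

```python
from typing import Dict, Iterable, List


def screen_texts(target_texts: Iterable[str], history: Iterable[str]) -> List[str]:
    """Drop texts that were already segmented in the target time period, keeping order."""
    seen = set(history)
    return [text for text in target_texts if text not in seen]


def group_into_tasks(texts: List[str], group_size: int) -> Dict[str, List[str]]:
    """Split texts into consecutive groups and give each group a hypothetical task id."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    tasks: Dict[str, List[str]] = {}
    for start in range(0, len(texts), group_size):
        task_id = f"task-{start // group_size}"  # assumed identifier format
        tasks[task_id] = texts[start:start + group_size]
    return tasks


# Example: two of the five texts were segmented before and are screened out.
pending = screen_texts(["a", "b", "c", "d", "e"], history=["b", "d"])
tasks = group_into_tasks(pending, group_size=2)
```

Using a set for the historical texts keeps the membership check cheap when the history is large, while the list comprehension preserves the original order of the remaining texts.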

Optionally, sequentially replacing the thesaurus instance information of the first initial thesaurus instance, in the order of the thesaurus instance information in the thesaurus instance information sequence, to perform word segmentation on the target text includes: for each piece of thesaurus instance information in the thesaurus instance information sequence, executing the following target text word segmentation steps: changing the thesaurus instance information of the current first initial thesaurus instance into that thesaurus instance information to obtain a changed first initial thesaurus instance; initializing the changed first initial thesaurus instance; and using the initialized first initial thesaurus instance to perform word segmentation on the corresponding sub-text in the target text.

Optionally, acquiring the thesaurus instance information sequence corresponding to the category information sequence includes: acquiring, from a target variable, the thesaurus instance information sequence corresponding to the category information sequence, wherein the target variable is a variable sent by a thesaurus instance processing end, the target variable stores a thesaurus instance information set in a first preset key-value pair format, and each piece of thesaurus instance information in the thesaurus instance information set has corresponding category information.

Optionally, the thesaurus instance information set in the target variable is generated by the following steps: creating a second initial thesaurus instance; loading each word in at least one pre-acquired general word bank into the second initial thesaurus instance to obtain a first-loaded thesaurus instance; loading, based on the first-loaded thesaurus instance, each word in at least one category word bank to be loaded into the corresponding thesaurus instance in a target container to obtain second-loaded thesaurus instances, wherein the target container stores each thesaurus instance in a second preset key-value pair format; and determining the thesaurus instance information of each second-loaded thesaurus instance to obtain the thesaurus instance information set.

Optionally, loading, based on the first-loaded thesaurus instance, each word in the at least one category word bank to be loaded into the corresponding thesaurus instance in the target container to obtain the second-loaded thesaurus instances includes: for each word in the at least one category word bank to be loaded, executing the following word loading steps: determining category information of the word; determining whether a thesaurus instance keyed by the category information exists in the target container; in response to determining that a thesaurus instance keyed by the category information exists in the target container, loading the word into a target thesaurus instance in the target container, wherein the target thesaurus instance is the thesaurus instance keyed by the category information; and determining each loaded thesaurus instance as a second-loaded thesaurus instance.

Optionally, after determining whether a thesaurus instance keyed by the category information exists in the target container, the method further includes: in response to determining that no thesaurus instance keyed by the category information exists in the target container, creating in the target container a new key-value entry with the category information as the key and the first-loaded thesaurus instance as the value; and loading the word into the first-loaded thesaurus instance in that key-value entry.

Optionally, the thesaurus instance information is information with category information as a key and a target binary group (two-tuple) as a value, where the target binary group includes: word list information and word frequency list information.
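
One possible in-memory shape for such thesaurus instance information is sketched below: a mapping keyed by category information whose value is the target binary group of a word list and a word frequency list. The type name `InstanceInfo`, the helper `make_instance_info`, the choice to keep the two lists index-aligned, and the sample words and frequencies are all assumptions for illustration and are not taken from the disclosure.

```python
from typing import Dict, List, NamedTuple


class InstanceInfo(NamedTuple):
    """Hypothetical container for the target binary group."""
    words: List[str]        # word list information
    frequencies: List[int]  # word frequency list information, aligned with words


def make_instance_info(word_freq: Dict[str, int]) -> InstanceInfo:
    """Build index-aligned word and frequency lists from a word -> frequency mapping."""
    words = sorted(word_freq)
    return InstanceInfo(words, [word_freq[word] for word in words])


# Category information is the key; the binary group is the value.
instance_info_set: Dict[str, InstanceInfo] = {
    "phones": make_instance_info({"smartphone": 120, "charger": 45}),
    "books": make_instance_info({"paperback": 30}),
}
```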

In a second aspect, some embodiments of the present disclosure provide a text word segmentation apparatus, including: a creating unit configured to create a first initial thesaurus instance; and an execution unit configured to execute, for each target text corresponding to target task information, a thesaurus instance information replacement step: acquiring a thesaurus instance information sequence corresponding to a category information sequence, wherein each piece of category information in the category information sequence is category information corresponding to the target text; and sequentially replacing the thesaurus instance information of the first initial thesaurus instance, in the order of the thesaurus instance information in the thesaurus instance information sequence, to perform word segmentation on the target text.

Optionally, the apparatus further comprises: determining whether a target text to be subjected to word segmentation exists in a target text set acquired in advance; and in response to the fact that the target texts to be subjected to word segmentation exist in the target text set, screening at least one target text to be subjected to word segmentation from the target text set.

Optionally, the apparatus further comprises: performing data grouping processing on the at least one target text to obtain at least one target text group; and establishing task information corresponding to each target text group in the at least one target text group to obtain a task information set.

Optionally, the apparatus further comprises: acquiring a historical text set of segmented words in a target time period; determining the same texts between the target text set and the historical text set to obtain at least one same text; and removing the at least one same text from the target text set to obtain a removed target text set as the at least one target text.

Optionally, the execution unit may be configured to: acquire, from a target variable, the thesaurus instance information sequence corresponding to the category information sequence, wherein the target variable is a variable sent by a thesaurus instance processing end, the target variable stores a thesaurus instance information set in a first preset key-value pair format, and each piece of thesaurus instance information in the thesaurus instance information set has corresponding category information.

Optionally, the execution unit may be configured to: for each piece of thesaurus instance information in the thesaurus instance information sequence, execute the following target text word segmentation steps: changing the thesaurus instance information of the current first initial thesaurus instance into that thesaurus instance information to obtain a changed first initial thesaurus instance; initializing the changed first initial thesaurus instance; and using the initialized first initial thesaurus instance to perform word segmentation on the corresponding sub-text in the target text.

Optionally, the loading unit is configured to: for each word in the at least one category word bank to be loaded, execute the following word loading steps: determining category information of the word; determining whether a thesaurus instance keyed by the category information exists in the target container; in response to determining that a thesaurus instance keyed by the category information exists in the target container, loading the word into a target thesaurus instance in the target container, wherein the target thesaurus instance is the thesaurus instance keyed by the category information; and determining each loaded thesaurus instance as a second-loaded thesaurus instance.

Optionally, the apparatus is further configured to: in response to determining that no thesaurus instance keyed by the category information exists in the target container, create in the target container a new key-value entry with the category information as the key and the first-loaded thesaurus instance as the value; and load the word into the first-loaded thesaurus instance in that key-value entry.

Optionally, the thesaurus instance information includes: word list information corresponding to the thesaurus instance and word frequency list information corresponding to the thesaurus instance.

In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.

In a fourth aspect, some embodiments of the disclosure provide a computer readable medium having a computer program stored thereon, where the program when executed by a processor implements a method as described in any of the implementations of the first aspect.

The above embodiments of the present disclosure have the following beneficial effects: the text word segmentation method of some embodiments of the disclosure can segment the target text quickly and efficiently. Specifically, the target text cannot otherwise be segmented efficiently because Chinese word banks are generally large and segmenting a target text usually requires loading several of them, so loading takes a long time and segmentation efficiency is low. In addition, loading multiple word banks occupies a large amount of memory, which further lowers the efficiency of subsequent word segmentation. Based on this, the text word segmentation method of some embodiments of the present disclosure first creates a first initial thesaurus instance, which serves as the base thesaurus instance from which the changed first initial thesaurus instances are subsequently obtained. Then, for each target text corresponding to the target task information, a thesaurus instance information replacement step is executed. A thesaurus instance information sequence corresponding to the category information sequence is acquired for subsequently replacing the thesaurus instance information of the first initial thesaurus instance, where each piece of category information in the category information sequence is category information corresponding to the target text. The thesaurus instance information of the first initial thesaurus instance is then replaced sequentially, in the order of the thesaurus instance information in the sequence, to perform word segmentation on the target text. By changing its thesaurus instance information, the first initial thesaurus instance can be converted into the thesaurus instance corresponding to a given piece of category information. The thesaurus instance corresponding to each piece of category information therefore does not need to be loaded; only the thesaurus instance information of the first initial thesaurus instance needs to be switched for the corresponding thesaurus instance to be usable in subsequent word segmentation, which avoids the long loading time caused by loading multiple word banks. As a side effect, this also addresses the problem that loading multiple word banks occupies a large amount of memory.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.

FIG. 1 is a schematic diagram of one application scenario of a text word segmentation method according to some embodiments of the present disclosure;

FIG. 2 is a flow diagram of some embodiments of a text word segmentation method according to the present disclosure;

FIG. 3 is a flow diagram of further embodiments of a text word segmentation method according to the present disclosure;

FIG. 4 is a schematic diagram of grouping at least one target text in some embodiments of a text word segmentation method according to the present disclosure;

FIG. 5 is a schematic structural diagram of some embodiments of a text word segmentation apparatus according to the present disclosure;

FIG. 6 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be noted that, for convenience of description, only the portions related to the relevant invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

It is noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art will understand them to mean "one or more" unless the context clearly dictates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

FIG. 1 is a schematic diagram of one application scenario of a text word segmentation method according to some embodiments of the present disclosure.

In the application scenario of fig. 1, the electronic device 101 may first create a first initial thesaurus instance 102. Then, for each target text corresponding to the target task information 103, the electronic device 101 performs a thesaurus instance information replacement step. In this application scenario, the text set 104 corresponding to the target task information 103 includes: target text 1041, target text 1042, and target text 1043.

In the first step, a thesaurus instance information sequence 106 corresponding to the category information sequence 105 is obtained. Each piece of category information in the category information sequence 105 is category information corresponding to the target text 1041. The category information sequence 105 includes: category information 1051, category information 1052, and category information 1053. The thesaurus instance information sequence 106 includes: thesaurus instance information 1061, thesaurus instance information 1062, and thesaurus instance information 1063.

In the second step, the thesaurus instance information of the first initial thesaurus instance is replaced sequentially, in the order of the thesaurus instance information in the thesaurus instance information sequence 106, to perform word segmentation on the target text 1041. In this application scenario, the thesaurus instance information of the first initial thesaurus instance 102 is replaced with thesaurus instance information 1061 to obtain a first initial thesaurus instance 107. The thesaurus instance information 1061 of the first initial thesaurus instance 107 is replaced with thesaurus instance information 1062 to obtain a first initial thesaurus instance 108. The thesaurus instance information 1062 of the first initial thesaurus instance 108 is replaced with thesaurus instance information 1063 to obtain a first initial thesaurus instance 109. The target text 1041 is segmented according to the first initial thesaurus instance 107, the first initial thesaurus instance 108, and the first initial thesaurus instance 109.

The electronic device 101 may be hardware or software. When the electronic device is hardware, it may be implemented as a distributed cluster formed by a plurality of servers or terminal devices, or as a single server or a single terminal device. When the electronic device is software, it may be installed in the hardware devices listed above. It may be implemented, for example, as multiple pieces of software or software modules providing distributed services, or as a single piece of software or software module. No particular limitation is imposed here.

Here, the electronic device 101 may also be a task execution end (Executor end) in the Spark computing engine. Spark is a fast, general-purpose computing engine designed for large-scale data processing.

It should be understood that the number of electronic devices in fig. 1 is merely illustrative. There may be any number of electronic devices, as desired for implementation.

With continued reference to fig. 2, a flow 200 of some embodiments of a text-segmentation method according to the present disclosure is shown. The text word segmentation method comprises the following steps:

Step 201, create a first initial thesaurus instance.

In some embodiments, an execution subject of the text word segmentation method (e.g., the electronic device shown in fig. 1) may create a first initial thesaurus instance. The thesaurus instance may be a dictionary tree (trie). For example, the first initial thesaurus instance may be a jieba thesaurus instance.

As an example, the execution subject may invoke a target interface to create the jieba thesaurus instance.

Step 202, for each target text corresponding to the target task information, performing a thesaurus instance information replacement step:

The target task information may be information characterizing the target task. For example, the target task information may be identification information of the target task. The target task information corresponds to a plurality of target texts, and the target task corresponding to the target task information is a task of performing word segmentation on those target texts. As an example, the target text may be title information of the article.

Step 2021, obtain the thesaurus instance information sequence corresponding to the category information sequence.

In some embodiments, the execution subject may obtain the thesaurus instance information sequence corresponding to the category information sequence in a wired manner or a wireless manner. Each piece of category information in the category information sequence is category information corresponding to the target text. Each piece of category information has one-to-one corresponding thesaurus instance information, so the category information sequence has a corresponding thesaurus instance information sequence. The category information may be three-level category information. The thesaurus instance information may be key information of the thesaurus instance, for example, the thesaurus identification information of the thesaurus instance.

Here, corresponding thesaurus instance information is established for each piece of category information in the category information sequence, so that subsequent word segmentation of the target text is more accurate and efficient.

As an example, the following table shows:

Category information	Thesaurus instance information
First category information	First thesaurus instance information
Second category information	Second thesaurus instance information
Third category information	Third thesaurus instance information

It should be noted that at least one piece of item information exists in the target text, and the pieces of item information appear at different positions in the target text. Sorting the at least one piece of item information by order of appearance in the target text yields an item information sequence. Since each piece of item information has corresponding category information, the item information sequence has a corresponding category information sequence.
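
The derivation of the category information sequence described above might be sketched as follows. The sketch assumes that item information can be located by plain substring search and that a mapping from item information to category information is available; both the function `category_sequence` and the sample catalog are hypothetical and are not specified by the disclosure.

```python
from typing import Dict, List


def category_sequence(text: str, item_to_category: Dict[str, str]) -> List[str]:
    """Order items by first appearance in the text and map each to its category."""
    positions = []
    for item, category in item_to_category.items():
        index = text.find(item)  # -1 when the item does not occur in the text
        if index != -1:
            positions.append((index, category))
    positions.sort()  # earlier appearance first
    return [category for _, category in positions]


# Hypothetical item -> category mapping and target text.
catalog = {"laptop": "computers", "phone case": "phone accessories", "novel": "books"}
sequence = category_sequence("phone case and laptop bundle", catalog)
```

Items that do not occur in the text contribute nothing to the sequence, so the result only contains categories that are actually needed for segmenting that text.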

In some optional implementations of some embodiments, the execution subject may obtain, from a target variable, the thesaurus instance information sequence corresponding to the category information sequence. The target variable is a variable sent by a thesaurus instance processing end. The target variable stores a thesaurus instance information set in a first preset key-value pair format, and each piece of thesaurus instance information in the thesaurus instance information set has corresponding category information.

Here, the target variable may be a variable in Spark. The target variable can distribute data from the thesaurus instance processing end to each Executor end. The thesaurus instance processing end may be the driver end in Spark, that is, the Spark driver, which is the process that executes the main method of the application program. The first preset key-value pair format may be a format in which category information is used as the key and thesaurus instance information is used as the value.
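
On the executing side, reading the thesaurus instance information sequence out of the target variable amounts to a keyed lookup in the order of the category information sequence. The sketch below models the target variable as an ordinary dictionary; in a Spark job the same mapping would presumably be read from a broadcast variable, which is an assumption here. The function name and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

InstanceInfo = Tuple[List[str], List[int]]  # (word list, word frequency list)


def lookup_instance_infos(
    target_variable: Dict[str, InstanceInfo],
    categories: List[str],
) -> List[InstanceInfo]:
    """Return instance information in the same order as the category sequence."""
    # A missing category raises KeyError, since each category is expected
    # to have exactly one piece of corresponding instance information.
    return [target_variable[category] for category in categories]


# Hypothetical contents of the target variable (category -> instance information).
target_variable = {
    "phones": (["smartphone", "charger"], [120, 45]),
    "books": (["paperback"], [30]),
}
info_sequence = lookup_instance_infos(target_variable, ["books", "phones"])
```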

Optionally, the information set of the word stock instance in the target variable is generated by the following steps:

In the first step, the thesaurus instance processing end may create a second initial thesaurus instance. The second initial thesaurus instance may be a jieba thesaurus instance.

In the second step, the thesaurus instance processing end may load each word in at least one pre-acquired general word bank into the second initial thesaurus instance to obtain a first-loaded thesaurus instance. Each of the at least one general word bank may be a general word bank related to the target domain. For example, for the e-commerce domain, each word in the general word bank may be a word that is common and frequently used in the e-commerce domain.

In the third step, the thesaurus instance processing end may, in various ways and based on the first-loaded thesaurus instance, load each word in at least one category word bank to be loaded into the corresponding thesaurus instance in a target container to obtain second-loaded thesaurus instances. The target container stores each thesaurus instance in a second preset key-value pair format. The second preset key-value pair format may be a format in which category information is used as the key and the thesaurus instance is used as the value.

As an example, the following table shows:

Category information	Thesaurus instance
First category information	First thesaurus instance
Second category information	Second thesaurus instance
Third category information	Third thesaurus instance

In the fourth step, the thesaurus instance processing end may determine the thesaurus instance information of each second-loaded thesaurus instance to obtain the thesaurus instance information set.

As an example, the thesaurus instance processing end may determine the thesaurus instance information of each second-loaded thesaurus instance by querying, so as to obtain the thesaurus instance information set.

Optionally, loading, based on the first-loaded thesaurus instance, each word in the at least one category word bank to be loaded into the corresponding thesaurus instance in the target container to obtain the second-loaded thesaurus instances includes:

For each word in the at least one category word bank to be loaded, executing the following word loading steps:

In the first sub-step, the thesaurus instance processing end may determine the category information of the word.

As an example, the thesaurus instance processing end may determine the category information of the word by querying.

In the second sub-step, the thesaurus instance processing end may determine whether a thesaurus instance keyed by the category information exists in the target container.

In the third sub-step, in response to determining that a thesaurus instance keyed by the category information exists in the target container, the thesaurus instance processing end may load the word into the target thesaurus instance in the target container. The target thesaurus instance is the thesaurus instance keyed by the category information.

In the fourth sub-step, the thesaurus instance processing end may determine each loaded thesaurus instance as a second-loaded thesaurus instance.

Optionally, after determining whether a thesaurus instance using the category information as a key exists in the target container, the method further includes:

In the first sub-step, in response to determining that no thesaurus instance keyed by the category information exists in the target container, the thesaurus instance processing end may create, in the target container, a key-value entry with the category information as the key and the first-loaded thesaurus instance as the value.

As an example, the content stored in the target container includes { [ first category information, first thesaurus instance ], [ second category information, second thesaurus instance ], [ third category information, third thesaurus instance ] }. The above category information may be fourth category information. Because the target container does not hold a thesaurus instance keyed by this category information, the thesaurus instance processing end can create, in the target container, a key-value entry with the category information as the key and the first-loaded thesaurus instance as the value. The content stored in the target container then includes { [ first category information, first thesaurus instance ], [ second category information, second thesaurus instance ], [ third category information, third thesaurus instance ], [ fourth category information, first-loaded thesaurus instance ] }.

In the second sub-step, the thesaurus instance processing end may load the word into the first-loaded thesaurus instance in the key-value entry.
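The word loading steps above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the `ThesaurusInstance` class, the `load_category_words` helper and the sample categories are hypothetical, and storing a copy of the first-loaded instance under each new key (instead of sharing one object) is an assumption made here so that categories do not overwrite one another.

```python
import copy


class ThesaurusInstance:
    """Hypothetical stand-in for a thesaurus instance (a word -> frequency map)."""

    def __init__(self):
        self.words = {}

    def load(self, word, freq=1):
        self.words[word] = freq


def load_category_words(first_loaded, category_words):
    """Load (category, word) pairs into a container keyed by category.

    first_loaded: instance that already holds the general vocabulary.
    category_words: iterable of (category, word) pairs.
    """
    container = {}
    for category, word in category_words:
        if category not in container:
            # No instance keyed by this category yet: create the key-value
            # entry from the first-loaded instance (copied, by assumption).
            container[category] = copy.deepcopy(first_loaded)
        # In both branches the word ends up in the instance for its category.
        container[category].load(word)
    return container


general = ThesaurusInstance()
general.load("phone")
container = load_category_words(
    general,
    [("appliances", "inverter"), ("books", "hardcover"), ("appliances", "compressor")],
)
print(sorted(container))
print(sorted(container["appliances"].words))
```

Each category instance therefore contains the general words plus its own category words, while the general instance itself stays unchanged.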

Step 2022, according to the sequence of each thesaurus instance information in the thesaurus instance information sequence, sequentially replacing the thesaurus instance information of the first initial thesaurus instance to perform word segmentation on the target text.

In some embodiments, the execution subject may sequentially replace the thesaurus instance information of the first initial thesaurus instance, following the order of the thesaurus instance information in the thesaurus instance information sequence, so as to perform word segmentation on the target text. After each replacement, the first initial thesaurus instance can segment part of the content of the target text.

In some optional implementations of some embodiments, sequentially replacing the thesaurus instance information of the first initial thesaurus instance according to the order of the thesaurus instance information in the thesaurus instance information sequence, so as to perform word segmentation on the target text, may include the following steps:

For each piece of thesaurus instance information in the thesaurus instance information sequence, the execution subject may perform the following target text segmentation steps:

In the first sub-step, the execution subject may change the thesaurus instance information of the current first initial thesaurus instance into this thesaurus instance information to obtain a changed first initial thesaurus instance.

In the second sub-step, the execution subject may perform initialization processing on the changed first initial thesaurus instance.

By way of example, the execution subject may set the initialization state (initialized) of the changed first initial thesaurus instance to true.

In the third sub-step, the execution subject may perform word segmentation on the corresponding sub-text in the target text by using the initialized first initial thesaurus instance.

It should be noted that the target text may be composed of several sub-texts. The sub-texts may be divided according to the position at which each piece of category information appears in the target text.

In some optional implementations of some embodiments, the thesaurus instance information comprises: word list information corresponding to the thesaurus instance and word frequency list information corresponding to the thesaurus instance. The thesaurus instance may be a dictionary tree. The word list information corresponding to the thesaurus instance may be address information of the word list corresponding to the dictionary tree. The word list stores the words corresponding to the dictionary tree. The word frequency list information corresponding to the thesaurus instance may be address information of the word frequency list corresponding to the dictionary tree. The word frequency list stores the frequency of use of each word in the word list.
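The replace, initialize and segment loop can be sketched as below. This is an illustrative sketch under stated assumptions: `InstanceInfo`, `Segmenter` and `segment` are hypothetical names, Python object references stand in for the address information of the word list and word frequency list, and forward maximum matching is used only as a simple placeholder for whatever segmentation algorithm the thesaurus instance actually applies.

```python
from dataclasses import dataclass


@dataclass
class InstanceInfo:
    """Hypothetical thesaurus instance information: references to two tables."""
    word_list: frozenset  # stands in for the address of the word list
    freq_list: dict       # stands in for the address of the word frequency list


class Segmenter:
    """Hypothetical first initial thesaurus instance, reused for every category."""

    def __init__(self):
        self.info = None
        self.initialized = False

    def replace_info(self, info):
        # Only the references are swapped; no words are reloaded.
        self.info = info
        self.initialized = True

    def cut(self, text):
        if not self.initialized:
            raise RuntimeError("thesaurus instance information has not been set")
        max_len = max((len(w) for w in self.info.word_list), default=1)
        pieces, i = [], 0
        while i < len(text):
            # Try the longest candidate first; fall back to a single character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in self.info.word_list:
                    pieces.append(piece)
                    i += size
                    break
        return pieces


def segment(segmenter, info_sequence, sub_texts):
    """Segment each sub-text after swapping in its thesaurus instance information."""
    result = []
    for info, sub_text in zip(info_sequence, sub_texts):
        segmenter.replace_info(info)
        result.append(segmenter.cut(sub_text))
    return result


info_a = InstanceInfo(frozenset({"ab", "cd"}), {"ab": 3, "cd": 1})
info_b = InstanceInfo(frozenset({"abc"}), {"abc": 2})
print(segment(Segmenter(), [info_a, info_b], ["abcd", "abcd"]))
```

The same `Segmenter` object handles both sub-texts; only the information it points to changes between them, which is why the same input string is split differently under the two word lists.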

With further reference to fig. 3, a flow 300 of further embodiments of a text segmentation method according to the present disclosure is shown. The text word segmentation method comprises the following steps:

step 301, determining whether a target text to be word-segmented exists in a target text set acquired in advance.

In some embodiments, an executing subject (e.g., the electronic device shown in fig. 1) may determine whether a target text to be word-segmented exists in a pre-acquired target text set.

As an example, the execution subject may determine whether the target text to be word-segmented exists in the target text set by an uploading time point (or an online time) of each target text in the target text set.

In response to determining that the uploading time of at least one target text in the target text set is later than the target time point, the execution subject may determine that a target text to be word-segmented exists in the target text set.

In response to determining that the uploading time of every target text in the target text set is earlier than the target time point, the execution subject may determine that no target text to be word-segmented exists in the target text set.

Step 302, in response to determining that a target text to be word-segmented exists in the target text set, screening out at least one target text to be word-segmented from the target text set.

In some embodiments, in response to determining that the target text to be word-segmented exists in the target text set, the execution subject may filter out at least one target text to be word-segmented from the target text set.

As an example, the execution subject may filter out at least one target text with an upload time later than a target time point from the target text set as at least one target text to be segmented.
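The upload-time check of steps 301 and 302 can be sketched as follows. The record layout (a dict with `id` and `uploaded_at` fields), the helper name `pending_texts` and the sample dates are assumptions made for illustration only.

```python
from datetime import datetime


def pending_texts(target_texts, target_time):
    """Return the target texts whose upload time is later than target_time."""
    return [t for t in target_texts if t["uploaded_at"] > target_time]


texts = [
    {"id": 1, "uploaded_at": datetime(2021, 11, 1)},
    {"id": 2, "uploaded_at": datetime(2021, 11, 10)},
]
cutoff = datetime(2021, 11, 5)

to_segment = pending_texts(texts, cutoff)  # step 302: screen out pending texts
exists = bool(to_segment)                  # step 301: does any pending text exist?
print(exists, [t["id"] for t in to_segment])
```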

In some optional implementations of some embodiments, the filtering out at least one target text to be word-segmented from the target text set may include:

In the first step, the execution subject may obtain a set of historical texts that were word-segmented within a target time period. The target time period may be a time period before the time corresponding to the target text set. The time corresponding to the target text set may be the time at which the execution subject acquired the target text set.

In the second step, the execution subject may determine the texts that appear in both the target text set and the historical text set to obtain at least one same text.

In the third step, the execution subject may remove the at least one same text from the target text set and use the remaining target text set as the at least one target text.
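A minimal sketch of this screening, assuming texts can be compared for equality directly (for example as strings); the helper name `remove_already_segmented` is hypothetical.

```python
def remove_already_segmented(target_texts, history_texts):
    """Drop target texts that also appear in the historical (already segmented) set."""
    history = set(history_texts)
    same = [t for t in target_texts if t in history]           # second step
    remaining = [t for t in target_texts if t not in history]  # third step
    return same, remaining


same, remaining = remove_already_segmented(
    ["text a", "text b", "text c"],
    ["text b", "text z"],
)
print(same, remaining)
```

Texts that were segmented in the earlier period are skipped, so only new content is passed on to the grouping step.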

Step 303, performing data grouping processing on the at least one target text to obtain at least one target text group.

In some embodiments, the executing entity may perform data grouping processing on the at least one target text to obtain at least one target text group.

As an example, the executing entity may perform data grouping processing on the at least one target text by using a consistent hash algorithm, so as to obtain at least one target text group.

As another example, the execution subject may uniformly group the at least one target text, resulting in at least one target text group. Wherein the number of target texts in each target text group is the same.
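Both grouping examples can be sketched as below. This is an assumption-laden illustration: the function names, the use of MD5 as the ring hash, and the number of virtual nodes per group (`replicas`) are choices made here, not details given in the disclosure. Note also that uniform grouping yields equally sized groups only when the number of texts is divisible by the group size.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a point on the hash ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def group_by_consistent_hash(texts, group_names, replicas=50):
    """Assign each text to the first virtual node clockwise from its hash."""
    ring = sorted(
        (_hash(f"{name}#{i}"), name) for name in group_names for i in range(replicas)
    )
    points = [point for point, _ in ring]
    groups = {name: [] for name in group_names}
    for text in texts:
        index = bisect.bisect(points, _hash(text)) % len(ring)  # wrap around the ring
        groups[ring[index][1]].append(text)
    return groups


def group_evenly(texts, size):
    """Split texts into consecutive groups of `size` items."""
    return [texts[i:i + size] for i in range(0, len(texts), size)]


sample = [f"text-{i}" for i in range(6)]
print(group_by_consistent_hash(sample, ["group-0", "group-1"]))
print(group_evenly(sample, 3))
```

With consistent hashing, adding or removing a group moves only the texts whose ring position falls next to the affected virtual nodes, whereas uniform grouping keeps the per-task workload equal.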

Step 304, establishing task information corresponding to each target text group in the at least one target text group to obtain a task information set.

In some embodiments, the execution subject may establish task information corresponding to each of the at least one target text group, resulting in a set of task information. Wherein the task information set includes the target task information. That is, the target task information is any task information in the task information set.

As shown in fig. 4, at least one target text 401 may include: target text 4011, target text 4012, target text 4013, target text 4014, target text 4015, and target text 4016. The at least one target text group includes: a target text group 402 and a target text group 403. The target text group 402 includes: target text 4011, target text 4012, target text 4013. The target text group 403 includes: target text 4014, target text 4015, target text 4016. The task information corresponding to the target text group 402 may be first task information 404. The task information corresponding to the target text group 403 may be second task information 405.
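The mapping from target text groups to task information can be sketched as follows. The `TaskInfo` structure and its fields are hypothetical; the disclosure does not specify what task information contains beyond its correspondence to one target text group.

```python
from dataclasses import dataclass


@dataclass
class TaskInfo:
    """Hypothetical task information: an identifier plus the texts of one group."""
    task_id: int
    texts: list


def build_tasks(text_groups):
    """Create one piece of task information per target text group."""
    return [TaskInfo(task_id=i, texts=list(group)) for i, group in enumerate(text_groups)]


groups = [["t4011", "t4012", "t4013"], ["t4014", "t4015", "t4016"]]
tasks = build_tasks(groups)
for task in tasks:
    print(task.task_id, task.texts)
```

Each task can then run the thesaurus instance information replacement step over its own texts, independently of the other tasks.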

Step 305, a first initial thesaurus instance is created.

Step 306, for each target text corresponding to the target task information, executing a thesaurus instance information replacement step:

Step 3061, acquiring the thesaurus instance information sequence corresponding to the category information sequence.

Step 3062, sequentially replacing the thesaurus instance information of the first initial thesaurus instance according to the order of the thesaurus instance information in the thesaurus instance information sequence, so as to perform word segmentation on the target text.

In some embodiments, for the specific implementation of steps 305 and 306 and the technical effects thereof, reference may be made to steps 201 and 202 in the embodiment corresponding to fig. 2, which are not described herein again.

As can be seen from fig. 3, compared with the description of some embodiments corresponding to fig. 2, the flow 300 of the text word segmentation method in some embodiments corresponding to fig. 3 highlights the specific steps of screening out at least one target text to be segmented and grouping the at least one target text. Therefore, the schemes described in the embodiments perform word segmentation on at least one target text to be subjected to word segmentation through multiple tasks, so that the word segmentation efficiency is greatly improved, and the waste of computing resources is reduced.

With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a text segmentation apparatus, which correspond to those of the method embodiments shown in fig. 2, and which may be applied in various electronic devices in particular.

As shown in fig. 5, a text segmentation apparatus 500 includes: a creating unit 501 and an executing unit 502. The creating unit 501 is configured to create a first initial thesaurus instance. The executing unit 502 is configured to execute, for each target text corresponding to the target task information, a thesaurus instance information replacement step: acquiring a thesaurus instance information sequence corresponding to a category information sequence, wherein each piece of category information in the category information sequence is category information corresponding to the target text; and sequentially replacing the thesaurus instance information of the first initial thesaurus instance according to the order of the thesaurus instance information in the thesaurus instance information sequence, so as to perform word segmentation on the target text.

In some optional implementations of some embodiments, the apparatus 500 further includes: a determining unit and a screening unit (not shown in the figure). The determining unit may be configured to determine whether a target text to be word-segmented exists in a pre-acquired target text set. The screening unit may be configured to, in response to determining that a target text to be word-segmented exists in the target text set, screen out at least one target text to be word-segmented from the target text set.

In some optional implementations of some embodiments, the apparatus 500 further includes: a grouping unit and an establishing unit (not shown in the figure). The grouping unit may be configured to perform data grouping processing on the at least one target text to obtain at least one target text group. The establishing unit may be configured to establish task information corresponding to each target text group in the at least one target text group to obtain a task information set.

In some optional implementations of some embodiments, the screening unit in the apparatus 500 may be further configured to: acquire a set of historical texts that were word-segmented within a target time period; determine the texts that appear in both the target text set and the historical text set to obtain at least one same text; and remove the at least one same text from the target text set, using the remaining target text set as the at least one target text.

In some optional implementations of some embodiments, the obtaining unit in the apparatus 500 may be further configured to: obtain the thesaurus instance information sequence corresponding to the category information sequence from a target variable, wherein the target variable is a variable sent by a thesaurus instance processing end, the target variable stores a thesaurus instance information set in a first preset key-value pair format, and each piece of thesaurus instance information in the thesaurus instance information set has corresponding category information.

In some optional implementations of some embodiments, the word segmentation unit in the apparatus 500 may be further configured to: for each piece of thesaurus instance information in the thesaurus instance information sequence, execute the following target text segmentation steps: changing the thesaurus instance information of the current first initial thesaurus instance into this thesaurus instance information to obtain a changed first initial thesaurus instance; initializing the changed first initial thesaurus instance; and using the initialized first initial thesaurus instance to perform word segmentation on the corresponding sub-text in the target text.

In some optional implementations of some embodiments, the thesaurus instance information set in the target variable is generated by: creating a second initial thesaurus instance; loading each word in at least one pre-acquired general thesaurus into the second initial thesaurus instance to obtain a first-loaded thesaurus instance; loading, according to the first-loaded thesaurus instance, each word in at least one category thesaurus to be loaded into the corresponding thesaurus instance in a target container to obtain the second-loaded thesaurus instances, wherein the target container stores the thesaurus instances in a second preset key-value pair format; and determining the thesaurus instance information of each second-loaded thesaurus instance to obtain the thesaurus instance information set.

In some optional implementations of some embodiments, the loading unit in the apparatus 500 may be configured to: for each word in the at least one category thesaurus to be loaded, execute the following word loading steps: determining the category information of the word; determining whether a thesaurus instance keyed by the category information exists in the target container; in response to determining that a thesaurus instance keyed by the category information exists in the target container, loading the word into a target thesaurus instance in the target container, wherein the target thesaurus instance is the thesaurus instance keyed by the category information; and determining the loaded thesaurus instances as the second-loaded thesaurus instances.

In some optional implementations of some embodiments, the apparatus 500 further includes: a key-value entry creating unit and a word loading unit (not shown in the figure). The key-value entry creating unit may be configured to, in response to determining that no thesaurus instance keyed by the category information exists in the target container, create, in the target container, a key-value entry with the category information as the key and the first-loaded thesaurus instance as the value. The word loading unit may be configured to load the word into the first-loaded thesaurus instance in the key-value entry.

In some optional implementations of some embodiments, the thesaurus instance information includes: word list information corresponding to the thesaurus instance and word frequency list information corresponding to the thesaurus instance.

It is to be understood that the units described in the apparatus 500 correspond to the various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 500 and the units included therein, and are not described herein again.

Referring now to FIG. 6, a block diagram of an electronic device (e.g., the electronic device of FIG. 1) 600 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 6, the electronic device 600 may include a processing device (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage device 608 into a Random Access Memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various devices, it is to be understood that not all illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.

In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network through the communication device 609, or installed from the storage device 608, or installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of some embodiments of the present disclosure.

It should be noted that the computer readable medium described above in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.

The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: create a first initial thesaurus instance; and, for each target text corresponding to the target task information, execute a thesaurus instance information replacement step: acquiring a thesaurus instance information sequence corresponding to a category information sequence, wherein each piece of category information in the category information sequence is category information corresponding to the target text; and sequentially replacing the thesaurus instance information of the first initial thesaurus instance according to the order of the thesaurus instance information in the thesaurus instance information sequence, so as to perform word segmentation on the target text.

Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in some embodiments of the present disclosure may be implemented by software, and may also be implemented by hardware. The described units may also be provided in a processor, which may be described as: a processor including a creating unit and an executing unit. In some cases the names of these units do not constitute a limitation on the units themselves; for example, the creating unit may also be described as a "unit that creates a first initial thesaurus instance".

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept as defined above. For example, a technical solution may be formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the embodiments of the present disclosure.

Claims

1. A text word segmentation method, comprising:

creating a first initial thesaurus instance;

and for each target text corresponding to target task information, executing a thesaurus instance information replacement step:

acquiring a thesaurus instance information sequence corresponding to a category information sequence, wherein each piece of category information in the category information sequence is category information corresponding to the target text;

and sequentially replacing the thesaurus instance information of the first initial thesaurus instance according to the order of the thesaurus instance information in the thesaurus instance information sequence, so as to perform word segmentation on the target text.

2. The method of claim 1, wherein before performing thesaurus instance information replacement for each target text to which target task information corresponds, the method further comprises:

determining whether a target text to be subjected to word segmentation exists in a target text set acquired in advance;

and in response to determining that the target text to be subjected to word segmentation exists in the target text set, screening at least one target text to be subjected to word segmentation from the target text set.

3. The method of claim 2, wherein the method further comprises:

performing data grouping processing on the at least one target text to obtain at least one target text group;

and establishing task information corresponding to each target text group in the at least one target text group to obtain a task information set.

4. The method of claim 2, wherein the screening out at least one target text to be word-segmented from the set of target texts comprises:

acquiring a set of historical texts that were word-segmented within a target time period;

determining the same text between the target text set and the historical text set to obtain at least one same text;

and removing the at least one same text from the target text set to obtain a removed target text set as the at least one target text.

5. The method according to claim 1, wherein the obtaining of the thesaurus instance information sequence corresponding to the category information sequence comprises:

obtaining the thesaurus instance information sequence corresponding to the category information sequence from a target variable, wherein the target variable is a variable sent by a thesaurus instance processing end, the target variable stores a thesaurus instance information set in a first preset key-value pair format, and each piece of thesaurus instance information in the thesaurus instance information set has corresponding category information.

6. The method of claim 1, wherein the sequentially replacing the thesaurus instance information of the first initial thesaurus instance according to the order of the thesaurus instance information in the thesaurus instance information sequence, so as to perform word segmentation on the target text, comprises:

for each piece of thesaurus instance information in the thesaurus instance information sequence, executing the following target text segmentation steps:

changing the thesaurus instance information of the current first initial thesaurus instance into this thesaurus instance information to obtain a changed first initial thesaurus instance;

initializing the changed first initial thesaurus instance;

and using the initialized first initial thesaurus instance to perform word segmentation on the corresponding sub-text in the target text.

7. The method of claim 5, wherein the thesaurus instance information set in the target variable is generated by:

creating a second initial thesaurus instance;

loading each word in at least one pre-acquired general thesaurus into the second initial thesaurus instance to obtain a first-loaded thesaurus instance;

loading, according to the first-loaded thesaurus instance, each word in at least one category thesaurus to be loaded into a corresponding thesaurus instance in a target container to obtain second-loaded thesaurus instances, wherein the target container stores the thesaurus instances in a second preset key-value pair format;

and determining the thesaurus instance information of each second-loaded thesaurus instance to obtain the thesaurus instance information set.

8. The method of claim 7, wherein the loading, according to the first loaded thesaurus instance, each word in at least one category thesaurus to be loaded to a corresponding thesaurus instance in a target container to obtain each second loaded thesaurus instance comprises:

for each word in the at least one category thesaurus to be loaded, performing the following word loading steps:

determining category information of the word;

determining whether a thesaurus instance keyed by the category information exists in the target container;

in response to determining that a thesaurus instance keyed by the category information exists in the target container, loading the word into a target thesaurus instance in the target container, wherein the target thesaurus instance is the thesaurus instance keyed by the category information;

and determining the loaded thesaurus instances as the second-loaded thesaurus instances.

9. The method of claim 8, wherein after said determining whether a thesaurus instance keyed by said category information exists in said target container, said method further comprises:

in response to determining that no thesaurus instance keyed by the category information exists in the target container, creating, in the target container, a key-value entry with the category information as the key and the first-loaded thesaurus instance as the value;

and loading the word into the first-loaded thesaurus instance in the key-value entry.

10. The method of claim 1, wherein the thesaurus instance information comprises: word list information corresponding to the thesaurus instance and word frequency list information corresponding to the thesaurus instance.

11. A text segmentation apparatus comprising:

a creating unit configured to create a first initial thesaurus instance;

an executing unit configured to execute, for each target text corresponding to target task information, a thesaurus instance information replacement step: acquiring a thesaurus instance information sequence corresponding to a category information sequence, wherein each piece of category information in the category information sequence is category information corresponding to the target text; and sequentially replacing the thesaurus instance information of the first initial thesaurus instance according to the order of the thesaurus instance information in the thesaurus instance information sequence, so as to perform word segmentation on the target text.

12. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-10.

13. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-10.