CN114548089A - Text word segmentation method based on big data and related equipment thereof - Google Patents

Text word segmentation method based on big data and related equipment thereof

Info

Publication number
CN114548089A
CN114548089A
Authority
CN
China
Prior art keywords
candidate
word
ternary
quaternary
binary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210086271.8A
Other languages
Chinese (zh)
Inventor
李斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd
Priority to CN202210086271.8A
Publication of CN114548089A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application belongs to the fields of big data and artificial intelligence, and relates to a text word segmentation method, a text word segmentation device, computer equipment and a storage medium based on big data. The method comprises: preprocessing a text to be processed to obtain a corpus consisting of Chinese character type lemmas and non-Chinese character type lemmas; putting the non-Chinese character type lemmas in the corpus into a unary candidate word library; obtaining a binary text segment set, a ternary text segment set and a quaternary text segment set by using a front-back splicing word segmentation method; and deleting the binary candidate words, ternary candidate words and quaternary candidate words which meet preset conditions in the binary, ternary and quaternary text segment sets to obtain a binary candidate word library, a ternary candidate word library and a quaternary candidate word library. The application also relates to blockchain technology: the unary, binary, ternary and quaternary candidate word libraries are stored in the blockchain. The method and the device shorten the length of the candidate word library.

Description

Text word segmentation method based on big data and related equipment thereof
Technical Field
The application relates to the technical field of big data and artificial intelligence, in particular to a text word segmentation method and device based on big data, computer equipment and a storage medium.
Background
With economic globalization, communication at home and abroad has become frequent and mixed Chinese and English text is now common, so the need to extract words from mixed Chinese and English corpora is growing. In the prior art, Chinese and English text segments are all segmented as if every character were a Chinese character, which expands the corpus and requires a large amount of calculation.
Disclosure of Invention
The embodiment of the application aims to provide a text word segmentation method and device based on big data, a computer device and a storage medium, so as to solve the problems of corpus expansion and large calculation amount.
In order to solve the above technical problem, an embodiment of the present application provides a text word segmentation method based on big data, which adopts the following technical solutions:
preprocessing a text to be processed to obtain a corpus consisting of Chinese character type lemmas and non-Chinese character type lemmas;
putting the non-Chinese character type word elements in the corpus into a unary candidate word library;
carrying out front-back splicing word segmentation on the Chinese character type word elements in the corpus by utilizing binary word segmentation, ternary word segmentation and quaternary word segmentation to obtain a binary text segment set, a ternary text segment set and a quaternary text segment set;
and deleting the binary candidate words, the ternary candidate words and the quaternary candidate words which meet preset conditions in the binary text segment set, the ternary text segment set and the quaternary text segment set to obtain a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank.
In order to solve the above technical problem, an embodiment of the present application further provides a text word segmentation device based on big data, which adopts the following technical solutions:
the processing module is used for preprocessing the text to be processed to obtain a corpus consisting of Chinese character type lemmas and non-Chinese character type lemmas;
the induction module is used for putting the non-Chinese character type word elements in the corpus into a unary candidate word bank;
the word segmentation module is used for splicing and segmenting words of Chinese character types in the corpus in a front-back manner by utilizing binary word segmentation, ternary word segmentation and quaternary word segmentation to obtain a binary text segment set, a ternary text segment set and a quaternary text segment set;
and the deleting module is used for deleting the binary candidate words, the ternary candidate words and the quaternary candidate words which meet preset conditions in the binary text segment set, the ternary text segment set and the quaternary text segment set to obtain a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the text word segmentation method based on big data when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the text word segmentation method based on big data.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects: the text to be processed is preprocessed, the Chinese character lemmas and non-Chinese character lemmas in the text are identified, and the lemma types are marked to obtain a corpus consisting of lemmas of the Chinese character type and the non-Chinese character type. Each non-Chinese character phrase is separated out as a single lemma, and English phrases spliced from meaningless letters are screened out. Although the length of the non-Chinese character candidate word library increases, the length of the whole candidate word library decreases as the non-Chinese characters are removed, achieving the beneficial effect of reducing the length of the whole candidate word library.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a big-data based text-tokenization method according to the present application;
FIG. 3 is a schematic block diagram of an embodiment of a big data based text segmentation apparatus according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 101, 102, 103 to interact with a server 105 over a network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the text word segmentation method based on big data provided in the embodiments of the present application is generally executed by a server, and accordingly, a text word segmentation apparatus based on big data is generally disposed in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of a big-data based text-tokenization method in accordance with the present application is shown. The text word segmentation method based on the big data comprises the following steps:
step S201, preprocessing a text to be processed to obtain a corpus composed of Chinese character type lemmas and non-Chinese character type lemmas;
in this embodiment, an electronic device (for example, a server shown in fig. 1) on which the text word segmentation method based on big data operates may communicate with the terminal through a wired connection manner or a wireless connection manner. It should be noted that the wireless connection means may include, but is not limited to, a 3G/4G connection, a WiFi connection, a bluetooth connection, a WiMAX connection, a Zigbee connection, a uwb (ultra wideband) connection, and other wireless connection means now known or developed in the future.
The text to be processed may be text in which Chinese characters are mixed with non-Chinese characters such as English words and numbers, or it may be pure Chinese text such as classic literary works, current-affairs articles and scientific papers. The preprocessing comprises removing spaces, punctuation marks and the like, and also comprises identifying non-Chinese characters and marking the Chinese characters and non-Chinese characters in the text. The lemma types comprise the Chinese character type and the non-Chinese character type. After the spaces and punctuation marks are removed from the text to be processed and the lemma types are marked, a corpus consisting of lemmas of the Chinese character type and lemmas of the non-Chinese character type is obtained. A lemma of the Chinese character type occupies more memory space than a lemma of the non-Chinese character type, so the two types can be distinguished by the memory space they occupy.
Step S202, putting the non-Chinese character type word elements in the corpus into a unary candidate word library;
the single English letter is meaningless, and the non-Chinese character type word elements in the corpus, such as English word group spring, number 3 and the like, are placed into the unary candidate word library to screen out the non-Chinese characters, so that although the length of the non-Chinese character word elements is increased, the unary candidate word library is increased, meaningless English spliced word groups are screened out, the length of the whole candidate word library is shortened, the integral calculation capacity is optimized, and the word segmentation efficiency is improved.
Step S203, carrying out front-back splicing word segmentation on the Chinese character type word elements in the corpus by utilizing binary word segmentation, ternary word segmentation and quaternary word segmentation to obtain a binary text segment set, a ternary text segment set and a quaternary text segment set;
and segmenting the word elements of the Chinese character types in the pre-material library by utilizing a binary segmentation and pre-segmentation and post-splicing method to obtain a binary text segment set. If corpus is { ai,ai+1,ai+2,ai+3,ai+4,ai+5.......ai-1+n,ai+nN belongs to N, and the binary text segment set is { (a)iai+1),(ai+1ai+2),(ai+2ai+3) ... }; in the same way, utilize threeSegmenting the word elements of the Chinese character type in the corpus by a method of splicing before and after segmentation of element segmentation words to obtain a ternary text segment set, wherein the ternary text segment set is { (a)iai+ 1ai+2),(ai+1ai+2ai+3),(ai+2ai+3ai+4) .., i ∈ N; utilizing a four-element word segmentation pre-and-post splicing method to segment the word elements of the Chinese character types in the corpus to obtain a four-element text segment set, wherein the four-element text segment set is { (a) }iai+1ai+2ai+3),(ai+1ai+2ai+3ai+4),(ai+2ai+3ai+4ai+5)...},i∈N。
And step S204, deleting the binary candidate words, the ternary candidate words and the quaternary candidate words which meet preset conditions in the binary text segment set, the ternary text segment set and the quaternary text segment set to obtain a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank.
The corpus is processed by the front-back splicing word segmentation method to obtain the binary text segment set, the ternary text segment set and the quaternary text segment set. The items in the binary text segment set are binary candidate words comprising two lemmas; similarly, the ternary candidate words comprise three lemmas and the quaternary candidate words comprise four lemmas. The preset condition may include that the frequency of occurrence of a candidate word in the corpus is less than the preset frequency set for its text segment set. The preset conditions may further include that one or both of the degree of freedom and the degree of cohesion of a candidate word are less than the corresponding thresholds. Candidate words in each text segment set that meet the preset conditions are deleted; eliminating the candidate words that do not satisfy the phrase conditions yields the binary candidate word library, the ternary candidate word library and the quaternary candidate word library. It should be noted that elements in a set cannot repeat, so repeated candidate words are automatically filtered, which reduces the lemmas in the candidate word library and avoids occupying space.
In this embodiment, the text to be processed is preprocessed to identify the Chinese character lemmas and non-Chinese character lemmas in the text, a corpus is obtained by marking the lemma types, each non-Chinese character phrase is separated out as a single lemma, and English phrases made of meaningless letters are screened out.
It should be emphasized that, in order to further ensure the privacy and security of the candidate thesaurus, the unary candidate thesaurus, the binary candidate thesaurus, the ternary candidate thesaurus and the quaternary candidate thesaurus may also be stored in a node of a block chain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated with one another by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Further, the step S201 may include:
step S2011, a preset function is called to convert the text to be processed into a byte stream, and characters in the byte stream are read sequentially;
step S2012, marking the character type of the character according to the storage space occupied by each character in the byte stream, wherein the character type comprises a Chinese character type and a non-Chinese character type;
step S2013, converting full-angle characters into half-angle characters according to the Unicode codes of the characters;
and step S2014, writing the converted half-angle characters and the half-angle characters of the text to be processed into a new text, and replacing the text to be processed to obtain the corpus.
Firstly, the text to be processed is read; the preset function is a byte stream conversion function, such as a newbyte function, and the characters in the byte stream are read sequentially. The storage space occupied by each character in the byte stream can be read, and the character type, Chinese character type or non-Chinese character type, is marked according to the storage space each character occupies. When English letters are recognized and several of them are adjacent, an English word is recognized according to the front-back combination of the letters and the degree of match with an English dictionary; for example, "s", "p", "r", "i", "n" and "g" may be combined into the English word "spring".
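A minimal sketch of marking the character type by occupied storage space, assuming the byte stream is UTF-8 (where a CJK character occupies multiple bytes and an ASCII letter or digit occupies one byte); the function name and the simplification are illustrative, not from the patent:

```python
def mark_char_type(ch):
    # A character whose UTF-8 encoding occupies more than one byte is
    # marked as Chinese-character type here (a simplification: CJK
    # characters take 3 bytes in UTF-8); single-byte characters such as
    # English letters and digits are marked as non-Chinese-character type.
    return "chinese" if len(ch.encode("utf-8")) > 1 else "non-chinese"

marked = [(ch, mark_char_type(ch)) for ch in "spring平安3"]
```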
The characters in the byte stream are read in byte-stream order and converted according to the Unicode code of each marked character, so that full-angle characters are converted into half-angle characters. Full angle refers to the various symbols in the Chinese GB2312-80 character set (Code of Chinese Graphic Character Set for Information Interchange, Primary Set); half angle refers to the various symbols in English ASCII. The half-angle characters already in the text to be processed are not converted.
After a character is marked, its type mark is kept, for example as a character-type symbol attached above or below the character, so that the mark survives the conversion. The half-angle characters are written into a new text, the half-angle characters in the new text are marked with their character types, and the new text, together with the half-angle characters of the text to be processed that did not need conversion, replaces the text to be processed to obtain the corpus. By converting the full-angle characters into half-angle characters, the Chinese characters and non-Chinese characters in the new text can be accurately identified, and the Chinese character type or non-Chinese character type is added to the lemmas in the new text to obtain the corpus.
In this embodiment, full-angle characters are converted into half-angle characters, the Chinese character lemmas and non-Chinese character lemmas are accurately identified according to the storage space the characters occupy in the byte stream, and the lemmas are marked, so that a corpus consisting of lemmas of the Chinese character type and lemmas of the non-Chinese character type is obtained.
Further, step S2013 may include the steps of:
step S20131, when the Unicode code of the character is greater than or equal to U+FF01 and less than or equal to U+FF5E, the marked character is a full-angle character, and 65248 is subtracted from the Unicode code of the full-angle character to obtain the corresponding marked half-angle character;
step S20132, when the code of the character is equal to U+3000, the marked character is a full-angle character, and the code of the full-angle character is converted into U+0020 to obtain the corresponding marked half-angle character.
When the Unicode code of a character is greater than or equal to U+FF01 and less than or equal to U+FF5E, the marked character is a full-angle character; subtracting 65248 from its Unicode code converts the full-angle character into a half-angle character, giving the half-angle character corresponding to the character. When the code of a character is equal to U+3000, the marked character is the space among the full-angle characters, and its code is converted into U+0020 to become the half-angle space. When the Unicode code of a marked character is not in either coding interval, the character is not further processed. The conversion of full-angle characters into half-angle characters can be implemented with a DBC-case function.
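The two conversion rules of steps S20131 and S20132 can be sketched as follows (a minimal illustration; the patent mentions a DBC-case function, for which this function is a stand-in):

```python
def to_half_width(text):
    out = []
    for ch in text:
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:       # full-angle letters, digits, punctuation
            out.append(chr(code - 65248))  # 65248 == 0xFEE0, maps to ASCII 0x21..0x7E
        elif code == 0x3000:               # full-angle (ideographic) space
            out.append("\u0020")
        else:                              # outside both intervals: left unchanged
            out.append(ch)
    return "".join(out)
```

For example, `to_half_width("Ｈｅｌｌｏ！")` returns `"Hello!"`, while Chinese characters pass through untouched.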
In this embodiment, full-angle-to-half-angle conversion is added to the preprocessing so that the format of the corpus is unified and matches the dictionary library, and the non-Chinese character lemmas in the mixed Chinese and English text are accurately identified according to the original Unicode codes of the characters. The Chinese character lemmas and non-Chinese character lemmas in the new text are thus accurately marked, and the unary candidate word library formed by the non-Chinese character lemmas is accurately screened out.
Further, step S204 may include the steps of:
step S2041, acquiring binary candidate words, ternary candidate words and quaternary candidate words in the binary text segment set, the ternary text segment set and the quaternary text segment set;
step S2042, respectively obtaining the frequency of each binary candidate word in the corpus, the frequency of each ternary candidate word in the corpus, and the frequency of each quaternary candidate word in the corpus;
step S2043, when the frequency of a binary candidate word, ternary candidate word or quaternary candidate word is less than the corresponding preset frequency, the preset condition is met; the binary candidate words, ternary candidate words and quaternary candidate words meeting the preset conditions are deleted to obtain the binary candidate word library, the ternary candidate word library and the quaternary candidate word library, wherein the preset conditions comprise one or more of the frequencies of the binary, ternary and quaternary candidate words being less than their corresponding preset frequencies.
The binary candidate words, ternary candidate words and quaternary candidate words in the binary, ternary and quaternary text segment sets are acquired, and the frequency of occurrence of each candidate word in the corpus is obtained. A low frequency indicates that the candidate word does not form a phrase, so the combination is meaningless and needs to be removed. The frequency of a candidate word is the ratio of the number of its occurrences in the corpus to the number of all candidate words of the same type in the corpus; for example, the frequency of a binary candidate word is the ratio of the number of occurrences of that binary candidate word in the corpus to the number of all binary candidate words in the corpus.
In this embodiment, the preset conditions include one or more of the following: the frequency of a binary candidate word is less than the preset frequency for binary candidate words, the frequency of a ternary candidate word is less than the preset frequency for ternary candidate words, and the frequency of a quaternary candidate word is less than the preset frequency for quaternary candidate words.
It should be noted that the preset frequency corresponding to each text segment set is different; the preset frequencies can be set by the user and their values adjusted according to the actual situation so as to screen out phrases precisely. If the frequency of a candidate word in the corpus is smaller than the preset frequency of the corresponding text segment set, the preset condition is met.
And if the binary candidate words meet the preset conditions, deleting all the binary candidate words meeting the preset conditions to obtain a binary candidate word library. The ternary candidate word and the quaternary candidate word are also processed in this way, and are not described herein.
In this embodiment, the frequency of each candidate word in the corpus is acquired, and meaningless candidate words that do not form phrases are eliminated according to their frequencies, so that valuable lemmas are accurately screened out and the length of the candidate word library is shortened.
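Under the frequency definition above, the deletion step can be sketched as follows (the function name and threshold handling are illustrative assumptions, not the patent's exact procedure):

```python
from collections import Counter

def build_candidate_lexicon(segments, preset_frequency):
    # Count how often each candidate word occurs among all candidate words
    # of the same type, keep those whose relative frequency reaches the
    # preset frequency, and return a set so that repeated candidates are
    # filtered automatically.
    counts = Counter(segments)
    total = len(segments)
    return {word for word, c in counts.items() if c / total >= preset_frequency}
```

The same call is applied separately to the binary, ternary and quaternary text segment sets, each with its own preset frequency.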
Further, step S204 may further include the steps of:
step S2044, acquiring binary candidate words, ternary candidate words and quaternary candidate words in the binary text segment set, the ternary text segment set and the quaternary text segment set;
step S2045, obtaining the degree of freedom and the degree of condensation of each binary candidate word, ternary candidate word and quaternary candidate word;
step S2046, if the degree of freedom of a binary, ternary or quaternary candidate word is less than the preset degree of freedom corresponding to that candidate word, or its degree of cohesion is less than the preset degree of cohesion corresponding to that candidate word, the preset condition is met; the binary candidate words, ternary candidate words and quaternary candidate words meeting the preset condition are deleted, thereby obtaining the binary candidate word library, the ternary candidate word library and the quaternary candidate word library.
In this embodiment, the degree of freedom may be obtained as follows: when a text segment appears in different contexts, it has a left adjacent word set (the set of characters adjacent to the left side of the text segment) and a right adjacent word set (the set of characters adjacent to the right side of the text segment). The information entropy of each of the two sets is calculated, and the smaller of the two information entropies is taken as the degree of freedom. The degree of cohesion is the ratio of the probability of the candidate word appearing in the corpus to the product of the probabilities of the single lemmas of the candidate word. The preset degree of freedom and preset degree of cohesion corresponding to each text segment set are different and can be set as required. When the degree of freedom of a candidate word is less than the preset degree of freedom of the corresponding text segment set, or the degree of cohesion of a candidate word is less than the preset degree of cohesion of the corresponding text segment set, the preset condition is met.
And when the binary candidate words meet the preset conditions, deleting all candidate words meeting the preset conditions from the binary text segment set to obtain a binary candidate word library. The same is true for the ternary candidate word and the quaternary candidate word, and the description is omitted.
In this embodiment, the relations between the characters in a candidate word are reflected by calculating its degree of freedom and degree of cohesion; the higher the degree of freedom and degree of cohesion, the higher the probability that the candidate word forms a word. This improves the word-formation validity of the multi-element candidate word libraries, further reduces the lemmas in the candidate word libraries, and enables more accurate word segmentation.
Further, the degree of cohesion of a candidate word is equal to the ratio of the probability of the candidate word appearing in the corpus to the product of the probabilities of the single lemmas of the candidate word in the corpus; the degree of freedom H of the candidate word is as follows:
H = \min\{s', s''\};

wherein

s' = -\sum_{i=1}^{K} \frac{n_{b_i}}{\sum_{j=1}^{K} n_{b_j}} \log \frac{n_{b_i}}{\sum_{j=1}^{K} n_{b_j}}, \qquad s'' = -\sum_{i=1}^{M} \frac{n_{m_i}}{\sum_{j=1}^{M} n_{m_j}} \log \frac{n_{m_i}}{\sum_{j=1}^{M} n_{m_j}};

H represents the degree of freedom of the candidate word; s' is the right entropy of the candidate word and s'' is the left entropy of the candidate word, and the smaller of the two information entropies, computed over the right adjacent word set and the left adjacent word set respectively, is taken as the degree of freedom. b_i belongs to the right adjacent word set of the candidate word, n_{b_i} denotes the frequency with which b_i appears on the right side of the candidate word, and K represents the number of word elements in the right adjacent word set of the candidate word; m_i belongs to the left adjacent word set of the candidate word, n_{m_i} denotes the frequency with which m_i appears on the left side of the candidate word, and M represents the number of word elements in the left adjacent word set of the candidate word.
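A minimal sketch of this degree-of-freedom computation in Python (the function names and the natural-logarithm base are illustrative assumptions; the patent does not prescribe an implementation):

```python
import math
from collections import Counter

def adjacency_entropy(neighbors):
    # Shannon entropy of the multiset of characters observed adjacent to a
    # candidate word (either all left neighbors or all right neighbors).
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def degree_of_freedom(left_neighbors, right_neighbors):
    # H = min{s', s''}: the smaller of the left and right adjacency entropies.
    return min(adjacency_entropy(left_neighbors),
               adjacency_entropy(right_neighbors))
```

For example, a candidate that is always followed by the same character has right entropy 0, so its degree of freedom is 0 no matter how varied its left neighbors are.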
Taking the degree of cohesion of a binary candidate word as an example,

M = \frac{P(AB)}{P(A)\,P(B)},

wherein M is the degree of cohesion, P(A) and P(B) are the probabilities of the single word elements of the candidate word in the corpus, and P(AB) is the probability of the candidate word appearing in the corpus.
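A corresponding sketch of the cohesion computation, estimating probabilities by simple frequency counts over the corpus string (the counting scheme and function name are illustrative assumptions):

```python
def cohesion(candidate, corpus):
    # Degree of cohesion: P(candidate) divided by the product of the
    # probabilities of the candidate's individual word elements.
    n = len(corpus)
    p_word = corpus.count(candidate) / n
    p_product = 1.0
    for ch in candidate:
        p_product *= corpus.count(ch) / n
    return p_word / p_product
```

For the toy corpus "ABAB", the candidate "AB" gets P(AB) = 2/4 against P(A)·P(B) = 1/4, a cohesion of 2.0.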
If a candidate word is sufficiently distributed in the text segment set, the degree of cohesion of a word-forming candidate is relatively high, and its degree of freedom is greater than that of a candidate that does not form a word. If the left and right adjacent word sets of the candidate word are regarded as random variables, the information entropies of the left and right adjacent word sets reflect the randomness of the characters adjacent to the candidate word: the larger the entropy value, the richer the left or right adjacent word set, and the smaller of the two entropies is taken as the degree of freedom. A word-forming candidate also has a higher degree of cohesion, indicating that the characters in the candidate word are tightly related; when the degree of cohesion of a multi-element candidate word is calculated over its possible splits, the smaller value is taken as the degree of cohesion of the candidate word.
In this embodiment, candidate words that are poorly distributed, or whose characters are only loosely related, are screened out according to the distribution of the candidate word's left and right adjacent word sets and the degree of association between the characters in the candidate word, and the candidate word banks are further optimized to retain truly valuable and meaningful candidate words.
Further, before step S204, the method further includes the following steps:
traversing candidate words in each text fragment set;
step S205, determining whether the candidate word is included in the dictionary lexicon corresponding to the candidate word;
in step S206, if the candidate word is included in the corresponding dictionary lexicon, the candidate word is deleted from the corresponding text segment set.
In this embodiment, a standardized dictionary thesaurus is called, the dictionary thesaurus is an existing standardized word group dictionary, the binary text fragment set corresponds to the binary dictionary thesaurus, the ternary text fragment set corresponds to the ternary dictionary thesaurus, and the quaternary text fragment set corresponds to the quaternary dictionary thesaurus. And through traversing the candidate words in each text segment set, judging whether the candidate words belong to the dictionary word bank corresponding to the candidate words, and if the candidate words are contained in the corresponding dictionary word bank, deleting the candidate words from the corresponding text segment set. And deleting repeated word elements in the existing word stock dictionary, and further reducing the length of the candidate word stock.
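The dictionary screening described here reduces to a set difference; a sketch (the names are illustrative; any standardized lexicon of the matching n-gram length can be passed in):

```python
def screen_against_dictionary(segment_set, dictionary_lexicon):
    # Delete any candidate already recorded in the standard dictionary
    # lexicon of the same n-gram length, shrinking the candidate set.
    return {w for w in segment_set if w not in dictionary_lexicon}
```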
In this embodiment, the candidate words in the text segment set are compared with the dictionary lexicon, and the candidate words already contained in the dictionary lexicon are screened out, so that the number of the candidate words in the text segment set is greatly reduced, the calculation amount is reduced, and the working efficiency is improved.
The embodiments of the present application may acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict ordering restriction on these steps, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
As shown in fig. 3, the big data-based text word segmentation apparatus 300 according to this embodiment includes: a processing module 301, a summarization module 302, a segmentation module 303, and a deletion module 304, wherein:
the processing module 301 is configured to pre-process a text to be processed to obtain a corpus composed of chinese character type lemmas and non-chinese character type lemmas;
induction module 302, for placing the non-Chinese character type word elements in the corpus into a unary candidate word library;
the word segmentation module 303 is configured to perform front-back splicing word segmentation on the Chinese character type word elements in the corpus by using binary word segmentation, ternary word segmentation and quaternary word segmentation to obtain a binary text segment set, a ternary text segment set and a quaternary text segment set;
a deleting module 304, configured to delete the binary candidate words, the ternary candidate words and the quaternary candidate words that satisfy preset conditions in the binary text segment set, the ternary text segment set and the quaternary text segment set, to obtain a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank.

In this embodiment, a text to be processed is preprocessed, the Chinese word elements and non-Chinese word elements in the text to be processed are identified, and after the word element types are marked, a corpus composed of the Chinese word elements and the non-Chinese word elements is obtained; a non-Chinese phrase is separated as one word element, and meaningless English letter phrases are screened out.
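The front-back splicing word segmentation performed by the word segmentation module 303 can be sketched as a sliding window over the sequence of Chinese word elements (a minimal illustration under assumed function names; the patent does not prescribe this implementation):

```python
def splice_ngrams(lemmas, n):
    # Slide a window of n adjacent Chinese word elements over the token
    # sequence and join each window into one candidate text segment.
    return ["".join(lemmas[i:i + n]) for i in range(len(lemmas) - n + 1)]

def build_segment_sets(lemmas):
    # Binary, ternary and quaternary text segment sets in one pass.
    return {n: splice_ngrams(lemmas, n) for n in (2, 3, 4)}
```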
In some optional implementations of this embodiment, the processing module 301 includes: a reading subunit, a marking subunit, a conversion subunit and a writing subunit, wherein:
the reading subunit is used for calling a preset function to convert the text to be processed into a byte stream and sequentially reading characters in the byte stream;
the marking subunit is used for marking the character types of the characters according to the storage space occupied by each character in the byte stream, wherein the character types comprise Chinese character types and non-Chinese character types;
the conversion subunit is used for converting the full-angle character into the half-angle character according to the Unicode code of each character;
and the writing subunit is used for writing the converted half-angle characters and the half-angle characters of the text to be processed into a new text, and replacing the text to be processed to obtain the corpus.
In this embodiment, a corpus consisting of chinese type lemmas and non-chinese type lemmas is obtained by converting full-angle characters into half-angle characters, accurately identifying chinese characters and non-chinese type lemmas according to storage space occupied by the characters in the byte stream, and labeling the lemmas.
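Marking character types by occupied storage space can be sketched as follows, assuming a UTF-8 byte stream (the encoding and the three-byte threshold are assumptions; the patent does not fix either):

```python
def mark_character_types(text):
    # CJK characters occupy 3 bytes in UTF-8, while ASCII letters, digits
    # and punctuation occupy 1 byte, so byte width separates the two types.
    marked = []
    for ch in text:
        width = len(ch.encode("utf-8"))
        marked.append((ch, "chinese" if width >= 3 else "non-chinese"))
    return marked
```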
In some optional implementations of this embodiment, the converting subunit includes: a Chinese character conversion subunit and a space conversion subunit, wherein:
a Chinese character conversion subunit, configured to, when the Unicode code of the character is greater than or equal to U+FF01 and less than or equal to U+FF5E, determine that the marked character is a full-angle character, and subtract 65248 from the Unicode code of the full-angle character to obtain the corresponding marked half-angle character;

and a space conversion subunit, configured to, when the Unicode code of the character is equal to U+3000, determine that the marked character is a full-angle space, and convert its code into U+0020 to obtain the corresponding marked half-angle character.
In this embodiment, the format of the text base is unified by adding the preprocessing of converting full angles into half angles so as to match with the dictionary base, and the non-chinese character lemmas in the mixed chinese and english text are accurately identified according to the original Unicode codes of the characters, so that the chinese characters and the non-chinese character lemmas in the new text are accurately marked, and the unary candidate lexicon formed by the non-chinese character lemmas is accurately screened out.
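The two conversion rules stated above combine into one direct sketch:

```python
def to_halfwidth(text):
    out = []
    for ch in text:
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:
            # Full-angle ASCII variants: shift the code point down by 65248.
            out.append(chr(code - 65248))
        elif code == 0x3000:
            # Full-angle (ideographic) space becomes an ordinary space U+0020.
            out.append("\u0020")
        else:
            out.append(ch)
    return "".join(out)
```

For example, "ＡＢ！" converts to "AB!", while ordinary Chinese characters pass through unchanged.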
In some optional implementations of this embodiment, the deleting module includes: a lemma obtaining subunit, a frequency obtaining subunit and a comparing subunit, wherein:
the word element acquiring subunit is used for acquiring binary candidate words, ternary candidate words and quaternary candidate words in the binary text segment set, the ternary text segment set and the quaternary text segment set;
the frequency acquisition subunit is configured to acquire the frequency of each binary candidate word in the corpus, the frequency of each ternary candidate word in the corpus, and the frequency of each quaternary candidate word in the corpus, respectively;
and the comparison subunit is configured to determine that the preset condition is satisfied when the frequency of a binary candidate word, a ternary candidate word or a quaternary candidate word is less than the corresponding preset frequency, and to delete the binary candidate words, the ternary candidate words and the quaternary candidate words that satisfy the preset condition to obtain a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank, wherein the preset condition includes that one or more of the frequencies of the binary candidate words, the ternary candidate words and the quaternary candidate words are less than the corresponding preset frequencies.
In the embodiment, by acquiring the frequency of the candidate words in the corpus and eliminating the meaningless candidate words which do not form phrases according to the frequency of the candidate words, valuable lemmas are accurately screened out, and the length of the candidate lexicon is shortened.
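The frequency screening performed by the comparison subunit reduces to counting occurrences and thresholding (a sketch; the counting method and names are illustrative):

```python
from collections import Counter

def filter_by_frequency(candidates, min_freq):
    # Count each candidate's occurrences and delete those whose frequency
    # falls below the preset threshold for that n-gram order.
    counts = Counter(candidates)
    return {w for w, c in counts.items() if c >= min_freq}
```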
In some optional implementations of this embodiment, the deleting module further includes: a word element acquisition subunit, a calculation subunit and a relationship comparison subunit, wherein:
the word element acquiring subunit is used for acquiring binary candidate words, ternary candidate words and quaternary candidate words in the binary text segment set, the ternary text segment set and the quaternary text segment set;
the calculation subunit is used for acquiring the degree of freedom and the degree of condensation of each candidate word;
and the relation comparison subunit is configured to, if the degree of freedom of the candidate word is less than a preset degree of freedom of the candidate word or the degree of aggregation of the candidate word is less than a preset degree of aggregation corresponding to the candidate word, satisfy the preset condition.
In this embodiment, the relationship between the characters in a candidate word is reflected by calculating the degree of freedom and the degree of cohesion of the candidate word; the higher the degree of freedom and the degree of cohesion, the higher the probability that the candidate word forms a word. This improves the word-forming effectiveness of the multi-element candidate word banks, further reduces the word elements in the preset word bank, and makes word segmentation more accurate.
In some optional implementation manners of this embodiment, the relationship obtaining subunit further includes a calculating subunit, where:
the calculating subunit is configured to calculate the degree of cohesion of the candidate word as the ratio of the probability of the candidate word appearing in the corpus to the product of the probabilities of the single word elements of the candidate word in the corpus; the degree of freedom H of the candidate word is as follows:
H = \min\{s', s''\};

wherein

s' = -\sum_{i=1}^{K} \frac{n_{b_i}}{\sum_{j=1}^{K} n_{b_j}} \log \frac{n_{b_i}}{\sum_{j=1}^{K} n_{b_j}}, \qquad s'' = -\sum_{i=1}^{M} \frac{n_{m_i}}{\sum_{j=1}^{M} n_{m_j}} \log \frac{n_{m_i}}{\sum_{j=1}^{M} n_{m_j}};

H represents the degree of freedom of the candidate word; s' is the right entropy of the candidate word and s'' is the left entropy of the candidate word, and the smaller of the two information entropies, computed over the right adjacent word set and the left adjacent word set respectively, is taken as the degree of freedom. b_i belongs to the right adjacent word set of the candidate word, n_{b_i} denotes the frequency with which b_i appears on the right side of the candidate word, and K represents the number of word elements in the right adjacent word set of the candidate word; m_i belongs to the left adjacent word set of the candidate word, n_{m_i} denotes the frequency with which m_i appears on the left side of the candidate word, and M represents the number of word elements in the left adjacent word set of the candidate word.
In this embodiment, candidate words that are poorly distributed, or whose characters are only loosely related, are screened out according to the distribution of the candidate word's left and right adjacent word sets and the degree of association between the characters in the candidate word, and the candidate word banks are further optimized to retain truly valuable and meaningful candidate words.
In some optional implementations of the present embodiment, the big data based text word segmentation apparatus 300 further includes a determining module and a selecting module, wherein:
the judging module is used for judging whether the candidate words are contained in the dictionary word stock corresponding to the candidate words;
and the selection module is used for deleting the candidate words from the corresponding text fragment set if the candidate words are contained in the corresponding dictionary word bank.
In this embodiment, the candidate words in the text segment set are compared with the dictionary lexicon, and the candidate words already contained in the dictionary lexicon are screened out, so that the number of the candidate words in the text segment set is greatly reduced, the computation amount is reduced, and the working efficiency is improved.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 9, fig. 9 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43, communicatively connected to each other via a system bus. It is noted that only a computer device 4 having components 41-43 is shown, but it should be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various types of application software, such as computer readable instructions of a text word segmentation method based on big data. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or process data, such as executing computer readable instructions of the big data based text word segmentation method.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
The computer device provided in this embodiment may perform the steps of the text word segmentation method based on big data. The steps of the big data based text word segmentation method herein may be the steps in the big data based text word segmentation methods of the various embodiments described above.
In this embodiment, a text to be processed is preprocessed, chinese and non-chinese character lemmas in the text to be processed are identified, a corpus composed of chinese character type lemmas and non-chinese character type lemmas is obtained after the lemmas are labeled, non-chinese character phrases are separated as one lemma, and english phrases with meaningless letters are screened out.
The present application further provides another embodiment, which is to provide a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the big-data based text segmentation method as described above.
In this embodiment, a text to be processed is preprocessed, chinese and non-chinese character lemmas in the text to be processed are identified, a corpus composed of chinese character type lemmas and non-chinese character type lemmas is obtained after the lemmas are labeled, non-chinese character phrases are separated as one lemma, and english phrases with meaningless letters are screened out.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely illustrative and not restrictive of the present application, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. The present application is capable of embodiments in many different forms; the embodiments described are provided so that the disclosure of the application will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the technical solutions described in the foregoing embodiments may still be modified, or some of their features may be replaced by equivalents. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A text word segmentation method based on big data is characterized by comprising the following steps:
preprocessing a text to be processed to obtain a corpus consisting of Chinese character type lemmas and non-Chinese character type lemmas;
putting the non-Chinese character type word elements in the corpus into a unary candidate word library;
carrying out front-back splicing word segmentation on the Chinese character type word elements in the corpus by utilizing binary word segmentation, ternary word segmentation and quaternary word segmentation to obtain a binary text segment set, a ternary text segment set and a quaternary text segment set;
and deleting the binary candidate words, the ternary candidate words and the quaternary candidate words which meet preset conditions in the binary text segment set, the ternary text segment set and the quaternary text segment set to obtain a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank.
2. The text word segmentation method based on big data as claimed in claim 1, wherein the step of preprocessing the text to be processed to obtain a corpus consisting of chinese type lemmas and non-chinese type lemmas comprises:
calling a preset function to convert the text to be processed into a byte stream, and sequentially reading characters in the byte stream;
marking the character type of the character according to the storage space occupied by each character in the byte stream, wherein the character type comprises a Chinese character type and a non-Chinese character type;
converting full-angle characters into half-angle characters according to the Unicode codes of the characters;
and writing the converted half-angle characters and the half-angle characters of the text to be processed into a new text, and replacing the text to be processed to obtain the corpus.
3. The big data based text segmentation method according to claim 2, wherein the step of converting full-size characters into half-size characters according to the Unicode code of each of the characters comprises:
when the Unicode code of the character is greater than or equal to U + FF01 and less than or equal to U + FF5E, the marked character is a full-angle character, and 65248 is subtracted from the Unicode code of the full-angle character to obtain a corresponding marked half-angle character;
and when the code of the character is equal to U +3000, the marked character is a full-angle character, and the code of the full-angle character is converted into U +0020 to obtain a corresponding half-angle character after marking.
4. The big data-based text word segmentation method according to claim 1, wherein the step of deleting the binary candidate words, the ternary candidate words and the quaternary candidate words that satisfy a preset condition from the binary text segment set, the ternary text segment set and the quaternary text segment set to obtain a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank comprises:
acquiring binary candidate words, ternary candidate words and quaternary candidate words in the binary text segment set, the ternary text segment set and the quaternary text segment set;
respectively acquiring the frequency of each binary candidate word in the corpus, the frequency of each ternary candidate word in the corpus and the frequency of each quaternary candidate word in the corpus;
and when the frequencies of the binary candidate words, the ternary candidate words and the quaternary candidate words are less than the corresponding preset frequencies, the preset conditions are met, the binary candidate words, the ternary candidate words and the quaternary candidate words meeting the preset conditions are deleted, and a binary candidate word library, a ternary candidate word library and a quaternary candidate word library are obtained, wherein the preset conditions comprise that one or more of the frequencies of the binary candidate words, the ternary candidate words and the quaternary candidate words are less than the corresponding preset frequencies.
5. The big data-based text word segmentation method according to claim 1, wherein the step of deleting the binary candidate words, the ternary candidate words and the quaternary candidate words that satisfy a preset condition from the binary text segment set, the ternary text segment set and the quaternary text segment set to obtain a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank comprises:
acquiring binary candidate words, ternary candidate words and quaternary candidate words in the binary text segment set, the ternary text segment set and the quaternary text segment set;
obtaining the degree of freedom and the degree of condensation of each binary candidate word, each ternary candidate word and each quaternary candidate word;
and if the degree of freedom of the binary candidate words, the ternary candidate words and the quaternary candidate words is less than the preset degree of freedom corresponding to the candidate words, or the degree of condensation of the binary candidate words, the ternary candidate words and the quaternary candidate words is less than the preset degree of condensation corresponding to the candidate words, the preset condition is met, the binary candidate words, the ternary candidate words and the quaternary candidate words meeting the preset condition are deleted, and a binary candidate word library, a ternary candidate word library and a quaternary candidate word library are obtained.
6. The big data based text word segmentation method according to claim 5, wherein the degree of cohesion of the candidate word is equal to the ratio of the probability of the candidate word appearing in the corpus to the product of the probabilities of the single word elements of the candidate word in the corpus; the degree of freedom H of the candidate word is as follows:
H = \min\{s', s''\};

wherein

s' = -\sum_{i=1}^{K} \frac{n_{b_i}}{\sum_{j=1}^{K} n_{b_j}} \log \frac{n_{b_i}}{\sum_{j=1}^{K} n_{b_j}}, \qquad s'' = -\sum_{i=1}^{M} \frac{n_{m_i}}{\sum_{j=1}^{M} n_{m_j}} \log \frac{n_{m_i}}{\sum_{j=1}^{M} n_{m_j}};

H represents the degree of freedom of the candidate word; s' is the right entropy of the candidate word and s'' is the left entropy of the candidate word, and the smaller of the two information entropies, computed over the right adjacent word set and the left adjacent word set respectively, is taken as the degree of freedom. b_i belongs to the right adjacent word set of the candidate word, n_{b_i} denotes the frequency with which b_i appears on the right side of the candidate word, and K represents the number of word elements in the right adjacent word set of the candidate word; m_i belongs to the left adjacent word set of the candidate word, n_{m_i} denotes the frequency with which m_i appears on the left side of the candidate word, and M represents the number of word elements in the left adjacent word set of the candidate word.
7. The big data-based text word segmentation method according to claim 1, wherein before the step of deleting the binary candidate words, the ternary candidate words and the quaternary candidate words that satisfy the preset conditions in the binary text segment set, the ternary text segment set and the quaternary text segment set, the step of obtaining the binary candidate word bank, the ternary candidate word bank and the quaternary candidate word bank further comprises:
traversing candidate words in each text fragment set;
judging whether the candidate word is contained in a dictionary word bank corresponding to the candidate word;
and if the candidate word is contained in the corresponding dictionary word bank, deleting the candidate word from the corresponding text segment set.
8. A big data-based text word segmentation device is characterized by comprising:
the processing module is used for preprocessing the text to be processed to obtain a corpus consisting of Chinese character type lemmas and non-Chinese character type lemmas;
the induction module is used for putting the non-Chinese character type word elements in the corpus into a unary candidate word bank;
the word segmentation module is used for splicing and segmenting words of Chinese character types in the corpus in a front-back manner by utilizing binary word segmentation, ternary word segmentation and quaternary word segmentation to obtain a binary text segment set, a ternary text segment set and a quaternary text segment set;
and the deleting module is used for deleting the binary candidate words, the ternary candidate words and the quaternary candidate words which meet preset conditions in the binary text segment set, the ternary text segment set and the quaternary text segment set to obtain a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank.
9. A computer device comprising a memory having computer readable instructions stored therein and a processor that when executed performs the steps of the big-data based text word segmentation method according to any one of claims 1 to 7.
10. A computer-readable storage medium having computer-readable instructions stored thereon which, when executed by a processor, implement the steps of the big data-based text word segmentation method according to any one of claims 1 to 7.
CN202210086271.8A 2022-01-25 2022-01-25 Text word segmentation method based on big data and related equipment thereof Pending CN114548089A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210086271.8A CN114548089A (en) 2022-01-25 2022-01-25 Text word segmentation method based on big data and related equipment thereof
Publications (1)

Publication Number Publication Date
CN114548089A true CN114548089A (en) 2022-05-27

Family

ID=81672324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210086271.8A Pending CN114548089A (en) 2022-01-25 2022-01-25 Text word segmentation method based on big data and related equipment thereof

Country Status (1)

Country Link
CN (1) CN114548089A (en)

Similar Documents

Publication Publication Date Title
CN110532381B (en) Text vector acquisition method and device, computer equipment and storage medium
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN111783471B (en) Semantic recognition method, device, equipment and storage medium for natural language
CN112632278A (en) Labeling method, device, equipment and storage medium based on multi-label classification
WO2022048363A1 (en) Website classification method and apparatus, computer device, and storage medium
CN112215008A (en) Entity recognition method and device based on semantic understanding, computer equipment and medium
CN113987125A (en) Text structured information extraction method based on neural network and related equipment thereof
WO2021218027A1 (en) Method and apparatus for extracting terminology in intelligent interview, device, and medium
CN112287069A (en) Information retrieval method and device based on voice semantics and computer equipment
CN114780746A (en) Knowledge graph-based document retrieval method and related equipment thereof
CN113722438A (en) Sentence vector generation method and device based on sentence vector model and computer equipment
CN111339166A (en) Word stock-based matching recommendation method, electronic device and storage medium
CN114398477A (en) Policy recommendation method based on knowledge graph and related equipment thereof
CN113569998A (en) Automatic bill identification method and device, computer equipment and storage medium
CN113947095A (en) Multilingual text translation method and device, computer equipment and storage medium
CN112417875B (en) Configuration information updating method and device, computer equipment and medium
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN114022891A (en) Method, device and equipment for extracting key information of scanned text and storage medium
CN112417886A (en) Intention entity information extraction method and device, computer equipment and storage medium
CN112597748A (en) Corpus generation method, apparatus, device and computer readable storage medium
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN115712722A (en) Clustering system, method, electronic device and storage medium for multi-language short message text
CN115730603A (en) Information extraction method, device, equipment and storage medium based on artificial intelligence
CN114548089A (en) Text word segmentation method based on big data and related equipment thereof
CN115967549A (en) Anti-leakage method based on internal and external network information transmission and related equipment thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination