CN114385792B - Method, device, equipment and storage medium for extracting words from work order data - Google Patents

Publication number
CN114385792B
Authority
CN
China
Prior art keywords
words
word
chinese character
word set
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210287345.4A
Other languages
Chinese (zh)
Other versions
CN114385792A (en
Inventor
汤灏
胡灿
包利安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zero Data Technology Co ltd
Beijing Zero Vision Network Technology Co ltd
Original Assignee
Beijing Zero Data Technology Co ltd
Beijing Zero Vision Network Technology Co ltd
Application filed by Beijing Zero Data Technology Co ltd and Beijing Zero Vision Network Technology Co ltd
Priority to CN202210287345.4A
Publication of CN114385792A
Application granted
Publication of CN114385792B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/247: Thesauruses; Synonyms
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Abstract

The method comprises: obtaining work order text information; processing the work order text information to obtain at least one Chinese character combination to be analyzed; calculating a solidification value and an information entropy value corresponding to each Chinese character combination to be analyzed, wherein the information entropy value comprises at least one of a left information entropy value and a right information entropy value; determining an initial word set based on the information entropy value and the solidification value corresponding to each Chinese character combination to be analyzed; and screening the initial word set to obtain a required word set. The method and the device make it convenient to obtain the key words in a work order.

Description

Method, device, equipment and storage medium for extracting words from work order data
Technical Field
The present application relates to the field of information processing, and in particular, to a method, an apparatus, a device, and a storage medium for extracting words from work order data.
Background
In fields such as government affairs, shopping reviews, and news, the content that users submit through network platforms is usually complicated, and it is not easy to sort out concise and useful information from it.
The key information in user-published content is usually carried by words in that content. At present, word segmentation is usually performed with algorithms based on hidden Markov models, conditional random fields (CRF), and the like, which yields many words. The result contains every word in the content, with the words that carry key information mixed in among all the others, so work order processing personnel cannot easily obtain the key words they need.
Disclosure of Invention
In order to obtain key words in a work order, the application provides a method, a device, equipment and a storage medium for extracting words from work order data.
In a first aspect, the present application provides a method for extracting words from work order data, which adopts the following technical scheme:
a method for extracting words from work order data, comprising:
acquiring work order text information;
processing the work order text information to obtain at least one Chinese character combination to be analyzed;
calculating a corresponding solidification value and a corresponding information entropy value of each Chinese character combination to be analyzed, wherein the information entropy value comprises at least one of a left information entropy value and a right information entropy value;
determining an initial word set based on the information entropy value and the solidification value corresponding to each Chinese character combination to be analyzed;
and screening the initial word set to obtain a required word set.
With this technical scheme, the work order text information is the content published by a user. Because the work order text information is composed of words, it must first be processed to obtain the Chinese character combinations to be analyzed, which include the words. The information entropy value and the solidification value of each Chinese character combination are then calculated. Whether a Chinese character combination is a word depends on its information entropy value and its solidification value, so all Chinese character combinations that are words, namely the initial word set, are determined from these two values. The initial word set is then screened to filter out words the processing personnel do not need, which makes it convenient to obtain the key words they do need.
In another possible implementation manner, the processing the work order text information to obtain at least one Chinese character combination to be analyzed includes:
filtering non-Chinese character information in the text information to obtain a character sequence;
scanning the character sequence forward with a preset window length and a preset step length to obtain at least one Chinese character combination;
judging whether the occurrence frequency of each Chinese character combination reaches a first preset frequency threshold value or not;
and determining the Chinese character combination with the occurrence frequency reaching a first preset frequency threshold value as the Chinese character combination to be analyzed.
With this technical scheme, the work order text information usually contains non-Chinese character information such as punctuation marks, and filtering it out makes the resulting Chinese character combinations more accurate. The character sequence is scanned forward with the preset window length and the preset step length to obtain the Chinese character combinations. If a Chinese character combination occurs fewer times than the first preset frequency threshold, it is an incidental combination and very likely has no practical meaning. Such combinations are therefore filtered out to obtain the Chinese character combinations to be analyzed, which reduces the amount of subsequent calculation.
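The scanning and frequency filtering described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `extract_candidates`, the parameters `max_len`, `step`, and `min_count`, and the choice to split the text at each removed non-Chinese character (rather than concatenating across it) are all assumptions made for the example.

```python
import re
from collections import Counter


def extract_candidates(text, max_len=4, step=1, min_count=2):
    """Return {combination: count} for combinations reaching min_count.

    Assumption: combinations are 2..max_len characters long and never
    span a removed non-Chinese character such as punctuation.
    """
    # Keep only runs of CJK unified ideographs; everything else is dropped.
    runs = re.findall(r"[\u4e00-\u9fff]+", text)
    counts = Counter()
    for run in runs:
        for window in range(2, max_len + 1):
            # Forward scan with the given window length and step length.
            for start in range(0, len(run) - window + 1, step):
                counts[run[start:start + window]] += 1
    # Discard incidental combinations below the first frequency threshold.
    return {combo: n for combo, n in counts.items() if n >= min_count}
```

With `min_count=2`, a combination that appears only once in the input is treated as incidental and dropped before any entropy or solidification calculation.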
In another possible implementation manner, the determining an initial word set based on the information entropy value and the solidification value corresponding to each Chinese character combination to be analyzed includes:
determining an information entropy threshold interval corresponding to each Chinese character combination to be analyzed based on the length of each Chinese character combination to be analyzed;
judging whether the information entropy value of each Chinese character combination to be analyzed is positioned in the corresponding information entropy threshold interval;
generating a first word set based on the Chinese character combination of which the information entropy value is positioned in the corresponding information entropy threshold value interval;
comparing the solidification value of each Chinese character combination to be analyzed in the first word set with a preset solidification threshold;
and generating an initial word set based on the Chinese character combinations whose solidification value is greater than the preset solidification threshold.
With this technical scheme, the information entropy value represents how varied the characters adjacent to a Chinese character combination on its left and right are: the more varied the Chinese characters that can appear on the left and right of a combination, the more likely the combination is a word. The solidification value represents how fixed the association between the Chinese characters inside a combination is: the more fixed it is, the more likely the combination is a word. After the information entropy value and the solidification value of each Chinese character combination are obtained, the information entropy threshold interval is determined from the length of each combination, which makes the judgment based on the information entropy value more accurate. A combination whose information entropy value lies within its threshold interval is likely to be a word. The solidification values of these combinations are then compared with the preset solidification threshold, and combinations that do not exceed the threshold are filtered out. The combinations that remain after both filters are all treated as words, and the set of these words is the initial word set.
Using both the information entropy value and the solidification value determines more accurately whether a Chinese character combination is a word.
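The two-stage filter described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the text does not give the entropy formula, so Shannon entropy over the adjacent characters (natural logarithm) is assumed, and the names `neighbor_entropy`, `left_right_entropy`, `initial_word_set`, `entropy_intervals`, and `solid_threshold` are hypothetical.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Shannon entropy (natural log) of a list of adjacent characters."""
    total = len(neighbors)
    if total == 0:
        return 0.0
    counts = Counter(neighbors)
    return -sum(c / total * math.log(c / total) for c in counts.values())


def left_right_entropy(sequence, combo):
    """Return (left entropy, right entropy) of combo within sequence."""
    left, right = [], []
    start = sequence.find(combo)
    while start != -1:
        if start > 0:
            left.append(sequence[start - 1])
        end = start + len(combo)
        if end < len(sequence):
            right.append(sequence[end])
        start = sequence.find(combo, start + 1)
    return neighbor_entropy(left), neighbor_entropy(right)


def initial_word_set(candidates, entropy_intervals, solid_threshold):
    """Filter candidates by entropy interval, then by solidification.

    candidates: {combo: (entropy_value, solidification_value)}
    entropy_intervals: {combo_length: (low, high)}, chosen per length.
    """
    first_set = {}
    for combo, (entropy, solid) in candidates.items():
        low, high = entropy_intervals[len(combo)]
        if low <= entropy <= high:
            first_set[combo] = solid
    # Keep only combinations strictly above the solidification threshold.
    return {combo for combo, solid in first_set.items() if solid > solid_threshold}
```

Which entropy value is passed in (left, right, or some combination such as the smaller of the two) is left to the caller, since the text only requires at least one of them.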
In another possible implementation manner, the screening the initial word set to obtain a required word set includes:
filtering out the stop words in the initial word set based on a preset stop word bank to obtain a second word set that excludes the stop words, wherein the stop words are words hitting the preset stop word bank;
performing synonym replacement, based on a preset synonym library, on the words in the second word set that hit the preset synonym library, to obtain a third word set;
determining the words in the third word set that hit a preset white list as required words, and generating the required word set based on the required words.
With this technical scheme, after the initial word set is determined, the stop words in it are filtered out according to the stop word bank, which reduces the data volume. Words that hit the synonym library are replaced with their synonyms so that the wording is more standardized. Words that hit the preset white list are retained; these are the words the processing personnel need. Filtering the initial word set layer by layer through the stop word bank, the synonym library, and the preset white list finally yields the words the processing personnel need.
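The three screening layers can be sketched as follows. The function name `screen_words` and the representation of the stop word bank as a set, the synonym library as a dictionary mapping a word to its standard form, and the white list as a set are assumptions made for illustration.

```python
def screen_words(initial_words, stop_words, synonyms, whitelist):
    """Return (second_set, third_set, required) as ordered lists."""
    # Layer 1: drop every word that hits the stop word bank.
    second_set = [w for w in initial_words if w not in stop_words]
    # Layer 2: replace words that hit the synonym library with their
    # standard form; dict.fromkeys removes duplicates and keeps order.
    third_set = list(dict.fromkeys(synonyms.get(w, w) for w in second_set))
    # Layer 3: keep only the words that hit the white list.
    required = [w for w in third_set if w in whitelist]
    return second_set, third_set, required
```

Returning the intermediate second and third word sets, not just the required words, keeps the third word set available for later steps that inspect words that missed the white list.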
In another possible implementation manner, the method further includes:
sorting the required words in the required word set by number of occurrences;
and outputting the required words in the required word set according to the sorting result.
With this technical scheme, after the words required by the processing personnel are obtained, they are sorted by number of occurrences and output according to the sorting result. The processing personnel can thus easily see how often each required word occurs and, in turn, which topics users are most concerned about.
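A minimal sketch of the sorting step, assuming the occurrences of required words are available as a flat list (one entry per occurrence); the function name `rank_required_words` is hypothetical.

```python
from collections import Counter


def rank_required_words(occurrences):
    """Return (word, count) pairs sorted by count in descending order."""
    return Counter(occurrences).most_common()
```

`Counter.most_common()` keeps words with equal counts in the order they were first encountered.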
In another possible implementation manner, the method further includes:
searching for dropped words in the third word set based on a corresponding relation, wherein the corresponding relation is the corresponding relation between each word in the third word set and its corresponding work order text information, the corresponding relation is established after the third word set is obtained, and a dropped word is a word that does not hit the preset white list in the work order text information corresponding to a required word;
if any dropped word meets a first preset condition, storing the dropped word in the preset white list;
the first preset condition includes at least one of:
the number of occurrences of the dropped word in the corresponding work order text information reaches a second preset frequency threshold;
and the number of times the dropped word hits historical third word sets reaches a third preset frequency threshold.
With this technical scheme, the work order text information corresponding to the required words may contain new words that belong to the field of the preset white list. The dropped words in the work orders that contain required words are therefore found through the corresponding relation between the required words and their work order text information. It is then judged whether the number of occurrences of a dropped word in the corresponding work order text information reaches the second preset frequency threshold; if it does, the dropped word is not an incidental word and belongs to the field of the preset white list. It is also judged whether the number of occurrences of a dropped word in the historical third word sets reaches the third preset frequency threshold; if it does, the dropped word has appeared many times in the historical third word sets and belongs to the field of the preset white list. Identifying the new words among the dropped words that belong to the field of the preset white list updates the preset white list, so that required words are determined more accurately afterwards.
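The white list update can be sketched as follows. The data layout is an assumption: the corresponding relation is modeled as a dictionary from a work order identifier to the per-word occurrence counts in that work order, and each historical third word set is modeled as a Python set. The names `find_dropped_words` and `update_whitelist` are hypothetical.

```python
def find_dropped_words(order_words, whitelist):
    """Return {dropped word: highest count within a single work order}.

    Only work orders that contain at least one required word (a word
    that hits the white list) are inspected.
    """
    dropped = {}
    for counts in order_words.values():
        if any(word in whitelist for word in counts):
            for word, count in counts.items():
                if word not in whitelist:
                    dropped[word] = max(dropped.get(word, 0), count)
    return dropped


def update_whitelist(order_words, whitelist, history_third_sets,
                     second_threshold, third_threshold):
    """Return a new white list extended with qualifying dropped words."""
    updated = set(whitelist)
    for word, in_order in find_dropped_words(order_words, whitelist).items():
        history_hits = sum(1 for s in history_third_sets if word in s)
        # The first preset condition holds if at least one test passes.
        if in_order >= second_threshold or history_hits >= third_threshold:
            updated.add(word)
    return updated
```

Words that occur only in work orders without any required word are never considered, which keeps unrelated vocabulary out of the white list.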
In another possible implementation manner, the method further includes:
acquiring a historical word set, wherein the historical word set is a set of required words determined in a first preset time period in the past;
determining a comparison word set from the historical word set, wherein the comparison word set is a set of required words in a second preset time period in the historical word set;
determining words in the comparison word set which meet a second preset condition,
the second preset condition comprises:
being among the first preset number of items when the words are sorted by number of occurrences in descending order;
and outputting hot spot change information based on the words meeting the second preset condition and the required word set.
With this technical scheme, the comparison word set is determined from the historical word set, the words in the comparison word set are sorted by number of occurrences in descending order, and the first preset number of words are taken. These are the hot words of the comparison word set. The change in the hot spots of user work orders can be determined from these hot words and the required words, and hot spot change information is output, so that the processing personnel can see the hot spot change more intuitively.
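A minimal sketch of the hot spot comparison. The text does not specify the format of the hot spot change information, so the three categories returned here ("persisting", "faded", "new"), the function name `hotspot_change`, and the parameter `top_n` are all assumptions.

```python
from collections import Counter


def hotspot_change(comparison_occurrences, current_required, top_n=3):
    """Compare earlier hot words with the current required word set."""
    # Hot words: the top_n words of the comparison period by count.
    previous_hot = [w for w, _ in Counter(comparison_occurrences).most_common(top_n)]
    current = set(current_required)
    return {
        "persisting": [w for w in previous_hot if w in current],
        "faded": [w for w in previous_hot if w not in current],
        "new": sorted(current - set(previous_hot)),
    }
```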
In a second aspect, the present application provides a device for extracting words from work order data, which adopts the following technical scheme:
an apparatus for extracting words from work order data, comprising:
the text acquisition module is used for acquiring work order text information;
the text processing module is used for processing the work order text information to obtain at least one Chinese character combination to be analyzed;
the calculation module is used for calculating a corresponding solidification value and a corresponding information entropy value of each Chinese character combination to be analyzed, wherein the information entropy value comprises at least one of a left information entropy value and a right information entropy value;
the first word determining module is used for determining an initial word set based on the information entropy value and the solidification value corresponding to each Chinese character combination to be analyzed;
and the word screening module is used for screening the initial word set to obtain a required word set.
With this technical scheme, the work order text information is the content published by a user, and the text acquisition module acquires it. Because the work order text information is composed of words, the text processing module must first process it to obtain the Chinese character combinations to be analyzed, which include the words. After the information entropy value and the solidification value of each Chinese character combination are calculated, the first word determining module determines from these two values all Chinese character combinations that are words, namely the initial word set. The word screening module then screens the initial word set to filter out words the processing personnel do not need, which makes it convenient to obtain the key words they do need.
In another possible implementation manner, when the text processing module processes the work order text information to obtain at least one to-be-analyzed Chinese character combination, the text processing module is specifically configured to:
filtering non-Chinese character information in the text information to obtain a character sequence;
scanning the character sequence forward with a preset window length and a preset step length to obtain at least one Chinese character combination;
judging whether the occurrence frequency of each Chinese character combination reaches a first preset frequency threshold value or not;
and determining the Chinese character combination with the occurrence frequency reaching a first preset frequency threshold value as the Chinese character combination to be analyzed.
In another possible implementation manner, when determining the initial word set based on the information entropy value and the solidification value corresponding to each Chinese character combination to be analyzed, the first word determination module is specifically configured to:
determining an information entropy threshold interval corresponding to each Chinese character combination to be analyzed based on the length of each Chinese character combination to be analyzed;
judging whether the information entropy value of each Chinese character combination to be analyzed is positioned in the corresponding information entropy threshold interval;
generating a first word set based on the Chinese character combination of which the information entropy value is positioned in the corresponding information entropy threshold value interval;
comparing the solidification value of each Chinese character combination to be analyzed in the first word set with a preset solidification threshold;
and generating an initial word set based on the Chinese character combinations whose solidification value is greater than the preset solidification threshold.
In another possible implementation manner, when the word screening module screens the initial word set to obtain a required word set, the word screening module is specifically configured to:
filtering out the stop words in the initial word set based on a preset stop word bank to obtain a second word set except the stop words, wherein the stop words are words hitting the preset stop word bank;
performing synonym replacement, based on a preset synonym library, on the words in the second word set that hit the preset synonym library, to obtain a third word set;
determining the words hit on the preset white list in the third word set as required words, and generating the required word set based on the required words.
In another possible implementation manner, the apparatus further includes:
the sorting module is used for sorting the required words in the required word set by number of occurrences;
and the first output module is used for outputting the required words in the required word set according to the sorting result.
In another possible implementation manner, the apparatus further includes:
the searching module is used for searching for dropped words in the third word set based on a corresponding relation, wherein the corresponding relation is the corresponding relation between each word in the third word set and its corresponding work order text information, the corresponding relation is established after the third word set is obtained, and a dropped word is a word that does not hit the preset white list in the work order text information corresponding to a required word;
the word storage module is used for storing any dropped word in the preset white list when the dropped word meets a first preset condition;
the first preset condition includes at least one of:
the number of occurrences of the dropped word in the corresponding work order text information reaches a second preset frequency threshold;
and the number of times the dropped word hits historical third word sets reaches a third preset frequency threshold.
In another possible implementation manner, the apparatus further includes:
the historical word set acquisition module is used for acquiring a historical word set, wherein the historical word set is a set of required words determined in a first preset time period in the past;
the second word determining module is used for determining a comparison word set from the historical word set, wherein the comparison word set is a set of required words in a second preset time period in the historical word set;
a third word determining module for determining words in the comparison word set which meet a second preset condition,
the second preset condition includes:
being among the first preset number of items when the words are sorted by number of occurrences in descending order;
and the second output module is used for outputting hot spot change information based on the words meeting the second preset condition and the required word set.
In a third aspect, the present application provides an electronic device, which adopts the following technical solutions:
an electronic device, comprising:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to perform the method for extracting words from work order data according to any one of the possible implementations of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, which adopts the following technical solutions:
a computer-readable storage medium storing a computer program that can be loaded by a processor to execute the method for extracting words from work order data according to any one of the possible implementations of the first aspect.
In summary, the present application includes at least one of the following beneficial technical effects:
1. The work order text information is the content published by a user. Because the work order text information is composed of words, it must first be processed to obtain the Chinese character combinations to be analyzed, which include the words. The information entropy value and the solidification value of each Chinese character combination are then calculated. Whether a Chinese character combination is a word depends on its information entropy value and its solidification value, so all Chinese character combinations that are words, namely the initial word set, are determined from these two values. The initial word set is then screened to filter out words the processing personnel do not need, which makes it convenient to obtain the key words they do need;
2. After the initial word set is determined, the stop words in it are filtered out according to the stop word bank, which reduces the data volume. Words that hit the synonym library are replaced with their synonyms so that the wording is more standardized. Words that hit the preset white list are retained; these are the words the processing personnel need. Filtering the initial word set layer by layer through the stop word bank, the synonym library, and the preset white list finally yields the words the processing personnel need.
Drawings
FIG. 1 is a flowchart of a method for extracting words from work order data according to an embodiment of the present application.
FIG. 2 is an exemplary diagram of step S110 in the embodiment of the present application.
FIG. 3 is a schematic structural diagram of a device for extracting words from work order data according to an embodiment of the present application.
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to the attached drawings.
A person skilled in the art may, after reading the present specification, make modifications to the embodiments as needed without inventive contribution, and such modifications are protected by patent law as long as they fall within the scope of the claims of the present application.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In addition, the term "and/or" herein merely describes an association relationship between associated objects and means that three relationships are possible. For example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship, unless otherwise specified.
The embodiments of the present application will be described in further detail with reference to the drawings attached hereto.
The embodiment of the application provides a method for extracting words from work order data, which is applicable to various application scenarios, for example, extracting key words from government affair information uploaded by members of the public. The following embodiments take extracting key words from government affair information as an example, but the application is not limited to this scenario; details are given in the following embodiments.
The embodiment of the application provides a method for extracting words from work order data, which is executed by an electronic device. The electronic device may be a server or a terminal device. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The terminal device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer. The terminal device and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in the embodiment of the present application. As shown in FIG. 1, the method includes step S101, step S102, step S103, step S104, and step S105, wherein:
S101, acquiring work order text information.
For the embodiment of the application, members of the public can use a terminal device to upload work order text information through a government affair platform. The government affair platform may be a web page, application software, or any other form of platform through which users can upload work order text information, which is not limited herein. The work order text information includes government affair information uploaded by a user, for example: "Illegal buildings have existed in a certain place for years; is there a removal plan?" After acquiring the work order text information, the electronic device may store it in a cloud server or in a storage device of the electronic device. When the work order text information needs to be processed, the stored work order text information is retrieved.
S102, processing the work order text information to obtain at least one Chinese character combination to be analyzed.
For the embodiment of the application, the work order text information uploaded by a user usually consists of sentences, and each sentence consists of individual Chinese characters. A Chinese character combination formed by a Chinese character and its adjacent Chinese characters may be a meaningful word or a meaningless combination; that is, words are a subset of Chinese character combinations. The work order text information is therefore processed to obtain at least one Chinese character combination to be analyzed, from which words can subsequently be obtained. Taking the example in step S101, "Illegal buildings have existed in a certain place for years; is there a removal plan?", the Chinese character combinations to be analyzed may be "somewhere", "illegal building", and so on.
S103, calculating a corresponding coagulation value and a corresponding information entropy value of each Chinese character combination to be analyzed.
Wherein the information entropy value comprises at least one of a left information entropy value and a right information entropy value.
For the embodiments of the present application, the solidity value and the information entropy value are generally used to characterize whether a Chinese character combination is a word. The solidity value represents how closely the characters in a Chinese character combination are bound together: the closer the binding, the more likely the combination is a word. Taking a three-character combination ABC as an example, the solidity value is calculated as follows:
$$\text{solidity}(ABC) = \min\left(\frac{p(ABC)}{p(A)\,p(BC)},\ \frac{p(ABC)}{p(AB)\,p(C)}\right)$$
In the solidity value formula, p(ABC) is the probability of the Chinese character combination ABC appearing in all the work order text information, p(A) is the probability of the Chinese character A appearing among all single Chinese characters, p(C) is the probability of the Chinese character C appearing among all single Chinese characters, p(AB) is the probability of the Chinese character combination AB appearing in all the work order text information, and p(BC) is the probability of the Chinese character combination BC appearing in all the work order text information.
Suppose that there are two Chinese character combinations, "has public rental" and "public rental housing", where "has public rental" occurs 389 times and "public rental housing" occurs 175 times in the work order text information. Intuitively, "public rental housing" is more likely to be a word than "has public rental". Assume that the work order text information stored locally or in the cloud server contains 24 million Chinese characters in total, in which "has public rental" occurs 389 times, "has" occurs 398,400 times, "public rental" occurs 2774 times, "has public" occurs 1000 times, "rental" occurs 6000 times, and "public" occurs 3000 times. The solidity value of "has public rental" is calculated as follows:
$$\min\left(\frac{p(ABC)}{p(A)\,p(BC)},\ \frac{p(ABC)}{p(AB)\,p(C)}\right) \approx \min(8.5,\ 1553)$$
The minimum of 8.5 and 1553, namely 8.5, is taken as the solidity value of "has public rental"; that is, the solidity value of "has public rental" is 8.5.
Suppose that the Chinese character combination ABC is "public rental housing", where "public" appears 3000 times, "rental housing" appears 4000 times, and "housing" appears 4797 times. The solidity value of "public rental housing" is calculated as follows:
$$\min\left(\frac{p(ABC)}{p(A)\,p(BC)},\ \frac{p(ABC)}{p(AB)\,p(C)}\right) \approx \min(349,\ 318)$$
The minimum of 349 and 318, namely 318, is taken as the solidity value of "public rental housing"; that is, the solidity value of "public rental housing" is 318.
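The two calculations above can be reproduced with a short sketch. The function name `solidity` and its parameter names are illustrative assumptions and do not come from the application. Each probability is estimated as a raw count divided by the corpus size of 24,000,000 characters given in the example, so the results differ slightly from the rounded figures quoted in the text.

```python
TOTAL_CHARS = 24_000_000  # corpus size stated in the worked example


def solidity(count_abc, count_a, count_bc, count_ab, count_c, total=TOTAL_CHARS):
    """Solidity of a three-character combination ABC (illustrative sketch).

    Returns the smaller of p(ABC) / (p(A) * p(BC)) and p(ABC) / (p(AB) * p(C)),
    where every probability is estimated as count / total.
    """
    p_abc = count_abc / total
    split_a_bc = p_abc / ((count_a / total) * (count_bc / total))
    split_ab_c = p_abc / ((count_ab / total) * (count_c / total))
    return min(split_a_bc, split_ab_c)


# First example (the combination that occurs 389 times):
# ABC=389, A=398400, BC=2774, AB=1000, C=6000
weak = solidity(389, 398_400, 2_774, 1_000, 6_000)

# Second example (the combination that occurs 175 times):
# ABC=175, A=3000, BC=4000, AB=2774, C=4797
strong = solidity(175, 3_000, 4_000, 2_774, 4_797)

print(round(weak, 1), round(strong, 1))
```

With these counts the first combination scores in the single digits and the second in the low hundreds, which matches the ordering described in the text.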
To show that the internal solidity of "public rental housing" is indeed high, suppose that "public rental" and "housing" each appeared independently and at random in the text. "Public rental" appears 2774 times in the 24 million characters of data, so its probability of occurrence is about 0.000116. "Housing" appears 4797 times, so its probability of occurrence is about 0.000199. If "public rental" and "housing" were unrelated, the probability of the two appearing together would be 0.000116 × 0.000199 ≈ 0.0000000231. However, "public rental housing" appears 175 times in the corpus containing all the work order text information, a probability of about 0.00000729, which is about 318 times the predicted value.
Similarly, "has" occurs 398,400 times, a probability of about 0.0166, so the probability of "has" and "public rental" being combined at random is 0.0166 × 0.000116 ≈ 0.00000193. This is much closer to the true probability of "has public rental", about 0.0000162, which is only 8.5 times the predicted value. The calculation shows that "public rental housing" is more likely to be a meaningful collocation, whereas "has public rental" looks more like the two words "has" and "public rental" pieced together by accident.
According to the solidity value formula, the result for "has public rental" is lower than that for "public rental housing", so "public rental housing" is more likely to be a word.
The information entropy value measures how random the set of left-adjacent characters and/or the set of right-adjacent characters of a Chinese character combination is. A left-adjacent character together with the Chinese character combination forms a left continuation word of the combination, and a right-adjacent character together with the Chinese character combination forms a right continuation word of the combination.
$$H = -\sum_{w} p(w)\,\log p(w)$$
In the information entropy formula, p(w) is the probability of occurrence of a left continuation word or a right continuation word in all the work order text information. That is, a reasonable word should be highly adaptable and able to appear with many different left-adjacent or right-adjacent characters.
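A minimal sketch of the left and right information entropy follows. It makes two assumptions that the application does not state: p(w) is estimated as the relative frequency of each neighboring character among all occurrences of the combination, and the natural logarithm is used. The function names and the Latin-letter test string are placeholders.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Shannon entropy (natural log) of the characters seen next to a combination."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def left_right_entropy(text, combo):
    """Return (left entropy, right entropy) of combo over all its occurrences in text."""
    left, right = [], []
    start = text.find(combo)
    while start != -1:
        if start > 0:                      # an occurrence at position 0 has no left neighbor
            left.append(text[start - 1])
        end = start + len(combo)
        if end < len(text):                # an occurrence at the very end has no right neighbor
            right.append(text[end])
        start = text.find(combo, start + 1)
    return neighbor_entropy(left), neighbor_entropy(right)


# Placeholder text: "ab" is preceded by x, z, q and followed by y, y, w.
left_h, right_h = left_right_entropy("xabyzabyqabw", "ab")
```

In this placeholder text the left neighbors are all different while one right neighbor repeats, so the left entropy is higher than the right entropy.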
In the embodiment of the application, the information entropy value of a Chinese character combination may be calculated first and its solidity value afterwards; or the solidity value may be calculated first and the information entropy value afterwards; or the two values may be calculated at the same time. The order in which the information entropy value and the solidity value are calculated is not limited herein.
S104, determining an initial word set based on the information entropy value and the solidity value corresponding to each Chinese character combination to be analyzed.
For the embodiment of the application, whether a Chinese character combination is a word is determined from both its information entropy value and its solidity value. After the information entropy value and the solidity value of each Chinese character combination to be analyzed are obtained, whether each combination is a word is determined from these two values, and the words among all the Chinese character combinations to be analyzed form the initial word set.
S105, screening the initial word set to obtain a required word set.
In the embodiment of the application, the initial word set simply collects all the words in the work order text information. Among them are the words that best reflect the key information in the work order, so the initial word set is screened to obtain a required word set consisting of these key words. Screening the required word set out of the initial word set lets the processing personnel obtain the key words in the work order and understand user demands more easily.
In a possible implementation manner of the embodiment of the present application, the processing of the work order text information in step S102 to obtain at least one chinese character combination to be analyzed specifically includes step S1021 (not shown in the figure), step S1022 (not shown in the figure), step S1023 (not shown in the figure), and step S1024 (not shown in the figure), wherein,
S1021, filtering out non-Chinese character information in the work order text information to obtain a character sequence.
For the embodiment of the application, the non-Chinese character information in the work order text information includes English letters, Arabic numerals, punctuation marks, and the like. Non-Chinese character information occupies character positions, and the electronic equipment cannot by itself tell it apart from Chinese characters in the work order text information, so the Chinese character combinations obtained are affected.
Taking the work order text in step S101 as an example, if the punctuation marks in the text information are not filtered out, Chinese character combinations that contain a punctuation mark, such as "years," and ", is", may be obtained. A combination that contains non-Chinese character information is meaningless, so the non-Chinese character information needs to be filtered out.
Filtering out the non-Chinese character information yields a character sequence and makes the word segmentation result more accurate. The non-Chinese character information may be filtered out by a regular expression, after which a continuous character sequence is obtained from the work order text information. When this continuous character sequence is scanned, every Chinese character combination obtained contains only Chinese characters and no non-Chinese character information, so the combinations are more accurate.
Taking step S101 as an example, filtering the non-Chinese character information out of "An illegal building has existed at a certain place for many years; is there a removal plan?" yields a continuous character sequence that contains the same Chinese characters with the punctuation marks removed.
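The regular-expression filtering described above might look like the following sketch. It assumes that "Chinese character" can be approximated by the CJK Unified Ideographs block (U+4E00 to U+9FFF); the function name and the sample string are hypothetical and are not taken from the application.

```python
import re

# Assumption: a Chinese character is any code point in U+4E00..U+9FFF.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def to_character_sequence(text):
    """Remove letters, digits, punctuation and whitespace, keeping only Chinese characters."""
    return NON_CHINESE.sub("", text)


# Hypothetical work order text containing letters, digits and punctuation.
sequence = to_character_sequence("某处违建,ABC 123。拆除?")
```

Rare characters outside this block (for example in the CJK extension blocks) would be dropped by this pattern, so a production filter may need a wider range.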
S1022, scanning the character sequence forward with a preset window length and a preset step length to obtain at least one Chinese character combination.
For the embodiment of the application, the preset window length may be two, three, or four Chinese characters, and the preset step length is usually one Chinese character. Scanning the character sequence obtained in step S1021 forward with a window length of two characters and a step length of one character yields combinations such as "a certain character", "a certain position", "a position", "violation of the character", "chapter establishment", "building", and the like. Scanning forward with a window length of three Chinese characters and a step length of one Chinese character yields combinations such as "a certain position", "violation of position", and "violation". If the window length is too long, for example five Chinese characters, a combination of length five is likely to be made up of two words, so the combination obtained has no reference value.
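The forward scan can be sketched as a sliding window over the character sequence. The function name `scan_combinations` and its default arguments (window lengths of two to four characters, step of one) are illustrative choices based on the values mentioned above.

```python
def scan_combinations(sequence, window_lengths=(2, 3, 4), step=1):
    """Slide a window of each length over the sequence and collect every combination."""
    combos = []
    for length in window_lengths:
        # The last valid start index is len(sequence) - length.
        for start in range(0, len(sequence) - length + 1, step):
            combos.append(sequence[start:start + length])
    return combos


# Latin letters stand in for Chinese characters here.
example = scan_combinations("ABCD", window_lengths=(2, 3))
```

Repeated combinations are kept in the returned list on purpose, so that their occurrences can be counted in the next step.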
S1023, judging whether the number of occurrences of each Chinese character combination reaches a first preset number threshold.
For the embodiment of the application, assume that the first preset number threshold is 2. After at least one Chinese character combination is obtained, the number of times each combination appears in the whole work order text information is counted, and combinations that do not reach the first preset number threshold of 2 are filtered out. If a combination appears too rarely, it very likely has no practical meaning; filtering out such combinations reduces the amount of calculation and improves calculation efficiency. After the rarely occurring combinations are filtered out, the remaining combinations are the Chinese character combinations to be analyzed.
In the embodiment of the present application, before step S1023, words with special attributes, such as person names, place names, street names, and company names, may also be identified through a named entity recognition model. The named entity recognition model may be a neural network model, such as a convolutional neural network or a recurrent neural network, which is not limited herein. At least one Chinese character combination or the work order text information is input into the trained named entity recognition model to recognize words with special attributes. Such words may occur fewer times than the first preset number threshold but still carry important information in the work order text information, so they are retained, which improves word segmentation accuracy.
S1024, determining the Chinese character combinations whose number of occurrences reaches the first preset number threshold as the Chinese character combinations to be analyzed.
For the embodiment of the application, if the number of occurrences of a Chinese character combination reaches the first preset number threshold, the combination did not occur by accident and is therefore likely to have practical meaning. Determining such combinations as the Chinese character combinations to be analyzed makes the selection more accurate, so the required words screened out subsequently are also more accurate.
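Steps S1023 and S1024 amount to counting and thresholding, as in the sketch below. The `keep` parameter is an assumption added here to model the special-attribute words that a named entity recognition model would protect from filtering; the model itself is not shown.

```python
from collections import Counter


def filter_by_frequency(combos, min_count=2, keep=frozenset()):
    """Return {combination: count} for combinations that reach min_count.

    Combinations listed in `keep` (for example names recognized as entities)
    are retained even when they occur fewer than min_count times.
    """
    counts = Counter(combos)
    return {c: n for c, n in counts.items() if n >= min_count or c in keep}


# Latin letters stand in for Chinese character combinations.
to_analyze = filter_by_frequency(["ab", "ab", "cd", "ef"], min_count=2, keep={"ef"})
```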
In a possible implementation manner of the embodiment of the present application, the determining the initial word set based on the information entropy and the solidity value corresponding to each combination of chinese characters to be analyzed in step S104 specifically includes step S1041 (not shown in the figure), step S1042 (not shown in the figure), step S1043 (not shown in the figure), step S1044 (not shown in the figure), and step S1045 (not shown in the figure), wherein,
S1041, determining an information entropy threshold interval corresponding to each Chinese character combination to be analyzed based on the length of each Chinese character combination to be analyzed.
For the embodiment of the application, the information entropy of a Chinese character combination to be analyzed represents the richness of its left-adjacent and right-adjacent characters. The more varied the left-adjacent and right-adjacent characters of a combination are, the more flexibly the combination can appear in different contexts, and the more likely it is to be a word. A lower bound of the information entropy threshold is therefore set: if the richness of the adjacent characters of a combination is too low, its information entropy value is low and it is unlikely to be a word. The first Chinese character combination obtained by scanning the character sequence has no left information entropy value, and the last combination has no right information entropy value. The likelihood that the first combination is a word can therefore be judged only by whether its right information entropy lies within the corresponding interval, and the likelihood that the last combination is a word only by whether its left information entropy lies within the corresponding interval. All other Chinese character combinations have both a left and a right information entropy value.
If a Chinese character combination to be analyzed has too many different left-adjacent and right-adjacent characters, it may also fail to be a word. For example, the combination "one" can appear in "ate one time", "looked one time", "slept one night", "went one time", and the like, yet "one" is not a word, so an upper bound of the information entropy threshold also needs to be set. The length of a Chinese character combination is negatively correlated with the richness of its adjacent characters; that is, the longer a word is, the fewer distinct left-adjacent and right-adjacent characters it is likely to have. Different information entropy threshold intervals are therefore determined according to the length of the combination, which makes the word determination result more accurate.
S1042, judging whether the information entropy value of each Chinese character combination to be analyzed is within the corresponding information entropy threshold interval.
For the embodiment of the application, after the information entropy threshold interval corresponding to each Chinese character combination is determined, whether the information entropy value of the combination lies within that interval is judged. If it does, the combination is likely to be a word. If only the left information entropy or only the right information entropy lies within the interval, the combination cannot be regarded as a word. Take "white skin" and "skin book", fragments of the word "white paper" (literally "white", "skin", "book"), as examples: if the richness of adjacent characters on either side is low, the fragment cannot stand alone as a word; otherwise the three characters "white", "skin", and "book" would be split into the two words "white skin" and "book". Therefore a Chinese character combination is retained only when both its left information entropy and its right information entropy lie within the information entropy threshold interval.
S1043, generating a first word set based on the Chinese character combinations whose information entropy values lie within the corresponding information entropy threshold intervals.
For the embodiment of the application, judging whether the information entropy value of each Chinese character combination lies within the corresponding information entropy threshold interval filters out part of the combinations that are not words. The combinations whose information entropy values lie within the interval are more likely to be words, so they are collected into the first word set for subsequent processing.
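Steps S1041 to S1043 can be sketched as an interval check per combination length. The interval values in `ENTROPY_INTERVALS` are purely hypothetical, since the application gives no concrete bounds, and a missing side (the first or last combination of the sequence) is modeled here as `None`.

```python
# Hypothetical (lower, upper) entropy bounds per combination length.
# Longer combinations get lower bounds, reflecting the text's observation
# that longer words tend to have fewer distinct neighbors.
ENTROPY_INTERVALS = {2: (1.0, 5.0), 3: (0.8, 4.5), 4: (0.6, 4.0)}


def in_interval(value, length):
    """True if the entropy value lies within the interval for this length."""
    low, high = ENTROPY_INTERVALS[length]
    return low <= value <= high


def first_word_set(entropies):
    """entropies maps combination -> (left_entropy, right_entropy).

    A side that does not exist is given as None and is skipped; every side
    that does exist must lie within the interval.
    """
    kept = set()
    for combo, (left, right) in entropies.items():
        sides = [v for v in (left, right) if v is not None]
        if sides and all(in_interval(v, len(combo)) for v in sides):
            kept.add(combo)
    return kept


# Latin letters stand in for Chinese character combinations.
first_set = first_word_set({
    "ab": (1.5, 2.0),    # both sides inside the interval: kept
    "cd": (0.2, 2.0),    # left side too low: dropped
    "efg": (None, 1.0),  # first combination, judged on the right side only: kept
    "hi": (6.0, 2.0),    # left side too high: dropped
})
```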
S1044, comparing the solidity value of each Chinese character combination to be analyzed in the first word set with a preset solidity threshold.
For the embodiment of the present application, taking "has public rental" and "public rental housing" in step S103 as an example, the solidity value represents how firmly the characters inside each of the two Chinese character combinations are bound together. If a combination is as fixed as an idiom, its solidity value is high and it is a single word. The example in step S103 shows that the solidity value of "public rental housing" is higher than that of "has public rental". Assuming that the preset solidity threshold is 50, "public rental housing" reaches the preset solidity threshold and "has public rental" does not, so "public rental housing" is a word and "has public rental" is not. The preset solidity threshold thus filters out Chinese character combinations whose characters are only loosely bound.
S1045, generating the initial word set based on the Chinese character combinations whose solidity values are greater than the preset solidity threshold.
For the embodiment of the application, the electronic equipment further filters out the Chinese character combinations that are not words according to the preset solidity threshold, thereby obtaining all the words in the work order text information. All the words in the work order text information constitute the initial word set.
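A sketch of steps S1044 and S1045, using the example threshold of 50 and the example solidity values 318 and 8.5 from the text. The function name, the dictionary layout, and the short keys `"PRH"` and `"HPR"` (standing in for the two example combinations) are assumptions for illustration.

```python
def initial_word_set(first_set, solidity_values, threshold=50):
    """Keep combinations from the first word set whose solidity exceeds the threshold."""
    return {w for w in first_set if solidity_values.get(w, 0) > threshold}


# "PRH" stands in for the 318-solidity example, "HPR" for the 8.5-solidity example.
words = initial_word_set({"PRH", "HPR"}, {"PRH": 318, "HPR": 8.5})
```

Combinations with no recorded solidity value are treated as below the threshold in this sketch.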
In a possible implementation manner of the embodiment of the present application, the step S105 of screening the initial term set to obtain a required term set specifically includes a step S1051 (not shown in the figure), a step S1052 (not shown in the figure), and a step S1053 (not shown in the figure), wherein,
S1051, filtering out the stop words in the initial word set based on a preset stop word bank to obtain a second word set that contains no stop words.
The stop words are words that hit the preset stop word bank.
For the embodiment of the present application, the stop word bank contains stop words, for example modal particles such as "oh", "la", and "ne", auxiliary words, and other words such as sensitive words expressing abuse, cursing, or slander. After the initial word set is obtained, the stop word bank is used to judge whether the initial word set contains stop words, and the stop words are filtered out according to the stop word bank to obtain the second word set. Filtering the stop words out of the initial word set further reduces useless words.
If a new stop word appears, the processing personnel can store the newly appeared stop word into the preset stop word bank, so that the aim of updating the preset stop word bank is fulfilled.
S1052, performing synonym replacement on the words in the second word set that hit a preset synonym library, based on the preset synonym library, to obtain a third word set.
For the embodiments of the present application, the words in the second word set are replaced through the preset synonym library so that the final words better conform to official government terminology. For example, "illegal construction" and "private construction" are synonyms of "illegal building". Through the preset synonym library, the synonyms of "illegal building" are found and replaced with "illegal building", so that the words finally obtained are more standard and conform to government terminology. Synonym replacement is performed on the second word set to obtain the third word set.
In other embodiments, after obtaining the third word set, the electronic equipment may further match each word in the third word set with the work order text information in which it appears, obtaining a correspondence between words and work order text information. Through this correspondence the processing personnel can easily find the original work order corresponding to a word and thus understand the user's appeal in more detail.
If any term has a new corresponding synonym, or a new term and a synonym corresponding to the new term, the processing personnel can store the new synonym of the term, or the new term and the synonym corresponding to the new term into the preset synonym library, so that the purpose of updating the preset synonym library is achieved.
S1053, determining the words in the third word set that hit a preset white list as required words, and generating the required word set based on the required words.
For the embodiment of the application, the work order text information is ultimately handled by the corresponding processing personnel, and one processing person may handle work orders in only one field. For example, one preset white list contains only words related to illegal buildings, and another preset white list contains only words related to road traffic. Suppose that a processing person is responsible only for work orders in the field of illegal buildings; the electronic equipment then filters the words in the third word set through the preset white list corresponding to illegal buildings. Assuming that the third word set contains 100 words and that 20 of them hit the preset white list corresponding to illegal buildings, those 20 words are the words required by the processing person. Thus, so that processing personnel obtain only the work orders in the field they are responsible for, the words after synonym replacement are filtered through the preset white list to obtain the words belonging to that field. The words that hit the preset white list form the word set required by the processing personnel, namely the required word set.
If a new word belonging to the preset white list appears, the processing personnel can store the new word belonging to the preset white list into the preset white list, so that the aim of updating the preset white list is fulfilled.
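The three screening steps S1051 to S1053 can be chained as in the sketch below. The data structures are assumptions: the stop word bank and white list are modeled as sets, and the synonym library as a dictionary from a variant to its standard term. The English phrases stand in for the Chinese words.

```python
def screen_words(initial_words, stop_words, synonym_bank, white_list):
    """Return (second, third, required) word lists for steps S1051 to S1053."""
    # S1051: drop every word found in the stop word bank.
    second = [w for w in initial_words if w not in stop_words]
    # S1052: replace each variant with its standard term, if one is registered.
    third = [synonym_bank.get(w, w) for w in second]
    # S1053: keep only the words on the field-specific white list.
    required = [w for w in third if w in white_list]
    return second, third, required


second, third, required = screen_words(
    initial_words=["ah", "private construction", "demolition", "traffic jam"],
    stop_words={"ah"},
    synonym_bank={"private construction": "illegal building"},
    white_list={"illegal building", "demolition"},
)
```

Because each library is an ordinary set or dictionary here, updating it when a new stop word, synonym, or white-list word appears is a single `add` or item assignment.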
In a possible implementation manner of the embodiment of the present application, step S105 further includes step S106 (not shown in the figure) and step S107 (not shown in the figure), wherein,
S106, sorting the required words in the required word set by number of occurrences.
For example, among the words that hit the preset white list corresponding to illegal buildings, "illegal building" appears 10 times, "obstruction" appears 5 times, "city appearance" appears 8 times, and "demolition" appears 12 times. The electronic equipment sorts these words by number of occurrences, either in descending or in ascending order. Taking descending order as an example, the words that hit the white list corresponding to illegal buildings are, in turn, "demolition", "illegal building", "city appearance", and "obstruction".
S107, outputting the required words in the required word set according to the sorting result.
For the embodiment of the present application, taking the sorting result in step S106 as an example, the electronic equipment may output the words in the required word set according to the sorting result by controlling a display device such as a display screen or a touch screen to display them, by controlling a speaker device to broadcast them, by generating a text document of the words, or in other ways, which is not limited herein. Outputting the words according to the sorting result lets the processing personnel see more intuitively which topics users care about most.
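The sorting in step S106 can be sketched with the occurrence counts from the example above. The function name `rank_words` and the dictionary input are illustrative assumptions; the output step is reduced to returning a list.

```python
def rank_words(counts, descending=True):
    """Return the words ordered by their number of occurrences."""
    return sorted(counts, key=counts.get, reverse=descending)


ranking = rank_words({
    "illegal building": 10,
    "obstruction": 5,
    "city appearance": 8,
    "demolition": 12,
})
```

Passing `descending=False` gives the ascending order that the text also allows.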
In a possible implementation manner of the embodiment of the present application, the method further includes step S108 (not shown in the figure) and step S109 (not shown in the figure), wherein step S108 may be executed after step S1052, wherein,
S108, searching for the selected words in the third word set based on the correspondence.
The correspondence is the correspondence between each word in the third word set and the work order text information in which it appears, and it is established after the third word set is obtained. A selected word is a word that appears in the work order text information corresponding to a required word but does not hit the preset white list.
For the embodiment of the application, taking the illegal building field as an example, suppose that a certain required word belongs to the illegal building field. The work order text information corresponding to that required word may contain a new word that also belongs to the illegal building field but is not yet in the preset white list corresponding to illegal buildings. Such new words are added to the preset white list of the illegal building field, so that the required words filtered through the preset white list are more accurate.
Because all the words obtained by segmenting the work order text information are in the third word set, the words other than the required words that belong to the same work order text information can be found according to the correspondence. These are the selected words, that is, the words that were left out of the required word set. The selected words are then analyzed to determine whether any of them is a new word belonging to the illegal building field.
In other embodiments, after the electronic equipment outputs the required word set, the processing personnel may select a required word by mouse, touch screen click, physical button, or the like. After detecting that the processing personnel have selected a required word, the electronic equipment can jump to the corresponding work order information according to the correspondence of that required word, so that the processing personnel can conveniently check the original work order text information.
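One way to hold the correspondence and look up the selected words is sketched below. The layout (a dictionary from a work order identifier to the third-set words found in that work order) and both function names are assumptions; the application does not specify a data structure.

```python
from collections import defaultdict


def build_correspondence(work_order_words):
    """Map each word to the set of work order ids in which it appears."""
    word_to_orders = defaultdict(set)
    for order_id, words in work_order_words.items():
        for word in words:
            word_to_orders[word].add(order_id)
    return word_to_orders


def find_selected_words(required_word, work_order_words, word_to_orders, white_list):
    """Words that share a work order with required_word but miss the white list."""
    selected = set()
    for order_id in word_to_orders.get(required_word, ()):
        selected.update(w for w in work_order_words[order_id] if w not in white_list)
    return selected


orders = {
    "work order 1": ["demolition", "forced demolition"],
    "work order 2": ["traffic jam"],
}
mapping = build_correspondence(orders)
selected = find_selected_words("demolition", orders, mapping, white_list={"demolition"})
```

The same `mapping` also supports the jump from a required word back to its original work orders described above.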
S109, if any selected word meets a first preset condition, storing the selected word in the preset white list.
Wherein the first preset condition comprises at least one of the following:
the number of occurrences of the selected word in the corresponding work order text information reaches a second preset number threshold;
the number of times the selected word has hit historical third word sets reaches a third preset number threshold.
For the embodiment of the application, the word of the preset white list corresponding to the illegal building hit in the work order 1 is assumed to be 'demolished', the electronic equipment finds the work order text information corresponding to 'demolition' as 'work order 1' according to the corresponding relation, and further finds the selection words in 'work order 1'. Assuming that a selected word in the work order 1 is ' forcible entry ', whether the forcible entry ' belongs to a new word in the illegal building field is determined by judging whether the forcible entry meets a preset condition.
Suppose that the "forcible removal" occurs 3 times in the "work order 1" and the second preset number threshold is 2 times. The electronic equipment compares 3 times with 2 times to obtain that the occurrence frequency of 'forced demolition' is more than twice, so that the 'forced demolition' is not a word which occurs accidentally but a new word which also belongs to the field of illegal buildings, and the electronic equipment stores the 'forced demolition' into a preset white list corresponding to the illegal buildings, so that the preset white list is accumulated and updated, and omission is not easy to occur when required words are determined subsequently.
Assuming that the third preset time threshold is 3 times, the electronic device searches the times of 'forced demolition' in the third word set according to 'forced demolition', if the times of 'forced demolition' reach three times, the 'forced demolition' is proved to have more times in the past historical work order and is not a word which happens occasionally, and the electronic device stores the 'forced demolition' in a white list corresponding to the violation building.
The electronic equipment can also judge whether the forcible entry meets the two first preset conditions at the same time, so as to determine whether the forcible entry belongs to a new word in the field of illegal buildings.
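A minimal sketch of step S109 follows, using the thresholds from the example (2 and 3). The function names are hypothetical. It also assumes that "hitting the historical third word sets" is counted as the number of historical third word sets that contain the word, which is one possible reading of the text.

```python
def meets_first_preset_condition(word, order_words, historical_third_word_sets,
                                 second_threshold=2, third_threshold=3,
                                 require_both=False):
    """Check the first preset condition for one selected word."""
    # Condition 1: occurrences inside the word's own work order.
    frequent_in_order = order_words.count(word) >= second_threshold
    # Condition 2 (assumed reading): historical third word sets that contain the word.
    hits = sum(1 for words in historical_third_word_sets if word in words)
    frequent_in_history = hits >= third_threshold
    if require_both:
        return frequent_in_order and frequent_in_history
    return frequent_in_order or frequent_in_history


def update_whitelist(whitelist, selected_words, order_words,
                     historical_third_word_sets, **options):
    """Add qualifying selected words to the white list in place and return the additions."""
    added = {
        word for word in set(selected_words)
        if meets_first_preset_condition(word, order_words,
                                        historical_third_word_sets, **options)
    }
    whitelist.update(added)
    return added


# "forced demolition" occurs 3 times in the work order; there is no history yet.
order_words = ["demolition", "forced demolition", "forced demolition", "forced demolition"]
whitelist = {"demolition"}
added = update_whitelist(whitelist, ["forced demolition"], order_words,
                         historical_third_word_sets=[])
print(added)
```

Passing `require_both=True` corresponds to the variant in which both conditions must hold at the same time.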
In a possible implementation manner of the embodiment of the present application, the method further includes steps S110, S111, S112, and S113 (none of which are shown in the figure), where step S110 may be executed after step S1053:
S110, acquiring a historical word set.
The historical word set is a set of required words determined in a first preset time period in the past.
For the embodiment of the application, the required words obtained in the past first preset time period are stored in the historical word set. The historical word set can be a word set corresponding to one field; for example, illegal buildings correspond to one historical word set and road traffic corresponds to another. The historical word set may also be the set of required words for all fields, which is not limited herein. The first preset time period may be the past half year, the past year, or the like, or may be the period from the determination of the first required word to the current time.
S111, determining a comparison word set from the historical word set.
The comparison word set is the set of required words in the historical word set that fall within a second preset time period.
For the embodiment of the application, the comparison word set is a subset of the words in the historical word set, and its words may be determined according to a second preset time period set by the processing personnel. For example, if the second preset time period is the past week, the electronic device takes the words obtained in the past week from the historical word set to form the comparison word set.
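Selecting the comparison word set from the historical word set can be sketched as a time-window filter. The sketch assumes that each historical required word is stored together with the time at which it was determined; that storage layout and the function name `comparison_word_set` are illustrative assumptions, not details from the application.

```python
from datetime import datetime, timedelta


def comparison_word_set(historical_words, now, second_period=timedelta(days=7)):
    """Return the words whose timestamp lies in [now - second_period, now].

    historical_words: list of (timestamp, word) pairs for previously
    determined required words.
    """
    start = now - second_period
    return [word for timestamp, word in historical_words if start <= timestamp <= now]


# Hypothetical sample data.
now = datetime(2022, 3, 23)
historical_words = [
    (datetime(2022, 1, 10), "demolition"),        # outside the past week
    (datetime(2022, 3, 18), "exist"),
    (datetime(2022, 3, 21), "illegal building"),
]
recent = comparison_word_set(historical_words, now)
print(recent)
```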
S112, determining the words in the comparison word set that meet a second preset condition.
Wherein the second preset condition comprises:
being among the first preset items when sorted by number of occurrences from largest to smallest.
For the embodiment of the application, take as an example a historical word set corresponding to the illegal building field, and assume that among the words appearing in the past week, "exist" appears 50 times, "illegal building" appears 45 times, "illegal building" appears 30 times, and "start" appears 40 times. Assuming that the first preset items are the first two items, the word with the most occurrences is "exist", followed by "illegal building", which indicates that the hotspot users were concerned about in the past week may be "illegal building exists".
S113, outputting hotspot change information based on the words meeting the second preset condition and the required word set.
For the embodiment of the present application, take the words obtained in step S106 as an example, where the word with the most occurrences in the required word set is "demolition". Taking the word "exist", which has the most occurrences in the comparison word set in step S112, as an example, the hotspot has changed from "illegal building exists" a week ago to "demolition". The hotspot change information may be text information stating that the hotspot of concern has changed from "illegal building exists" in the past week to "demolition", may be image information as shown in fig. 2, or may take other forms. The electronic device can control display devices such as a display screen or a touch screen to display the hotspot change text information or image information, can control voice broadcasting devices such as a loudspeaker to play voice information stating that the hotspot of concern has changed from "illegal building exists" in the past week to "demolition", or can output the hotspot change information in other forms, which is not limited herein.
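Steps S112 and S113 can be sketched with a frequency ranking. The function names and the wording of the output text are hypothetical, and the sample counts reuse three of the figures from the example above (50, 45 and 40).

```python
from collections import Counter


def top_words(words, preset_items):
    """Words ranked by number of occurrences, largest first, cut to preset_items."""
    return [word for word, _ in Counter(words).most_common(preset_items)]


def hotspot_change_text(comparison_words, required_words, preset_items=2):
    """Describe the change from the earlier hotspot words to the current top required word."""
    previous = top_words(comparison_words, preset_items)
    current = top_words(required_words, 1)
    if not previous or not current:
        return None  # nothing to compare
    return f"Hotspot changed from {' / '.join(previous)} to {current[0]}"


comparison_words = ["exist"] * 50 + ["illegal building"] * 45 + ["start"] * 40
required_words = ["demolition"] * 3 + ["illegal building"]
message = hotspot_change_text(comparison_words, required_words)
print(message)
```

The returned text is only one possible output form; the same ranked lists could equally feed an image or a voice broadcast.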
The above embodiments describe the method for extracting words from work order data from the perspective of the method flow. The following embodiments describe a device for extracting words from work order data from the perspective of virtual modules or virtual units.
The embodiment of the present application provides a device 20 for extracting words from work order data. As shown in fig. 3, the device 20 for extracting words from work order data may specifically include:
a text acquisition module 201, configured to acquire work order text information;
the text processing module 202 is configured to process the work order text information to obtain at least one Chinese character combination to be analyzed;
the calculation module 203 is configured to calculate a corresponding solidification value and a corresponding information entropy value for each Chinese character combination to be analyzed, where the information entropy value includes at least one of a left information entropy value and a right information entropy value;
a first word determining module 204, configured to determine an initial word set based on the information entropy value and the solidification value corresponding to each Chinese character combination to be analyzed;
and the word screening module 205 is configured to screen the initial word set to obtain a required word set.
For the embodiment of the application, the work order text information is content issued by the user, and the text acquisition module 201 acquires it. Because the work order text information is composed of words, the text processing module 202 first processes the work order text information to obtain the Chinese character combinations to be analyzed, which include the words. Whether a Chinese character combination is a word depends on its own information entropy value and solidification value, so the calculation module 203 calculates the information entropy value and the solidification value corresponding to each Chinese character combination, and the first word determining module 204 then determines, according to these values, all the Chinese character combinations that are words, namely the initial word set. The word screening module 205 then screens the initial word set to obtain the key words required by the processing personnel.
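To make the two quantities concrete, the sketch below uses one common formulation and is not necessarily the formula defined in this application. The left or right information entropy is taken as the Shannon entropy of the characters adjacent to a combination, and the solidification value is taken as the smallest ratio, over all two-part splits, between the combination's probability and the product of its parts' probabilities. Both definitions, and all function names, are assumptions made for illustration.

```python
import math
from collections import Counter


def neighbor_entropy(text, combo, side="left"):
    """Shannon entropy (natural log) of the characters adjacent to each occurrence of combo."""
    neighbors = Counter()
    start = text.find(combo)
    while start != -1:
        index = start - 1 if side == "left" else start + len(combo)
        if 0 <= index < len(text):
            neighbors[text[index]] += 1
        start = text.find(combo, start + 1)
    total = sum(neighbors.values())
    return sum(c / total * math.log(total / c) for c in neighbors.values())


def occurrence_probability(text, combo):
    """Share of the start positions in text at which combo occurs."""
    positions = len(text) - len(combo) + 1
    count = sum(1 for i in range(positions) if text.startswith(combo, i))
    return count / positions


def solidification(text, combo):
    """Minimum over two-part splits of p(combo) / (p(left part) * p(right part)).

    Assumes combo has at least two characters and occurs in text.
    """
    p_combo = occurrence_probability(text, combo)
    return min(
        p_combo / (occurrence_probability(text, combo[:i]) * occurrence_probability(text, combo[i:]))
        for i in range(1, len(combo))
    )


text = "甲违建乙违建"  # toy character sequence
print(neighbor_entropy(text, "违建", "left"), solidification(text, "违建"))
```

A combination with varied neighbors on both sides and a high solidification value behaves like a free-standing word, which is the intuition behind using both values together.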
In a possible implementation manner of the embodiment of the present application, when the text processing module 202 processes the work order text information to obtain at least one to-be-analyzed chinese character combination, the text processing module is specifically configured to:
filtering non-Chinese character information in the text information to obtain a character sequence;
scanning the character sequence forward with a preset window length and a preset step length to obtain at least one Chinese character combination;
judging whether the occurrence frequency of each Chinese character combination reaches a first preset frequency threshold value or not;
and determining the Chinese character combination with the occurrence frequency reaching a first preset frequency threshold value as the Chinese character combination to be analyzed.
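The four operations above can be sketched as a sliding-window count. The function name, the default parameter values, and the use of the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) to decide what counts as a Chinese character are all assumptions of this sketch.

```python
import re
from collections import Counter


def candidate_combinations(text, window_length=2, step=1, min_count=2):
    """Return {combination: count} for combinations that reach min_count."""
    # Drop everything that is not a Chinese character to get the character sequence.
    chars = "".join(re.findall(r"[\u4e00-\u9fff]", text))
    counts = Counter()
    # Scan forward with a fixed window length and step length.
    for start in range(0, len(chars) - window_length + 1, step):
        counts[chars[start:start + window_length]] += 1
    # Keep only combinations whose frequency reaches the first preset threshold.
    return {combo: n for combo, n in counts.items() if n >= min_count}


candidates = candidate_combinations("拆除违建!拆除违建。")
print(candidates)
```

Calling the function once per window length (for example 2, 3 and 4) would yield candidates of several lengths, which the later length-dependent entropy intervals presuppose.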
In a possible implementation manner of the embodiment of the present application, when determining the initial word set based on the information entropy value and the solidification value corresponding to each Chinese character combination to be analyzed, the first word determining module 204 is specifically configured to:
determining an information entropy threshold interval corresponding to each Chinese character combination to be analyzed based on the length of each Chinese character combination to be analyzed;
judging whether the information entropy value of each Chinese character combination to be analyzed is positioned in the corresponding information entropy threshold interval;
generating a first word set based on the Chinese character combination of which the information entropy value is positioned in the corresponding information entropy threshold value interval;
comparing the solidification value of each Chinese character combination to be analyzed in the first word set with a preset solidification threshold value;
and generating an initial word set based on the Chinese character combinations whose solidification value is larger than the preset solidification threshold value.
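The two-stage filtering can be sketched as follows. The interval bounds, the threshold, the sample values, and the choice to treat the entropy interval as closed at both ends are hypothetical; the application only states that the interval depends on the length of the combination.

```python
def initial_word_set(candidates, entropy_intervals, solidification_threshold):
    """Filter by a length-dependent entropy interval, then by solidification value.

    candidates: dict of combination -> (information entropy value, solidification value).
    entropy_intervals: dict of combination length -> (low, high).
    """
    first_word_set = []
    for combo, (entropy, _) in candidates.items():
        interval = entropy_intervals.get(len(combo))
        if interval is not None and interval[0] <= entropy <= interval[1]:
            first_word_set.append(combo)
    # Second stage: keep combinations whose solidification value exceeds the threshold.
    return [c for c in first_word_set if candidates[c][1] > solidification_threshold]


# Hypothetical values for illustration only.
candidates = {
    "违建": (1.2, 40.0),
    "建拆": (0.1, 2.0),     # entropy below the interval for length 2
    "拆除违": (1.5, 3.0),   # entropy fits, solidification value too low
}
entropy_intervals = {2: (0.5, 5.0), 3: (1.0, 5.0)}
words = initial_word_set(candidates, entropy_intervals, solidification_threshold=10.0)
print(words)
```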
In a possible implementation manner of the embodiment of the present application, when the word screening module 205 screens the initial word set to obtain a required word set, the word screening module is specifically configured to:
filtering out the stop words in the initial word set based on a preset stop word bank to obtain a second word set except the stop words, wherein the stop words are words hitting the preset stop word bank;
performing synonym replacement, based on a preset synonym library, on the words in the second word set that hit the preset synonym library to obtain a third word set;
and determining the words in the third word set that hit the preset white list as required words, and generating the required word set based on the required words.
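The screening chain can be sketched in a few lines. The stop word bank is represented as a set, the synonym library as a mapping from a word to its standard form, and the white list as a set; these data structures, the function name, and the sample words are assumptions for illustration.

```python
def screen_initial_words(initial_words, stop_words, synonyms, whitelist):
    """Return (second word set, third word set, required word set) as lists."""
    # Second word set: everything that does not hit the stop word bank.
    second = [w for w in initial_words if w not in stop_words]
    # Third word set: words that hit the synonym library are replaced by their standard form.
    third = [synonyms.get(w, w) for w in second]
    # Required words: words that hit the white list.
    required = [w for w in third if w in whitelist]
    return second, third, required


# Hypothetical sample data.
initial_words = ["the", "tear down", "illegal building", "forced demolition"]
second, third, required = screen_initial_words(
    initial_words,
    stop_words={"the"},
    synonyms={"tear down": "demolition"},
    whitelist={"demolition", "illegal building"},
)
print(required)
```

Words of the third word set that miss the white list, such as "forced demolition" here, are the ones later examined as candidates for extending the white list.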
In a possible implementation manner of the embodiment of the present application, the apparatus 20 further includes:
a sorting module, configured to sort the required words in the required word set by number of occurrences;
and a first output module, configured to output the required words in the required word set according to the sorting result.
In a possible implementation manner of the embodiment of the present application, the apparatus 20 further includes:
a searching module, configured to search for selected words in the third word set based on a corresponding relation, where the corresponding relation is the relation between each word in the third word set and its corresponding work order text information and is established after the third word set is obtained, and a selected word is a word, in the work order text information corresponding to a required word, that does not hit the preset white list;
a word storage module, configured to store any selected word in the preset white list when that selected word meets a first preset condition;
the first preset condition includes at least one of:
the occurrence frequency of any selected word in the corresponding work order text information reaches a second preset frequency threshold value;
and the number of times that any selected word hits historical third word sets reaches a third preset frequency threshold value.
In a possible implementation manner of the embodiment of the present application, the apparatus 20 further includes:
a historical word set acquisition module, configured to acquire a historical word set, where the historical word set is a set of required words determined in a first preset time period in the past;
a second word determining module, configured to determine a comparison word set from the historical word set, where the comparison word set is a set of required words in the historical word set within a second preset time period;
a third word determining module, configured to determine the words in the comparison word set that meet a second preset condition,
where the second preset condition includes:
being among the first preset items when sorted by number of occurrences from largest to smallest;
and a second output module, configured to output hotspot change information based on the words meeting the second preset condition and the required word set.
In the embodiment of the present application, the first word determining module 204, the second word determining module, and the third word determining module may be the same word determining module, may be different word determining modules, or may be partially the same word determining module. The first output module and the second output module may be the same output module or different output modules.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus 20 for extracting words from work order data described above may refer to the corresponding process in the foregoing method embodiment, and will not be described herein again.
In an embodiment of the present application, an electronic device is provided. As shown in fig. 4, the electronic device 30 includes a processor 301 and a memory 303, where the processor 301 is coupled to the memory 303, for example via a bus 302. Optionally, the electronic device 30 may also include a transceiver 304. It should be noted that, in practical applications, the number of transceivers 304 is not limited to one, and the structure of the electronic device 30 does not constitute a limitation on the embodiment of the present application.
The processor 301 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure of the present application. The processor 301 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 302 may include a path that transfers information between the above components. The bus 302 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 302 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 4, but this does not represent only one bus or one type of bus.
The memory 303 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 303 is used for storing application program code for executing the scheme of the application, and its execution is controlled by the processor 301. The processor 301 is configured to execute the application program code stored in the memory 303 to implement the content shown in the foregoing method embodiments.
The electronic device includes, but is not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (e.g., in-vehicle navigation terminals); fixed terminals such as digital TVs and desktop computers; and servers. The electronic device shown in fig. 4 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
The present application provides a computer-readable storage medium on which a computer program is stored; when the program runs on a computer, the computer can execute the corresponding content in the foregoing method embodiments. Compared with the related art, in the embodiment of the application the work order text information is content issued by the user. Because the work order text information is composed of words, after it is obtained it is first processed to obtain the Chinese character combinations to be analyzed, which include the words. Whether a Chinese character combination is a word depends on its own information entropy value and solidification value, so the information entropy value and the solidification value corresponding to each Chinese character combination are calculated, and all the Chinese character combinations that are words, namely the initial word set, are determined according to these values. The initial word set is then screened, which makes it convenient to obtain the key words required by the processing personnel.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The foregoing is only a part of the embodiments of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements should also be regarded as falling within the protection scope of the present application.

Claims (7)

1. A method for extracting words from work order data is characterized by comprising the following steps:
acquiring work order text information;
processing the work order text information to obtain at least one Chinese character combination to be analyzed;
calculating a corresponding solidification value and a corresponding information entropy value of each Chinese character combination to be analyzed, wherein the information entropy value comprises at least one of a left information entropy value and a right information entropy value;
determining an initial word set based on the information entropy value and the solidification value corresponding to each Chinese character combination to be analyzed;
screening the initial word set to obtain a required word set;
wherein, the determining an initial word set based on the information entropy value and the solidification value corresponding to each Chinese character combination to be analyzed comprises:
determining an information entropy threshold interval corresponding to each Chinese character combination to be analyzed based on the length of each Chinese character combination to be analyzed;
judging whether the information entropy value of each Chinese character combination to be analyzed is positioned in the corresponding information entropy threshold interval;
generating a first word set based on the Chinese character combination of which the information entropy value is positioned in the corresponding information entropy threshold interval;
comparing the solidification value of each Chinese character combination to be analyzed in the first word set with a preset solidification threshold value;
generating an initial word set based on the Chinese character combinations whose solidification value is larger than the preset solidification threshold value;
wherein, the screening the initial word set to obtain the required word set comprises:
filtering out the stop words in the initial word set based on a preset stop word bank to obtain a second word set except the stop words, wherein the stop words are words hitting the preset stop word bank;
performing synonym replacement, based on a preset synonym library, on the words in the second word set that hit the preset synonym library to obtain a third word set;
determining the words in the third word set that hit a preset white list as required words, and generating a required word set based on the required words;
searching for selected words in the third word set based on a corresponding relation, wherein the corresponding relation is the corresponding relation between each word in the third word set and the corresponding work order text information, the corresponding relation is established after the third word set is obtained, and a selected word is a word which does not hit the preset white list in the work order text information corresponding to a required word;
if any selected word meets a first preset condition, storing the selected word to the preset white list;
the first preset condition includes at least one of:
the occurrence frequency of any selected word in the corresponding work order text information reaches a second preset frequency threshold value;
and the number of times that any selected word hits historical third word sets reaches a third preset frequency threshold value.
2. The method for extracting words from work order data according to claim 1, wherein the processing the work order text information to obtain at least one Chinese character combination to be analyzed comprises:
filtering non-Chinese character information in the text information to obtain a character sequence;
scanning the character sequence forward with a preset window length and a preset step length to obtain at least one Chinese character combination;
judging whether the occurrence frequency of each Chinese character combination reaches a first preset frequency threshold value or not;
and determining the Chinese character combination with the occurrence frequency reaching a first preset frequency threshold value as the Chinese character combination to be analyzed.
3. The method of claim 1, wherein the method further comprises:
sorting the required words in the required word set by number of occurrences;
and outputting the required words in the required word set according to the sorting result.
4. The method of claim 1, wherein the method further comprises:
acquiring a historical word set, wherein the historical word set is a set of required words determined in a first preset time period in the past;
determining a comparison word set from the historical word set, wherein the comparison word set is a set of required words in a second preset time period in the historical word set;
determining words in the comparison word set which meet a second preset condition,
the second preset condition includes:
being among the first preset items when sorted by number of occurrences from largest to smallest;
and outputting hotspot change information based on the words meeting the second preset condition and the required word set.
5. A device for extracting words from work order data is characterized by comprising:
the text acquisition module is used for acquiring work order text information;
the text processing module is used for processing the work order text information to obtain at least one Chinese character combination to be analyzed;
the calculation module is used for calculating a corresponding solidification value and a corresponding information entropy value of each Chinese character combination to be analyzed, wherein the information entropy value comprises at least one of a left information entropy value and a right information entropy value;
the word determining module is used for determining an initial word set based on the information entropy value and the solidification value corresponding to each Chinese character combination to be analyzed;
the word screening module is used for screening the initial word set to obtain a required word set;
when determining the initial word set based on the information entropy value and the solidification value corresponding to each Chinese character combination to be analyzed, the word determining module is specifically configured to:
determining an information entropy threshold interval corresponding to each Chinese character combination to be analyzed based on the length of each Chinese character combination to be analyzed;
judging whether the information entropy value of each Chinese character combination to be analyzed is positioned in the corresponding information entropy threshold interval;
generating a first word set based on the Chinese character combination of which the information entropy value is positioned in the corresponding information entropy threshold interval;
comparing the solidification value of each Chinese character combination to be analyzed in the first word set with a preset solidification threshold value;
generating an initial word set based on the Chinese character combinations whose solidification value is larger than the preset solidification threshold value;
wherein, when screening the initial word set to obtain the required word set, the word screening module is specifically configured to:
filtering out the stop words in the initial word set based on a preset stop word bank to obtain a second word set except the stop words, wherein the stop words are words hitting the preset stop word bank;
performing synonym replacement, based on a preset synonym library, on the words in the second word set that hit the preset synonym library to obtain a third word set;
determining the words in the third word set that hit a preset white list as required words, and generating a required word set based on the required words;
the searching module is used for searching for the selected words in the third word set based on a corresponding relation, the corresponding relation is the corresponding relation between each word in the third word set and the corresponding work order text information, the corresponding relation is established after the third word set is obtained, and the selected words are words which do not hit a preset white list in the work order text information corresponding to the required words;
the word storage module is used for storing any selected word to the preset white list when the selected word meets a first preset condition;
the first preset condition includes at least one of:
the occurrence frequency of any one of the selected words in the corresponding work order text information reaches a second preset frequency threshold value;
and the number of times that any selected word hits historical third word sets reaches a third preset frequency threshold value.
6. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the method for extracting words from work order data according to any one of claims 1 to 4.
7. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method for extracting words from work order data according to any one of claims 1 to 4.
CN202210287345.4A 2022-03-23 2022-03-23 Method, device, equipment and storage medium for extracting words from work order data Active CN114385792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210287345.4A CN114385792B (en) 2022-03-23 2022-03-23 Method, device, equipment and storage medium for extracting words from work order data

Publications (2)

Publication Number Publication Date
CN114385792A CN114385792A (en) 2022-04-22
CN114385792B true CN114385792B (en) 2022-06-24

Family

ID=81205167

Country Status (1)

Country Link
CN (1) CN114385792B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108038119A (en) * 2017-11-01 2018-05-15 平安科技(深圳)有限公司 Utilize the method, apparatus and storage medium of new word discovery investment target
CN108052500A (en) * 2017-12-13 2018-05-18 北京数洋智慧科技有限公司 A kind of text key message extracting method and device based on semantic analysis
CN108182174A (en) * 2017-12-27 2018-06-19 掌阅科技股份有限公司 New words extraction method, electronic equipment and computer storage media
CN113157903A (en) * 2020-12-28 2021-07-23 国网浙江省电力有限公司信息通信分公司 Multi-field-oriented electric power word stock construction method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020422B (en) * 2018-11-26 2020-08-04 阿里巴巴集团控股有限公司 Feature word determining method and device and server

Also Published As

Publication number Publication date
CN114385792A (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN106897428B (en) Text classification feature extraction method and text classification method and device
US8095547B2 (en) Method and apparatus for detecting spam user created content
CN110275965B (en) False news detection method, electronic device and computer readable storage medium
WO2017045443A1 (en) Image retrieval method and system
CN108090216B (en) Label prediction method, device and storage medium
CN112446210B (en) User gender prediction method and device and electronic equipment
CN111930962A (en) Document data value evaluation method and device, electronic equipment and storage medium
CN110929145A (en) Public opinion analysis method, public opinion analysis device, computer device and storage medium
CN113779481B (en) Method, device, equipment and storage medium for identifying fraud websites
CN110457672A (en) Keyword determines method, apparatus, electronic equipment and storage medium
CN107085568A (en) A kind of text similarity method of discrimination and device
WO2021135104A1 (en) Multi-source data-based object pushing method and apparatus, device, and storage medium
CN114416998A (en) Text label identification method and device, electronic equipment and storage medium
CN108875050B (en) Text-oriented digital evidence-obtaining analysis method and device and computer readable medium
US8838616B2 (en) Server device for creating list of general words to be excluded from search result
CN110019763B (en) Text filtering method, system, equipment and computer readable storage medium
CN111930949B (en) Search string processing method and device, computer readable medium and electronic equipment
CN114385792B (en) Method, device, equipment and storage medium for extracting words from work order data
CN110618797B (en) Method and device for generating text marquee and terminal equipment
CN109145307B (en) User portrait recognition method, pushing method, device, equipment and storage medium
CN109670183B (en) Text importance calculation method, device, equipment and storage medium
CN110941638B (en) Application classification rule base construction method, application classification method and device
CN113691525A (en) Traffic data processing method, device, equipment and storage medium
CN105787101A (en) Information processing method and electronic equipment
TWI451277B (en) Search tags visualization system and method therefore

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant