CN113268978A - Information generation method and device and electronic equipment - Google Patents

Publication number
CN113268978A
Authority
CN
China
Prior art keywords
dictionary
entry
word segmentation
closeness
primary
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010097793.9A
Other languages
Chinese (zh)
Inventor
方菲
刁节铮
鲁涛
梁颖
王蟒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN202010097793.9A
Publication of CN113268978A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention provides an information generation method, an information generation device, and electronic equipment. The method comprises: generating a primary dictionary; determining the closeness corresponding to each entry in the primary dictionary; and screening the entries in the primary dictionary according to the closeness of each entry to generate a target dictionary. Screening entries by closeness avoids introducing wrong or nonstandard entries into the dictionary, thereby improving the quality of the dictionary.

Description

Information generation method and device and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to an information generating method and apparatus, and an electronic device.
Background
At present, word segmentation is used in many information processing tasks. For example, after receiving a search term, a search system performs word segmentation on the search term and then performs an inverted-index lookup based on the word segmentation result to determine the search results. As another example, in translation, the text to be translated is first segmented into words and then translated based on the segmented text; and so on.
In the prior art, the dictionary-based method is one of the commonly used word segmentation methods: the information to be processed is matched against the entries in a dictionary in order to segment it. The quality of the dictionary therefore affects the quality of the word segmentation. At present, most dictionaries are generated according to the word frequency of entries, the left and right entropy of entry segments, and the like; as a result, wrong or irregular entries are introduced into the dictionaries, and the quality of the dictionaries is low.
Disclosure of Invention
The embodiment of the invention provides an information generation method for generating a high-quality dictionary.
Correspondingly, the embodiment of the invention also provides an information generation device and electronic equipment, which are used for ensuring the realization and application of the method.
In order to solve the above problem, an embodiment of the present invention discloses an information generating method, which specifically includes: generating a primary dictionary; determining the closeness corresponding to each entry in the primary dictionary; and screening the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary.
Optionally, the determining the closeness corresponding to each entry in the primary dictionary includes: for each entry in the primary dictionary, acquiring user search behavior data corresponding to the entry; and calculating the closeness corresponding to the entry according to the user search behavior data corresponding to the entry.
Optionally, the entry includes M word segmentation segments, and the calculating the closeness corresponding to the entry according to the user search behavior data corresponding to the entry includes: dividing the M word segmentation segments in the entry into front M1 word segmentation segments and rear M2 word segmentation segments, wherein M, M1 and M2 are all positive integers, and M is the sum of M1 and M2; determining, according to the user search behavior data corresponding to the entry, target search terms in which the front M1 word segmentation segments and the rear M2 word segmentation segments are adjacent, and target search results clicked on the search result pages of the target search terms; determining a first number of times that target search results with the front M1 word segmentation segments and the rear M2 word segmentation segments in the title are clicked, and a second number of times that target search results with the front M1 word segmentation segments and the rear M2 word segmentation segments in the title are not clicked; and determining the closeness corresponding to the entry according to the first number of times and the second number of times.
Optionally, each entry has at least one closeness value; the screening of the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary comprises: for each entry in the primary dictionary, comparing the minimum closeness corresponding to the entry with a closeness threshold; and removing from the primary dictionary the entries whose minimum closeness is smaller than the closeness threshold to obtain the target dictionary.
Optionally, the filtering the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary, further includes: deleting entries with the length exceeding the preset length in the primary dictionary; and/or deleting stop words in the primary dictionary.
Optionally, the generating a primary dictionary comprises: acquiring page contents of a plurality of web pages; performing word segmentation processing on the page content of each webpage to obtain corresponding word segmentation segments; sequentially combining the continuous M word segmentation segments into a vocabulary entry, and determining the word frequency and mutual information of each vocabulary entry; and generating a primary dictionary by using the entries with the word frequency higher than the first word frequency threshold value and the mutual information higher than the mutual information threshold value.
Optionally, the method further comprises: based on the primary dictionary, performing word segmentation processing on the page content of each webpage by adopting a maximum matching principle to obtain corresponding word segmentation segments; determining word frequency corresponding to word segmentation segments which do not exist in the primary dictionary, and adding word segmentation segments which do not exist in the primary dictionary and have word frequency higher than a second word frequency threshold value into the primary dictionary.
The embodiment of the invention also discloses an information generating device, which specifically comprises: the primary dictionary generating module is used for generating a primary dictionary; the closeness generating module is used for determining the closeness corresponding to each entry in the primary dictionary; and the target dictionary generating module is used for screening the entries in the primary dictionary according to the closeness of the entries in the primary dictionary to generate a target dictionary.
Optionally, the closeness generating module includes: a data acquisition submodule, configured to acquire, for each entry in the primary dictionary, user search behavior data corresponding to the entry; and a closeness calculation submodule, configured to calculate the closeness corresponding to the entry according to the user search behavior data corresponding to the entry.
Optionally, the entry includes M word segmentation segments, and the closeness calculation submodule is configured to: divide the M word segmentation segments in the entry into front M1 word segmentation segments and rear M2 word segmentation segments, where M, M1 and M2 are all positive integers, and M is the sum of M1 and M2; determine, according to the user search behavior data corresponding to the entry, target search terms in which the front M1 word segmentation segments and the rear M2 word segmentation segments are adjacent, and target search results clicked on the search result pages of the target search terms; determine a first number of times that target search results with the front M1 word segmentation segments and the rear M2 word segmentation segments in the title are clicked, and a second number of times that target search results with the front M1 word segmentation segments and the rear M2 word segmentation segments in the title are not clicked; and determine the closeness corresponding to the entry according to the first number of times and the second number of times.
Optionally, each entry has at least one closeness value; the target dictionary generation module includes: a comparison submodule, configured to compare, for each entry in the primary dictionary, the minimum closeness corresponding to the entry with a closeness threshold; and a screening submodule, configured to remove from the primary dictionary the entries whose minimum closeness is smaller than the closeness threshold to obtain the target dictionary.
Optionally, the target dictionary generating module further includes: the deleting submodule is used for deleting entries with the length exceeding the preset length in the primary dictionary; and/or deleting stop words in the primary dictionary.
Optionally, the primary dictionary generating module is specifically configured to acquire page contents of multiple web pages and perform word segmentation on the page contents of each web page to obtain corresponding word segmentation segments; sequentially combining the continuous M word segmentation segments into a vocabulary entry, and determining the word frequency and mutual information of each vocabulary entry; and generating a primary dictionary by using the entries with the word frequency higher than the first word frequency threshold value and the mutual information higher than the mutual information threshold value.
Optionally, the apparatus further comprises: the primary dictionary expansion module is used for performing word segmentation processing on the page content of each webpage by adopting a maximum matching principle based on the primary dictionary to obtain corresponding word segmentation segments; determining word frequency corresponding to word segmentation segments which do not exist in the primary dictionary, and adding word segmentation segments which do not exist in the primary dictionary and have word frequency higher than a second word frequency threshold value into the primary dictionary.
The embodiment of the invention also discloses a readable storage medium, and when the instructions in the storage medium are executed by a processor of the electronic equipment, the electronic equipment can execute the information generation method in any one of the embodiments of the invention.
An embodiment of the present invention also discloses an electronic device, including a memory, and one or more programs, where the one or more programs are stored in the memory, and configured to be executed by one or more processors, and the one or more programs include instructions for: generating a primary dictionary; determining the closeness corresponding to each entry in the primary dictionary; and screening the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary.
Optionally, the determining the closeness corresponding to each entry in the primary dictionary includes: for each entry in the primary dictionary, acquiring user search behavior data corresponding to the entry; and calculating the closeness corresponding to the entry according to the user search behavior data corresponding to the entry.
Optionally, the entry includes M word segmentation segments, and the calculating the closeness corresponding to the entry according to the user search behavior data corresponding to the entry includes: dividing the M word segmentation segments in the entry into front M1 word segmentation segments and rear M2 word segmentation segments, wherein M, M1 and M2 are all positive integers, and M is the sum of M1 and M2; determining, according to the user search behavior data corresponding to the entry, target search terms in which the front M1 word segmentation segments and the rear M2 word segmentation segments are adjacent, and target search results clicked on the search result pages of the target search terms; determining a first number of times that target search results with the front M1 word segmentation segments and the rear M2 word segmentation segments in the title are clicked, and a second number of times that target search results with the front M1 word segmentation segments and the rear M2 word segmentation segments in the title are not clicked; and determining the closeness corresponding to the entry according to the first number of times and the second number of times.
Optionally, each entry has at least one closeness value; the screening of the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary comprises: for each entry in the primary dictionary, comparing the minimum closeness corresponding to the entry with a closeness threshold; and removing from the primary dictionary the entries whose minimum closeness is smaller than the closeness threshold to obtain the target dictionary.
Optionally, the filtering the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary, further includes: deleting entries with the length exceeding the preset length in the primary dictionary; and/or deleting stop words in the primary dictionary.
Optionally, the generating a primary dictionary comprises: acquiring page contents of a plurality of web pages; performing word segmentation processing on the page content of each webpage to obtain corresponding word segmentation segments; sequentially combining the continuous M word segmentation segments into a vocabulary entry, and determining the word frequency and mutual information of each vocabulary entry; and generating a primary dictionary by using the entries with the word frequency higher than the first word frequency threshold value and the mutual information higher than the mutual information threshold value.
Optionally, further comprising instructions for: based on the primary dictionary, performing word segmentation processing on the page content of each webpage by adopting a maximum matching principle to obtain corresponding word segmentation segments; determining word frequency corresponding to word segmentation segments which do not exist in the primary dictionary, and adding word segmentation segments which do not exist in the primary dictionary and have word frequency higher than a second word frequency threshold value into the primary dictionary.
The embodiment of the invention has the following advantages:
In the embodiment of the invention, a primary dictionary can be generated and the closeness corresponding to each entry in the primary dictionary can be determined; the entries in the primary dictionary are then screened according to the closeness of each entry to generate a target dictionary. Screening entries by closeness avoids introducing wrong or nonstandard entries into the dictionary, thereby improving the quality of the dictionary.
Drawings
FIG. 1 is a flow chart of the steps of one embodiment of an information generation method of the present invention;
FIG. 2 is a flow chart of the steps of an alternative embodiment of an information generation method of the present invention;
FIG. 3 is a block diagram of an embodiment of an information generating apparatus according to the present invention;
FIG. 4 is a block diagram of an alternative embodiment of an information generating apparatus of the present invention;
FIG. 5 illustrates a block diagram of an electronic device for information generation in accordance with an exemplary embodiment;
FIG. 6 is a schematic structural diagram of an electronic device for information generation according to another exemplary embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
One of the core ideas of the embodiment of the invention is to screen entries according to the closeness of adjacent segments within each entry in order to generate a dictionary, which avoids introducing wrong or irregular entries into the dictionary and improves the quality of the dictionary.
Referring to FIG. 1, a flowchart illustrating the steps of an embodiment of an information generating method according to the present invention is shown, which may specifically include the following steps:
Step 102, generating a primary dictionary.
In the embodiment of the invention, the corpus of the required field can be acquired as needed, and the primary dictionary is then generated by applying processing such as word segmentation and primary screening (for example, screening according to word frequency) to the acquired corpus. When a dictionary of the whole field needs to be generated, the corpus of the whole field can be acquired; when a dictionary of a specified field needs to be generated, the corpus of the specified field can be acquired.
Step 104, determining the closeness corresponding to each entry in the primary dictionary.
Step 106, screening the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary.
After the primary dictionary is generated, the closeness of each entry in the primary dictionary is calculated, and the entries in the primary dictionary are then screened according to their closeness to generate a target dictionary. The closeness describes how tightly adjacent segments in the entry are bound together: the higher the closeness of an entry, the lower the possibility of splitting the entry into smaller-granularity entries, the lower the possibility that the entry is wrong, and the more standard the entry. Conversely, the lower the closeness of an entry, the higher the possibility of splitting it into smaller-granularity entries, the higher the possibility that the entry is wrong, and the less standard the entry. Therefore, the embodiment of the invention can remove the entries with low closeness from the primary dictionary and retain the entries with high closeness to generate the target dictionary, thereby avoiding introducing wrong or irregular entries into the dictionary and improving the quality of the dictionary.
In summary, in the embodiments of the present invention, a primary dictionary may be generated and the closeness corresponding to each entry in the primary dictionary may be determined; the entries in the primary dictionary are then screened according to the closeness of each entry to generate a target dictionary. Screening entries by closeness avoids introducing wrong or nonstandard entries into the dictionary, thereby improving the quality of the dictionary.
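The screening described above can be illustrated with a minimal sketch. It assumes that the closeness values have already been computed, that an entry may carry several closeness values (one per way of splitting it), and that an entry is retained only when its minimum closeness reaches a threshold. The function name `filter_by_closeness`, the sample entries, the closeness values, and the threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def filter_by_closeness(closeness: Dict[str, List[float]], threshold: float) -> List[str]:
    """Return the entries whose minimum closeness is not below the threshold.

    `closeness` maps each entry of the primary dictionary to its closeness
    values (hypothetical data layout, one value per split of the entry).
    """
    return [
        entry
        for entry, values in closeness.items()
        if values and min(values) >= threshold
    ]


# Hypothetical primary dictionary with made-up closeness values.
primary = {
    "machine learning": [0.9],
    "of the": [0.1],
    "deep neural network": [0.8, 0.7],
}
target = filter_by_closeness(primary, threshold=0.5)
```

Using the minimum means that a single weak split point is enough to exclude an entry.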
Referring to FIG. 2, a flowchart illustrating the steps of an alternative embodiment of the information generating method of the present invention is shown, which may specifically include the following steps:
In the embodiment of the invention, the page contents of a plurality of web pages can be acquired as the corpus for generating the primary dictionary. If a dictionary of the whole field needs to be generated, the page contents of web pages of the whole field can be acquired as the corpus; if a dictionary of a designated field needs to be generated, the page contents of web pages of the designated field can be acquired as the corpus, which is not limited in the embodiment of the present invention. A primary dictionary may be generated with reference to steps 202-208.
Step 202, obtaining page contents of a plurality of web pages.
Step 204, performing word segmentation on the page content of each web page to obtain corresponding word segmentation segments.
Step 206, sequentially combining the consecutive M word segmentation segments into an entry, and determining the word frequency and mutual information of each entry.
Step 208, generating a primary dictionary of the specified field by using the entries with the word frequency higher than the first word frequency threshold and the mutual information higher than the mutual information threshold.
The page content may include a text, a title, an abstract, and the like of the page, which is not limited in this embodiment of the present invention.
In the embodiment of the invention, the page content of each web page can first be divided into sentences, and the resulting sentences can then be segmented into words, which improves word segmentation efficiency. The page content may be divided at punctuation marks such as commas, periods, exclamation marks, and question marks, which is not limited in this embodiment of the present invention. Then, the frequency of each sentence is counted and the sentences are deduplicated; word segmentation is performed on each deduplicated sentence to obtain a plurality of word segmentation segments. Each deduplicated sentence may be segmented in various ways, for example by using a word segmentation tool, which is not limited in this embodiment of the present invention.
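As an illustration of the sentence division and frequency counting just described, the following sketch splits page content at a small set of punctuation marks and counts how often each distinct sentence occurs. The delimiter set, the function name `count_sentences`, and the sample pages are assumptions made for this example only; the word segmentation tool itself is not shown.

```python
import re
from collections import Counter

# Assumed delimiter set: ASCII and full-width commas, periods,
# exclamation marks and question marks.
SENTENCE_DELIMITERS = r"[,.!?，。！？]"


def count_sentences(pages):
    """Split each page into sentences and count every distinct sentence."""
    counts = Counter()
    for page in pages:
        for sentence in re.split(SENTENCE_DELIMITERS, page):
            sentence = sentence.strip()
            if sentence:
                counts[sentence] += 1
    return counts


# Hypothetical page contents.
pages = ["nice weather today. let us go out", "nice weather today! stay home"]
sentence_counts = count_sentences(pages)
```

The keys of `sentence_counts` are the deduplicated sentences, so each distinct sentence needs to be segmented only once while its frequency is kept for later word frequency statistics.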
In an example of the present invention, after the word segmentation segments corresponding to each sentence are obtained, every M consecutive word segmentation segments of the sentence may be sequentially combined to generate the entries corresponding to the sentence; correspondingly, each entry may include M word segmentation segments. In an example of the present invention, the value of M is first set to 1, that is, each word segmentation segment of the sentence is taken as an entry on its own. The value of M can then be increased from 1 to a set value according to a set step length, and each time M is increased, every M consecutive word segmentation segments are sequentially combined to generate entries corresponding to the sentence. The set step length may be set as required, for example to 1, and the set value may also be set as required, for example to 4, which is not limited in this embodiment of the present invention. M is a positive integer, and neither M nor the set value exceeds the total number of word segmentation segments of the sentence. For example, suppose the set value is 3 and the sentence "weather today is really good" is divided into three segments: "today", "weather", and "really good". When M is 1, the entries "today", "weather", and "really good" are obtained; when M is 2, the entries "weather today" and "weather is really good" are obtained; when M is 3, the entry "weather today is really good" is obtained.
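The combination of consecutive segments can be sketched as follows, assuming a step length of 1 and representing each entry as a tuple of its segments in their original order. The function name `build_entries` and the parameter `max_m` (standing for the set value) are hypothetical.

```python
def build_entries(segments, max_m):
    """Combine every run of M consecutive segments, for M = 1 .. max_m."""
    entries = []
    for m in range(1, min(max_m, len(segments)) + 1):
        for start in range(len(segments) - m + 1):
            entries.append(tuple(segments[start:start + m]))
    return entries


# The three segments of the example sentence, with a set value of 3.
segments = ["today", "weather", "really good"]
entries = build_entries(segments, max_m=3)
```

For the three segments of the example this yields three entries for M = 1, two for M = 2, and one for M = 3.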
In the embodiment of the invention, after a plurality of entries are determined based on the page contents of a plurality of web pages, the word frequency and mutual information of each entry can be determined; then, according to the word frequency and mutual information of each entry, some of the entries are selected to generate a primary dictionary.
In an example of the present invention, for each entry corresponding to each sentence, the word frequency of the entry in the sentence may be determined based on the frequency of the sentence; then, the word frequencies of each entry across all sentences are summed to obtain the word frequency of the entry, and the entries are deduplicated.
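One possible reading of this statistic is sketched below: each occurrence of an entry in a deduplicated sentence contributes the frequency of that sentence, and the contributions are summed over all sentences. The data layout, the function name `entry_frequencies`, and the sample values are assumptions for illustration.

```python
from collections import Counter


def entry_frequencies(sentence_entries, sentence_counts):
    """Sum entry occurrences over sentences, weighted by sentence frequency.

    `sentence_entries` maps a deduplicated sentence to the list of entries
    generated from it; `sentence_counts` maps the sentence to its frequency.
    """
    totals = Counter()
    for sentence, entries in sentence_entries.items():
        for entry in entries:
            totals[entry] += sentence_counts[sentence]
    return totals


# Hypothetical sentences "s1" (seen 3 times) and "s2" (seen 2 times).
sentence_counts = {"s1": 3, "s2": 2}
sentence_entries = {
    "s1": [("today",), ("weather",)],
    "s2": [("weather",), ("weather",)],
}
frequencies = entry_frequencies(sentence_entries, sentence_counts)
```

Because `Counter` keys are unique, the entries are deduplicated as a side effect of the summation.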
In the embodiment of the invention, the mutual information describes the degree of dependence between adjacent segments in the entry; the larger the mutual information, the more likely the entry is to form a word, and conversely, the smaller the mutual information, the less likely the entry is to form a word. In one example of the present invention, the M word segmentation segments in each entry may be divided into front M1 word segmentation segments and rear M2 word segmentation segments, where M = M1 + M2; the mutual information corresponding to the entry may then be calculated according to the following expression:
$$\mathrm{MI}(M1, M2) = \log \frac{\mathrm{prob}(M1, M2)}{\mathrm{prob}(M1)\,\mathrm{prob}(M2)}$$
where MI(M1, M2) is the mutual information of the front M1 word segmentation segments and the rear M2 word segmentation segments, prob(M1, M2) is the probability that the front M1 word segmentation segments and the rear M2 word segmentation segments are adjacent in the corpus, prob(M1) is the probability that the front M1 word segmentation segments occur in the corpus, and prob(M2) is the probability that the rear M2 word segmentation segments occur in the corpus.
In an example of the present invention, when M is greater than 2, M1 may be increased from 1 to (M-1) in turn, and for each value of M1 the M word segmentation segments in the entry are divided into front M1 word segmentation segments and rear M2 word segmentation segments and the corresponding mutual information is calculated. When M is greater than 2, each entry therefore corresponds to two or more mutual information values, and the average of these values can be used as the mutual information of the entry. Of course, the mutual information of the entry may also be derived from these values in other ways, which is not limited in this embodiment of the present invention.
Then, the entries whose word frequency is higher than the first word frequency threshold and whose mutual information is higher than the mutual information threshold are selected from the entries corresponding to the page contents of all the web pages, and a primary dictionary is generated from them. The first word frequency threshold and the mutual information threshold may be set as required, which is not limited in the embodiments of the present invention.
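The averaging over split points can be sketched as follows. The expression in the source is only available as a figure, so the sketch assumes the usual pointwise mutual information form, log(prob(M1, M2) / (prob(M1) · prob(M2))), with a natural logarithm; it also assumes that the probabilities have been estimated beforehand and are supplied in a lookup table. The function names and the probability values are hypothetical.

```python
import math


def mutual_information(p_joint, p_front, p_rear):
    """Assumed pointwise mutual information form (natural logarithm)."""
    return math.log(p_joint / (p_front * p_rear))


def entry_mutual_information(entry, prob):
    """Average the mutual information over every split M1 = 1 .. M-1.

    `entry` is a tuple of at least two segments; `prob` maps a tuple of
    segments to its estimated probability in the corpus. The probability
    that the front and rear parts are adjacent is taken to be the
    probability of the whole entry.
    """
    values = []
    for m1 in range(1, len(entry)):
        front, rear = entry[:m1], entry[m1:]
        values.append(mutual_information(prob[entry], prob[front], prob[rear]))
    return sum(values) / len(values)


# Made-up probabilities for a three-segment entry.
prob = {
    ("a", "b", "c"): 0.01,
    ("a",): 0.1,
    ("b", "c"): 0.05,
    ("a", "b"): 0.04,
    ("c",): 0.2,
}
mi = entry_mutual_information(("a", "b", "c"), prob)
```

With these values the two splits give log(2) and log(1.25), and the entry receives their average.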
In the embodiment of the present invention, if the dictionary of the designated field needs to be generated, the primary dictionary of the designated field may be generated according to the steps 202 to 208, and the primary dictionary of the whole field may be generated according to the steps 202 to 208. Then, aiming at each entry in the primary dictionary of the specified field, determining difference information of the entry in the primary dictionary of the specified field and the primary dictionary of the whole field according to the word frequency of the entry in the primary dictionary of the specified field and the word frequency of the entry in the primary dictionary of the whole field; and deleting the entries of which the difference information in the primary dictionary of the specified field is lower than the difference threshold value. The difference threshold may be set as required, which is not limited in this embodiment of the present invention. In one example of the present invention, the difference information of each entry in the primary dictionary of the specified domain and the primary dictionary of the full domain may be determined according to the following expression:
$$s = \frac{\mathrm{pv\_site}^{2}}{\mathrm{pv\_rand}}$$
where s is the difference information of the entry between the primary dictionary of the specified field and the primary dictionary of the whole field, pv_site is the word frequency of the entry in the primary dictionary of the full domain, and pv_rand is the word frequency of the entry in the primary dictionary of the specified domain.
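The deletion of low-difference entries can be sketched as below. The sketch reads the expression as the square of pv_site divided by pv_rand, which is an assumption about the flattened superscript in the source; pv_site and pv_rand are simply the two word frequencies named in the description. The function names, the sample frequencies, and the threshold are hypothetical, and the handling of a zero pv_rand is not specified in the description.

```python
def difference_score(pv_site, pv_rand):
    """Assumed reading of the expression: s = pv_site squared over pv_rand."""
    return pv_site ** 2 / pv_rand


def prune_by_difference(frequencies, threshold):
    """Keep entries whose difference score is not below the threshold.

    `frequencies` maps an entry to a (pv_site, pv_rand) pair. Entries with
    pv_rand == 0 are skipped here, which is a choice made for this sketch.
    """
    return {
        entry
        for entry, (pv_site, pv_rand) in frequencies.items()
        if pv_rand > 0 and difference_score(pv_site, pv_rand) >= threshold
    }


# Made-up word frequencies for two entries.
frequencies = {"checkmate": (50, 10), "the": (100, 10000)}
kept = prune_by_difference(frequencies, threshold=10)
```

Here the scores are 250 and 1, so only the first entry survives a threshold of 10.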
In one embodiment of the invention, after the primary dictionary is generated, it can be expanded to make the entries in the dictionary more comprehensive. In one example of the invention, the primary dictionary may be expanded as follows: based on the primary dictionary, word segmentation is performed on the page content of each web page using the maximum matching principle to obtain the corresponding word segmentation segments; the word frequency of each word segmentation segment that does not exist in the primary dictionary is determined, and the word segmentation segments that do not exist in the primary dictionary and whose word frequency is higher than a second word frequency threshold are added to the primary dictionary. That is, the page content of each web page is segmented again, this time against the generated primary dictionary. The second word frequency threshold may be set as required, which is not limited in this embodiment of the present invention.
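A minimal sketch of this expansion follows. The description names only the maximum matching principle; the sketch assumes forward maximum matching over characters, with a single character as the fallback segment when no dictionary entry matches. The function names, the parameter `max_len`, and the toy dictionary and texts are hypothetical.

```python
from collections import Counter


def forward_max_match(text, dictionary, max_len):
    """Greedy forward maximum matching; falls back to single characters."""
    segments = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                segments.append(candidate)
                i += length
                break
    return segments


def expand_dictionary(texts, dictionary, max_len, min_freq):
    """Add unknown segments whose frequency is higher than min_freq."""
    counts = Counter()
    for text in texts:
        for segment in forward_max_match(text, dictionary, max_len):
            if segment not in dictionary:
                counts[segment] += 1
    return dictionary | {seg for seg, count in counts.items() if count > min_freq}


# Toy primary dictionary and page contents.
primary = {"ab", "abc"}
expanded = expand_dictionary(["abcd", "abd", "dab"], primary, max_len=3, min_freq=2)
```

In this toy run the unknown segment "d" occurs three times, which exceeds the threshold of 2, so it is added to the dictionary.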
The closeness of each entry in the primary dictionary may then be determined in accordance with steps 210-218.
Step 210, obtaining user search behavior data corresponding to each entry in the primary dictionary.
In the embodiment of the invention, user searching behavior data corresponding to each entry in the primary dictionary can be obtained from the user searching behavior data in a set time period; the set time period may be set as required, for example, the last three months, and the embodiment of the present invention is not limited thereto. Then, according to the user search behavior data corresponding to the entry, the closeness corresponding to the entry is calculated, which may refer to steps 212 to 218.
Wherein the user search behavior data comprises data related to operations performed by a user during a search process; for example, the input search term, the search result clicked on the search result page corresponding to the search term, and the like, which are not limited in this embodiment of the present invention.
Step 212, dividing the M word segmentation segments in the entry into the front M1 word segmentation segments and the rear M2 word segmentation segments, wherein M, M1 and M2 are all positive integers, and M is the sum of M1 and M2.
Step 214, determining, according to the user search behavior data corresponding to the entry, target search terms in which the front M1 word segmentation segments are adjacent to the rear M2 word segmentation segments, and the target search results clicked on the search result pages of the target search terms.
Step 216, determining a first number of times that target search results whose titles have the front M1 word segmentation segments adjacent to the rear M2 word segmentation segments are clicked, and a second number of times that target search results whose titles have the front M1 word segmentation segments not adjacent to the rear M2 word segmentation segments are clicked.
Step 218, determining the closeness corresponding to the entry according to the first number of times and the second number of times.
In the embodiment of the present invention, each entry in the primary dictionary may be divided into the front M1 word segmentation segments and the rear M2 word segmentation segments in the manner described above, which is not repeated here. Then, target search terms that include both the front M1 word segmentation segments and the rear M2 word segmentation segments, with the two adjacent to each other, are looked up in the user search behavior data corresponding to the entry, together with the target search results clicked on the search result pages of those target search terms. Next, the target search results whose titles include both the front M1 word segmentation segments and the rear M2 word segmentation segments are selected from all the target search results. Among the selected results, a first number of times is determined, namely the number of clicks on target search results whose titles have the front M1 word segmentation segments adjacent to the rear M2 word segmentation segments, and a second number of times is determined, namely the number of clicks on target search results whose titles have the front M1 word segmentation segments not adjacent to the rear M2 word segmentation segments. The closeness corresponding to the entry is then determined according to the first number of times and the second number of times; the following expression may be referred to:
[Formula image BDA0002385786900000111: expression for the closeness T(M1, M2) in terms of the first number of times and the second number of times]
wherein T(M1, M2) is the closeness between the front M1 word segmentation segments and the rear M2 word segmentation segments; num(M1-M2) is the first number of times, that is, the number of clicks on target search results whose titles have the front M1 word segmentation segments adjacent to the rear M2 word segmentation segments; and the other count in the expression is the second number of times, that is, the number of clicks on target search results whose titles have the front M1 word segmentation segments not adjacent to the rear M2 word segmentation segments.
Correspondingly, when M is greater than 2, the entry can be divided at more than one position and may therefore correspond to two or more closeness values.
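A rough Python sketch of the counting in steps 212 to 218 is given below. Because the closeness expression itself is not reproduced in this text, the ratio used here, first / (first + second), is purely an assumed stand-in and should not be read as the formula of the invention. The input layout (each clicked title as a list of segments paired with a click count), the value 0.0 when no title qualifies, and the names `find_all`, `count_clicks` and `closeness_values` are likewise hypothetical.

```python
def find_all(seq, sub):
    """Start indices at which the list `sub` occurs contiguously in the list `seq`."""
    n, m = len(seq), len(sub)
    return [i for i in range(n - m + 1) if seq[i:i + m] == sub]


def count_clicks(front, rear, clicked_titles):
    """Return (first, second): clicks on titles with front/rear adjacent vs. not adjacent."""
    first = second = 0
    for title, clicks in clicked_titles:
        front_pos = find_all(title, front)
        rear_pos = find_all(title, rear)
        if not front_pos or not rear_pos:
            continue  # the title must contain both parts to be considered
        # Adjacent means the rear part starts right where the front part ends.
        if any(p + len(front) in rear_pos for p in front_pos):
            first += clicks
        else:
            second += clicks
    return first, second


def closeness_values(entry, clicked_titles):
    """One closeness value per split point of the entry (M - 1 values for M segments)."""
    values = []
    for k in range(1, len(entry)):
        first, second = count_clicks(entry[:k], entry[k:], clicked_titles)
        total = first + second
        # Assumed ratio; the original expression is not available in this text.
        values.append(first / total if total else 0.0)
    return values
```

For a three-segment entry the function returns two values, one for the split 1+2 and one for the split 2+1, which matches the remark that an entry with M greater than 2 may correspond to two or more closeness values.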
Then, the entries in the primary dictionary can be screened according to the closeness of each entry in the primary dictionary to generate a target dictionary; reference may be made to steps 220 to 222:
Step 220, for each entry in the primary dictionary, comparing the minimum closeness corresponding to the entry with a closeness threshold.
Step 222, removing the entries in the primary dictionary whose minimum closeness is smaller than the closeness threshold to obtain a target dictionary.
In the embodiment of the invention, the minimum closeness corresponding to each entry in the primary dictionary can be determined and then compared with a closeness threshold. The closeness threshold may be set as required, which is not limited in this embodiment of the present invention. If the minimum closeness of an entry is smaller than the closeness threshold, the entry is removed from the primary dictionary; if the minimum closeness of the entry is greater than or equal to the closeness threshold, the entry is kept in the primary dictionary. The target dictionary is finally obtained in this way.
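The screening in steps 220 and 222 amounts to a minimum-and-compare filter, which might look like the following sketch. The name `build_target_dictionary` and the mapping from each entry to its list of closeness values are assumptions for illustration; dropping entries that have no closeness value at all is also an assumption, since the text does not address that case.

```python
def build_target_dictionary(closeness_by_entry, closeness_threshold):
    """Keep entries whose minimum closeness is at least the threshold."""
    target = set()
    for entry, values in closeness_by_entry.items():
        if not values:
            continue  # assumption: entries without any closeness value are dropped
        if min(values) >= closeness_threshold:
            target.add(entry)
    return target
```

Using the minimum means that a single weak split point is enough to reject an entry, even if its other split points have high closeness.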
In an alternative embodiment of the present invention, the entries in the primary dictionary may be preliminarily filtered before step 220 is performed; this may include: deleting entries in the primary dictionary whose length exceeds a preset length; and/or deleting stop words in the primary dictionary. A stop word is a word that has no definite meaning on its own and only contributes meaning within a complete sentence, for example a modal particle, an adverb, a preposition or a conjunction, which is not limited in the embodiments of the present invention. Of course, the same filtering may instead be applied after the entries whose minimum closeness is smaller than the closeness threshold have been removed in step 222: deleting entries in the primary dictionary whose length exceeds the preset length; and/or deleting stop words in the primary dictionary, to obtain the target dictionary.
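The optional length and stop-word filtering could be sketched as below. Measuring length in characters and supplying the stop words as a plain set are assumptions of this sketch, and `prefilter_entries` is a hypothetical name; either check can be disabled by passing `None`, mirroring the "and/or" in the text.

```python
def prefilter_entries(entries, preset_length=None, stop_words=None):
    """Drop entries longer than `preset_length` and/or entries that are stop words."""
    kept = []
    for entry in entries:
        if preset_length is not None and len(entry) > preset_length:
            continue  # length exceeds the preset length
        if stop_words is not None and entry in stop_words:
            continue  # entry is a stop word
        kept.append(entry)
    return kept
```

The same function can be applied either before step 220 or after step 222, since it does not depend on the closeness values.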
In summary, in the embodiments of the present invention, a primary dictionary may be generated, and the closeness corresponding to each entry in the primary dictionary may be determined; then, for each entry in the primary dictionary, the minimum closeness corresponding to the entry is compared with a closeness threshold, and the entries whose minimum closeness is smaller than the closeness threshold are removed from the primary dictionary to obtain a target dictionary. Entries with high closeness between their word segmentation segments are thus retained and entries with low closeness between their word segmentation segments are removed, which effectively prevents wrong or irregular entries from being introduced into the dictionary and improves the quality of the dictionary.
Secondly, in the embodiment of the invention, entries in the primary dictionary whose length exceeds the preset length can be deleted, and/or stop words in the primary dictionary can be deleted; wrong or irregular entries are thereby removed from the primary dictionary and prevented from being introduced into the dictionary, which further improves the quality of the dictionary.
Further, in the embodiment of the present invention, for each entry in the primary dictionary, user search behavior data corresponding to the entry may be acquired, and the closeness corresponding to the entry may then be calculated according to that data. Determining the closeness from user search behavior data makes it more accurate, so that wrong or irregular entries are more effectively kept out of the dictionary and the quality of the dictionary is further improved.
Thirdly, in the embodiment of the present invention, after a primary dictionary is generated, based on the primary dictionary, a maximum matching principle is adopted to perform word segmentation processing on the page content of each web page to obtain corresponding word segmentation segments, word frequency corresponding to the word segmentation segments that do not exist in the primary dictionary is determined, and the word segmentation segments that do not exist in the primary dictionary and have a word frequency higher than a second word frequency threshold are added to the primary dictionary; and further, the primary dictionary is effectively expanded, and the comprehensiveness of the entries in the primary dictionary is increased.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 3, a block diagram of an embodiment of an information generating apparatus according to the present invention is shown, and may specifically include the following modules:
a primary dictionary generating module 302 for generating a primary dictionary;
a closeness generating module 304, configured to determine the closeness corresponding to each entry in the primary dictionary;
a target dictionary generating module 306, configured to screen the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary.
Referring to fig. 4, a block diagram of an alternative embodiment of an information generating apparatus of the present invention is shown.
In an alternative embodiment of the present invention, the closeness generating module 304 includes:
a data obtaining sub-module 3042, configured to, for each entry in the primary dictionary, obtain user search behavior data corresponding to the entry;
a closeness calculation sub-module 3044, configured to calculate the closeness corresponding to the entry according to the user search behavior data corresponding to the entry.
In an optional embodiment of the present invention, the entry includes M word segmentation segments, and the closeness calculation sub-module 3044 is configured to divide the M word segmentation segments in the entry into the front M1 word segmentation segments and the rear M2 word segmentation segments, where M, M1 and M2 are all positive integers, and M is the sum of M1 and M2; determine, according to the user search behavior data corresponding to the entry, target search terms in which the front M1 word segmentation segments are adjacent to the rear M2 word segmentation segments, and the target search results clicked on the search result pages of the target search terms; determine a first number of times that target search results whose titles have the front M1 word segmentation segments adjacent to the rear M2 word segmentation segments are clicked, and a second number of times that target search results whose titles have the front M1 word segmentation segments not adjacent to the rear M2 word segmentation segments are clicked; and determine the closeness corresponding to the entry according to the first number of times and the second number of times.
In an alternative embodiment of the present invention, the entry comprises at least one closeness; the target dictionary generating module 306 includes:
a comparison submodule 3062 for comparing, for each entry in the primary dictionary, the minimum closeness corresponding to the entry with a closeness threshold;
a filtering submodule 3064, configured to remove the entries in the primary dictionary whose minimum closeness is smaller than the closeness threshold, so as to obtain the target dictionary.
In an optional embodiment of the present invention, the target dictionary generating module 306 further includes:
a deleting submodule 3066 for deleting the entries in the primary dictionary whose length exceeds the preset length; and/or deleting stop words in the primary dictionary.
In an optional embodiment of the present invention, the primary dictionary generating module 302 is specifically configured to obtain page contents of multiple web pages, perform word segmentation on the page contents of each web page, and obtain corresponding word segmentation segments; sequentially combining the continuous M word segmentation segments into a vocabulary entry, and determining the word frequency and mutual information of each vocabulary entry; and generating a primary dictionary by using the entries with the word frequency higher than the first word frequency threshold value and the mutual information higher than the mutual information threshold value.
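A minimal sketch of what the primary dictionary generating module computes is shown below, under several assumptions that go beyond the text: M is fixed at 2, mutual information is taken to be pointwise mutual information with the natural logarithm, word frequency is the raw count of the two-segment combination, and the pages are assumed to be already segmented. The name `build_primary_dictionary` is hypothetical.

```python
import math
from collections import Counter


def build_primary_dictionary(segmented_pages, first_freq_threshold, mi_threshold):
    """Combine consecutive segment pairs into entries and keep frequent, cohesive ones.

    Returns a mapping from entry (a pair of segments) to (word frequency, PMI).
    """
    unigrams = Counter()
    bigrams = Counter()
    for segments in segmented_pages:
        unigrams.update(segments)
        bigrams.update(zip(segments, segments[1:]))  # consecutive pairs (M = 2)

    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    primary = {}
    for (left, right), freq in bigrams.items():
        if freq <= first_freq_threshold:
            continue  # word frequency must be higher than the first threshold
        p_pair = freq / total_bi
        p_left = unigrams[left] / total_uni
        p_right = unigrams[right] / total_uni
        pmi = math.log(p_pair / (p_left * p_right))  # assumed form of mutual information
        if pmi > mi_threshold:
            primary[(left, right)] = (freq, pmi)
    return primary
```

Requiring both thresholds keeps combinations that occur often and whose segments co-occur more than chance would predict; either condition alone would admit frequent but loosely bound pairs or rare coincidences.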
In an optional embodiment of the present invention, the apparatus further comprises: the primary dictionary expansion module 308 is configured to perform word segmentation processing on the page content of each web page by using a maximum matching principle based on the primary dictionary to obtain corresponding word segmentation segments; determining word frequency corresponding to word segmentation segments which do not exist in the primary dictionary, and adding word segmentation segments which do not exist in the primary dictionary and have word frequency higher than a second word frequency threshold value into the primary dictionary.
In summary, in the embodiments of the present invention, a primary dictionary may be generated, and the closeness corresponding to each entry in the primary dictionary may be determined; then, the entries in the primary dictionary are screened according to the closeness of each entry in the primary dictionary to generate a target dictionary. Screening entries by closeness avoids introducing wrong or irregular entries into the dictionary and thereby improves the quality of the dictionary.
Since the device embodiment is basically similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.
Fig. 5 is a block diagram illustrating a structure of an electronic device 500 for information generation according to an example embodiment. For example, the electronic device 500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, electronic device 500 may include one or more of the following components: a processing component 502, a memory 504, a power component 506, a multimedia component 508, an audio component 510, an input/output (I/O) interface 512, a sensor component 514, and a communication component 516.
The processing component 502 generally controls overall operation of the electronic device 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 502 may include one or more processors 520 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 502 can include one or more modules that facilitate interaction between the processing component 502 and other components. For example, the processing component 502 can include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.
The memory 504 is configured to store various types of data to support operation of the electronic device 500. Examples of such data include instructions for any application or method operating on the electronic device 500, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 504 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 506 provides power to the various components of the electronic device 500. Power components 506 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic device 500.
The multimedia component 508 includes a screen that provides an output interface between the electronic device 500 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 508 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 500 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 510 is configured to output and/or input audio signals. For example, the audio component 510 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 500 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 504 or transmitted via the communication component 516. In some embodiments, audio component 510 further includes a speaker for outputting audio signals.
The I/O interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 514 includes one or more sensors for providing various aspects of status assessment for the electronic device 500. For example, the sensor assembly 514 may detect an open/closed state of the electronic device 500 and the relative positioning of components, such as a display and keypad of the electronic device 500; the sensor assembly 514 may also detect a change in the position of the electronic device 500 or a component of the electronic device 500, the presence or absence of user contact with the electronic device 500, orientation or acceleration/deceleration of the electronic device 500, and a change in the temperature of the electronic device 500. The sensor assembly 514 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 516 is configured to facilitate wired or wireless communication between the electronic device 500 and other devices. The electronic device 500 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 516 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 516 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 504 comprising instructions, executable by the processor 520 of the electronic device 500 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of information generation, the method comprising: generating a primary dictionary; determining the closeness corresponding to each entry in the primary dictionary; and screening the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary.
Optionally, the determining the closeness corresponding to each entry in the primary dictionary includes: aiming at each entry in the primary dictionary, acquiring user search behavior data corresponding to the entry; and calculating the closeness corresponding to the entries according to the user searching behavior data corresponding to the entries.
Optionally, the entry includes M word segmentation segments, and the calculating the closeness corresponding to the entry according to the user search behavior data corresponding to the entry includes: dividing the M word segmentation segments in the entry into the front M1 word segmentation segments and the rear M2 word segmentation segments, wherein M, M1 and M2 are all positive integers, and M is the sum of M1 and M2; determining, according to the user search behavior data corresponding to the entry, target search terms in which the front M1 word segmentation segments are adjacent to the rear M2 word segmentation segments, and the target search results clicked on the search result pages of the target search terms; determining a first number of times that target search results whose titles have the front M1 word segmentation segments adjacent to the rear M2 word segmentation segments are clicked, and a second number of times that target search results whose titles have the front M1 word segmentation segments not adjacent to the rear M2 word segmentation segments are clicked; and determining the closeness corresponding to the entry according to the first number of times and the second number of times.
Optionally, the entry comprises at least one closeness; the screening of the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary comprises: for each entry in the primary dictionary, comparing the minimum closeness corresponding to the entry with a closeness threshold; and screening out the entries with the minimum closeness smaller than a closeness threshold value in the primary dictionary to obtain a target dictionary.
Optionally, the filtering the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary, further includes: deleting entries with the length exceeding the preset length in the primary dictionary; and/or deleting stop words in the primary dictionary.
Optionally, the generating a primary dictionary comprises: acquiring page contents of a plurality of web pages; performing word segmentation processing on the page content of each webpage to obtain corresponding word segmentation segments; sequentially combining the continuous M word segmentation segments into a vocabulary entry, and determining the word frequency and mutual information of each vocabulary entry; and generating a primary dictionary by using the entries with the word frequency higher than the first word frequency threshold value and the mutual information higher than the mutual information threshold value.
Optionally, the method further comprises: based on the primary dictionary, performing word segmentation processing on the page content of each webpage by adopting a maximum matching principle to obtain corresponding word segmentation segments; determining word frequency corresponding to word segmentation segments which do not exist in the primary dictionary, and adding word segmentation segments which do not exist in the primary dictionary and have word frequency higher than a second word frequency threshold value into the primary dictionary.
Fig. 6 is a schematic structural diagram of an electronic device 600 for information generation according to another exemplary embodiment of the present invention. The electronic device 600 may be a server, which may vary greatly due to different configurations or capabilities, and may include one or more Central Processing Units (CPUs) 622 (e.g., one or more processors) and memory 632, one or more storage media 630 (e.g., one or more mass storage devices) storing applications 642 or data 644. Memory 632 and storage medium 630 may be, among other things, transient or persistent storage. The program stored in the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 622 may be configured to communicate with the storage medium 630 to execute a series of instruction operations in the storage medium 630 on the server.
The server may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input-output interfaces 658, one or more keyboards 656, and/or one or more operating systems 641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: generating a primary dictionary; determining the closeness corresponding to each entry in the primary dictionary; and screening the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary.
Optionally, the determining the closeness corresponding to each entry in the primary dictionary includes: aiming at each entry in the primary dictionary, acquiring user search behavior data corresponding to the entry; and calculating the closeness corresponding to the entries according to the user searching behavior data corresponding to the entries.
Optionally, the entry includes M word segmentation segments, and the calculating the closeness corresponding to the entry according to the user search behavior data corresponding to the entry includes: dividing the M word segmentation segments in the entry into the front M1 word segmentation segments and the rear M2 word segmentation segments, wherein M, M1 and M2 are all positive integers, and M is the sum of M1 and M2; determining, according to the user search behavior data corresponding to the entry, target search terms in which the front M1 word segmentation segments are adjacent to the rear M2 word segmentation segments, and the target search results clicked on the search result pages of the target search terms; determining a first number of times that target search results whose titles have the front M1 word segmentation segments adjacent to the rear M2 word segmentation segments are clicked, and a second number of times that target search results whose titles have the front M1 word segmentation segments not adjacent to the rear M2 word segmentation segments are clicked; and determining the closeness corresponding to the entry according to the first number of times and the second number of times.
Optionally, the entry comprises at least one closeness; the screening of the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary comprises: for each entry in the primary dictionary, comparing the minimum closeness corresponding to the entry with a closeness threshold; and screening out the entries with the minimum closeness smaller than a closeness threshold value in the primary dictionary to obtain a target dictionary.
Optionally, the filtering the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary, further includes: deleting entries with the length exceeding the preset length in the primary dictionary; and/or deleting stop words in the primary dictionary.
Optionally, the generating a primary dictionary comprises: acquiring page contents of a plurality of web pages; performing word segmentation processing on the page content of each webpage to obtain corresponding word segmentation segments; sequentially combining the continuous M word segmentation segments into a vocabulary entry, and determining the word frequency and mutual information of each vocabulary entry; and generating a primary dictionary by using the entries with the word frequency higher than the first word frequency threshold value and the mutual information higher than the mutual information threshold value.
Optionally, further comprising instructions for: based on the primary dictionary, performing word segmentation processing on the page content of each webpage by adopting a maximum matching principle to obtain corresponding word segmentation segments; determining word frequency corresponding to word segmentation segments which do not exist in the primary dictionary, and adding word segmentation segments which do not exist in the primary dictionary and have word frequency higher than a second word frequency threshold value into the primary dictionary.
The embodiments in the present specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The information generating method, the information generating apparatus, and the electronic device provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method and the core idea of the present invention. Meanwhile, a person skilled in the art may, according to the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as a limitation to the present invention.

Claims (10)

1. An information generating method, comprising:
generating a primary dictionary;
determining the closeness corresponding to each entry in the primary dictionary;
and screening the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary.
2. The method of claim 1, wherein said determining the closeness of each entry in the primary dictionary comprises:
for each entry in the primary dictionary, acquiring user search behavior data corresponding to the entry;
and calculating the closeness corresponding to the entry according to the user search behavior data corresponding to the entry.
3. The method of claim 2, wherein the entry comprises M word segmentation segments, and the calculating the closeness corresponding to the entry according to the user search behavior data corresponding to the entry comprises:
dividing the M word segmentation segments in the entry into the front M1 word segmentation segments and the rear M2 word segmentation segments, wherein M, M1 and M2 are positive integers, and M is the sum of M1 and M2;
determining, according to the user search behavior data corresponding to the entry, target search terms adjacent to the front M1 word segmentation segments and the rear M2 word segmentation segments, and target search results clicked on the search result pages of the target search terms;
determining a first number of times that target search results with the front M1 word segmentation segments and the rear M2 word segmentation segments in the title are clicked, and a second number of times that target search results with the front M1 word segmentation segments and the rear M2 word segmentation segments in the title are not clicked;
and determining the closeness corresponding to the entry according to the first number of times and the second number of times.
4. The method of claim 3, wherein the entry comprises at least one closeness; the screening of the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary comprises:
for each entry in the primary dictionary, comparing the minimum closeness corresponding to the entry with a closeness threshold;
and removing from the primary dictionary the entries whose minimum closeness is smaller than the closeness threshold to obtain the target dictionary.
5. The method of claim 4, wherein the screening of the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary further comprises:
deleting entries in the primary dictionary whose length exceeds a preset length; and/or
deleting stop words in the primary dictionary.
6. The method of claim 1, wherein generating a primary dictionary comprises:
acquiring page contents of a plurality of web pages;
performing word segmentation processing on the page content of each webpage to obtain corresponding word segmentation segments;
sequentially combining every M consecutive word segmentation segments into an entry, and determining the word frequency and mutual information of each entry;
and generating the primary dictionary from the entries whose word frequency is higher than a first word frequency threshold and whose mutual information is higher than a mutual information threshold.
7. The method of claim 6, further comprising:
based on the primary dictionary, performing word segmentation processing on the page content of each webpage by adopting a maximum matching principle to obtain corresponding word segmentation segments;
determining word frequency corresponding to word segmentation segments which do not exist in the primary dictionary, and adding word segmentation segments which do not exist in the primary dictionary and have word frequency higher than a second word frequency threshold value into the primary dictionary.
8. An information generating apparatus, characterized by comprising:
the primary dictionary generating module is used for generating a primary dictionary;
the closeness generating module is used for determining the closeness corresponding to each entry in the primary dictionary;
and the target dictionary generating module is used for screening the entries in the primary dictionary according to the closeness of the entries in the primary dictionary to generate a target dictionary.
9. An electronic device comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
generating a primary dictionary;
determining the closeness corresponding to each entry in the primary dictionary;
and screening the entries in the primary dictionary according to the closeness of each entry in the primary dictionary to generate a target dictionary.
10. A readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the information generating method according to any one of claims 1-7.
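
The closeness-based screening recited in claims 3 and 4 can be sketched as follows. The claims state only that closeness is determined from the first number of times and the second number of times, without giving a formula, so the ratio used here is an assumption made purely for illustration. The sketch also assumes one (first, second) pair per way of dividing an entry into front and rear segments, and the names `closeness`, `screen_entries`, and `click_stats` are hypothetical.

```python
def closeness(first_count, second_count):
    """Closeness from the first and second numbers of times.

    ASSUMPTION: closeness = first / (first + second); the claims do not
    specify the formula. Returns 0.0 when no data is available.
    """
    total = first_count + second_count
    return first_count / total if total else 0.0


def screen_entries(click_stats, threshold):
    """Keep entries whose minimum closeness reaches the threshold.

    click_stats: {entry: [(first_count, second_count), ...]}, with one
    pair per front/rear division of the entry (an assumed data layout).
    """
    target = []
    for entry, split_counts in click_stats.items():
        values = [closeness(first, second) for first, second in split_counts]
        # Entries with no closeness data are dropped in this sketch.
        if values and min(values) >= threshold:
            target.append(entry)
    return target
```

Comparing the minimum value against the threshold means that a multi-segment entry is removed as soon as any one of its divisions is loosely bound.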
CN202010097793.9A 2020-02-17 2020-02-17 Information generation method and device and electronic equipment Pending CN113268978A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010097793.9A CN113268978A (en) 2020-02-17 2020-02-17 Information generation method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113268978A true CN113268978A (en) 2021-08-17

Family

ID=77227557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010097793.9A Pending CN113268978A (en) 2020-02-17 2020-02-17 Information generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113268978A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484377A (en) * 2014-12-09 2015-04-01 百度在线网络技术(北京)有限公司 Generating method and device of substitute dictionaries
WO2015149533A1 (en) * 2014-03-31 2015-10-08 北京奇虎科技有限公司 Method and device for word segmentation processing on basis of webpage content classification
CN105159884A (en) * 2015-09-23 2015-12-16 百度在线网络技术(北京)有限公司 Method and device for establishing industry dictionary and industry identification method and device
CN105389349A (en) * 2015-10-27 2016-03-09 上海智臻智能网络科技股份有限公司 Dictionary updating method and apparatus
CN105677664A (en) * 2014-11-19 2016-06-15 腾讯科技(深圳)有限公司 Compactness determination method and device based on web search
CN108509425A (en) * 2018-04-10 2018-09-07 中国人民解放军陆军工程大学 A kind of Chinese new word discovery method based on novel degree
CN109710947A (en) * 2019-01-22 2019-05-03 福建亿榕信息技术有限公司 Power specialty word stock generating method and device
WO2019113938A1 (en) * 2017-12-15 2019-06-20 华为技术有限公司 Data annotation method and apparatus, and storage medium

Similar Documents

Publication Publication Date Title
CN110232137B (en) Data processing method and device and electronic equipment
EP3173948A1 (en) Method and apparatus for recommendation of reference documents
CN107621886B (en) Input recommendation method and device and electronic equipment
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN109815396B (en) Search term weight determination method and device
CN107291772B (en) Search access method and device and electronic equipment
CN111651586B (en) Rule template generation method, rule template generation device and rule template generation medium for text classification
CN107424612B (en) Processing method, apparatus and machine-readable medium
CN109725736B (en) Candidate sorting method and device and electronic equipment
CN113987128A (en) Related article searching method and device, electronic equipment and storage medium
CN111739535A (en) Voice recognition method and device and electronic equipment
CN109918624B (en) Method and device for calculating similarity of webpage texts
CN113033163A (en) Data processing method and device and electronic equipment
CN111324214B (en) Statement error correction method and device
CN109887492B (en) Data processing method and device and electronic equipment
CN112329480A (en) Area adjustment method and device and electronic equipment
CN110780749B (en) Character string error correction method and device
CN109992790B (en) Data processing method and device for data processing
CN108108356B (en) Character translation method, device and equipment
CN108073294B (en) Intelligent word forming method and device for intelligent word forming
CN113268978A (en) Information generation method and device and electronic equipment
CN112987941B (en) Method and device for generating candidate words
CN108073566B (en) Word segmentation method and device and word segmentation device
CN108345590B (en) Translation method, translation device, electronic equipment and storage medium
CN110110292B (en) Data processing method and device for data processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination